Updated 2026-05-31 · 59 models tracked

AI Models Leaderboard

Compare 59 large language models on benchmarks, pricing, context window, and throughput. Sort and filter the table, see the price-vs-performance frontier at a glance, and use the cost calculator to estimate what each model will actually cost you in production.

Model Vendor Score Ctx In $/M Out $/M MMLU-Pro GPQA SWE-bench

Benchmark Leaders

Price vs Performance

X: blended $/M (log). Y: composite benchmark score. Pareto-optimal models are highlighted.
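
As a rough illustration of what "Pareto-optimal" means on this chart, the sketch below keeps only models that no other model beats on both price and score. The `Model` type, the function name, and the sample numbers are hypothetical and are not taken from the leaderboard data.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    price: float  # blended $/M tokens, lower is better
    score: float  # composite benchmark score, higher is better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Return the models that are not dominated on (price, score)."""
    frontier: list[Model] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a price tie, see the
    # higher-scoring model first so it shadows the weaker one.
    for m in sorted(models, key=lambda m: (m.price, -m.score)):
        # A model stays only if it beats every cheaper-or-equal model.
        # Exact duplicates (same price and score) keep the first one seen.
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


# Hypothetical data points, for illustration only.
sample = [
    Model("a", 1.0, 60.0),
    Model("b", 2.0, 55.0),   # dominated by "a": pricier and weaker
    Model("c", 3.0, 80.0),
    Model("d", 3.0, 70.0),   # dominated by "c": same price, weaker
    Model("e", 10.0, 80.0),  # dominated by "c": same score, pricier
]
frontier_names = [m.name for m in pareto_frontier(sample)]
```

Anything off the frontier has at least one alternative that is no more expensive and scores at least as well, which is why the highlighted points are the natural shortlist.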

Cost Calculator

Track real spend per provider with Helicone or Langfuse. Need access to every model under one bill? OpenRouter routes to all of them.

How to read this leaderboard

The composite score is the simple mean of every published benchmark for a model. We do not normalise across benchmarks because they are already roughly comparable in their published units (most are 0-100 accuracy). Where a vendor has not reported a benchmark, the cell is empty and the composite excludes it. A model with fewer reported benchmarks can therefore still be ranked against one with more, but the two composites average over different benchmark sets, so close or tied scores should be read carefully.
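
A minimal sketch of the composite as described above: an unweighted mean over whichever benchmarks are reported, with missing cells skipped. The function name and the sample scores are hypothetical.

```python
from statistics import mean
from typing import Optional


def composite_score(benchmarks: dict[str, Optional[float]]) -> Optional[float]:
    """Mean of the reported benchmarks; None marks an empty cell."""
    reported = [v for v in benchmarks.values() if v is not None]
    # No reported benchmarks means no composite at all.
    return mean(reported) if reported else None


# Hypothetical scores: same MMLU-Pro and GPQA, but only one model
# reports SWE-bench.
partial = composite_score({"MMLU-Pro": 80.0, "GPQA": 70.0, "SWE-bench": None})
full = composite_score({"MMLU-Pro": 80.0, "GPQA": 70.0, "SWE-bench": 45.0})
# partial is 75.0 and full is 65.0
```

The two models agree on every benchmark they share, yet the one that omits its weakest result ends up ten points ahead. That is the effect to keep in mind when two composites are close.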

The blended price column is a 3:1 input-to-output weighted average, which approximates the cost shape of most production agent workloads (heavy input from context and history, smaller output per call). Prompt caching changes this calculus substantially — Anthropic and OpenAI both ship aggressive caching that can drop effective price by 60-80% on repeated-context workloads. If your application is conversational or RAG-heavy, the cached-read price is the one you should optimise against, not the headline input price.
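
The arithmetic can be sketched as follows. The 3:1 weighting comes from the text above; the cache model is a simplifying assumption (a fixed fraction of input tokens billed at the cached-read rate, with any cache-write surcharge ignored), and all prices are hypothetical.

```python
def blended_price(input_price: float, output_price: float) -> float:
    """3:1 input-to-output weighted average, in $/M tokens."""
    return (3 * input_price + output_price) / 4


def effective_input_price(
    input_price: float, cached_read_price: float, cache_hit_rate: float
) -> float:
    """Average input price when a share of input tokens hits the cache.

    Assumption: cache_hit_rate is the fraction of input tokens billed
    at the cached-read rate; cache-write surcharges are ignored.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return cache_hit_rate * cached_read_price + (1 - cache_hit_rate) * input_price


# Hypothetical prices: $3/M input, $15/M output, $0.30/M cached read.
headline = blended_price(3.0, 15.0)  # (3 * 3 + 15) / 4 = 6.0
cached_input = effective_input_price(3.0, 0.30, cache_hit_rate=0.8)
with_cache = blended_price(cached_input, 15.0)
```

With these made-up numbers an 80% hit rate brings the effective input price to roughly $0.84/M, so the output price becomes the larger share of the blended figure.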

Which benchmark actually matters

MMLU-Pro measures broad knowledge across 14 disciplines. Useful as a general capability indicator; weak as a predictor of agent performance.

GPQA Diamond measures graduate-level science reasoning on questions verified to be hard for non-experts. A reliable indicator of how a model handles novel reasoning rather than retrieved facts.

HumanEval is the original "synthesise a Python function from a docstring" benchmark. Saturated for frontier models; treat scores >85% as a baseline competency check, not a differentiator.

SWE-bench Verified measures end-to-end issue resolution on real GitHub repositories — closing tickets, not solving toy problems. The single most predictive benchmark for AI coding tool performance in our experience.

MATH-500 measures competition mathematics. Saturated for reasoning-focused models (o3, R1), but still a strong indicator of chain-of-thought capability for models below that tier.

MMMU measures multimodal reasoning across images and text. Only matters if your application uses vision.

Aider Polyglot measures multi-language code editing across C++, Go, Java, JavaScript, Python, and Rust. Practical and harder to game than HumanEval.

tau-bench measures tool-use in conversational agent settings (retail and airline scenarios). The best published proxy for "how good is this model as an agent" until a better benchmark replaces it.

Picking a model for production

Three rules of thumb from the past 12 months of testing most models on this list in production-shaped workloads:

1. Start cheaper, upgrade only on observed regression. Most agent flows do not need a frontier model. We default to Claude Sonnet 4.6 or DeepSeek V3.1 and only move to Opus / GPT-5 / o3 when we see a quality cliff. The cost difference compounds fast at production volume.

2. Optimise for cached-read price, not headline price. If your workload has any repeated context (system prompts, conversation history, RAG corpora), the cached-read price column dominates monthly spend. Anthropic and OpenAI both offer caching; Google's flash variants have implicit caching as well.

3. Benchmark your own task before committing. Run the cheapest plausible candidate on 30 examples of your real task. The leaderboard tells you which models to consider; only your eval set tells you which model to ship.
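
Rules 1 and 3 combine into a simple loop: try candidates from cheapest to most expensive on your own examples and stop at the first one that clears your quality bar. The sketch below assumes exact-match grading and uses stub callables in place of real API calls; every name, price, and threshold in it is hypothetical.

```python
from typing import Callable, Optional, Sequence

Example = tuple[str, str]  # (prompt, expected answer)
Candidate = tuple[str, float, Callable[[str], str]]  # (name, blended $/M, model)


def pass_rate(model: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Fraction of examples answered exactly right (a simplistic grader)."""
    if not examples:
        raise ValueError("need at least one example")
    passed = sum(1 for prompt, expected in examples if model(prompt).strip() == expected)
    return passed / len(examples)


def cheapest_passing(
    candidates: Sequence[Candidate], examples: Sequence[Example], threshold: float
) -> Optional[str]:
    """Name of the cheapest candidate whose pass rate meets the threshold."""
    for name, _price, model in sorted(candidates, key=lambda c: c[1]):
        if pass_rate(model, examples) >= threshold:
            return name
    return None  # nothing cleared the bar; revisit the task or the threshold


# Stub models standing in for real provider calls.
answers = {"2+2": "4", "3+3": "6"}
cheap_model = lambda prompt: "4"                  # right only some of the time
strong_model = lambda prompt: answers[prompt]     # always right on this set

eval_set = [("2+2", "4"), ("3+3", "6")]
candidates = [("strong", 10.0, strong_model), ("cheap", 1.0, cheap_model)]
strict_choice = cheapest_passing(candidates, eval_set, threshold=0.9)
lenient_choice = cheapest_passing(candidates, eval_set, threshold=0.5)
```

In practice the stub callables would wrap your provider client and the grader would match your task (rubric scoring, unit tests, or a judge model) rather than string equality.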

Methodology and sources

Benchmark numbers are aggregated from vendor model cards, the LMSYS Chatbot Arena leaderboard, Artificial Analysis, the HuggingFace Open LLM Leaderboard, the Aider leaderboard, and SWE-bench Verified. Pricing is taken from each provider's first-party API on 2026-05-31; open-weights pricing reflects the typical OpenRouter / Together / Fireworks rate. We update the data monthly and bump the timestamp in this page's header on every refresh.

For the long-form analysis behind these numbers — which models we use day-to-day, which we ruled out, and how the landscape shifted in the last six months — read the companion post LLM Benchmark Comparison 2026.
