LLM Benchmark Comparison 2026 — Pricing, Performance, and Pareto Winners

Q: Why is HumanEval saturated?

Every frontier model now scores above 90 and the remaining variation is below the noise floor of the benchmark itself.

Q: Why aren't legacy models like Claude 3.5 Sonnet on the list?

We drop models superseded by a newer same-vendor model in the same tier. Claude 3.5 Sonnet was replaced by Claude Sonnet 4.6.

Parvez Ahmed

May 21, 2026 · Updated Jun 12, 2026

The interactive Leaderboard on this site tracks 50 large language models across pricing, benchmarks, context windows, and throughput. This post is the long-form analysis sitting on top of the same dataset — which numbers actually matter, where the field has shifted in the past six months, and which models we recommend for which workloads. If you only need the table, the leaderboard is faster. If you want to know what the table is telling you, read on.

The 2026 landscape in one paragraph

The frontier is now genuinely tight. Claude Opus 4.7, GPT-5, Grok 4, and Gemini 2.5 Pro are within margin-of-error on most published benchmarks, and on real production-shaped tasks the choice mostly comes down to ergonomics and price rather than raw capability. Reasoning models — o3, o4-mini, DeepSeek R1 — own the math and graduate-reasoning benchmarks but cost more per token and run slower. Open weights have closed most of the visible gap: DeepSeek V3.1, Qwen3 235B, Llama 4 Maverick, and Kimi K2 are now credible alternatives to the closed frontier for any team willing to host (or pay a hosting provider). The cheap end has also shifted: Gemini 2.5 Flash-Lite, GPT-5 nano, and Qwen3 7B now sit at 5-10x cheaper than the cheap-frontier models of a year ago while losing far less than 5-10x in capability.

Which benchmark actually matters

The number of benchmarks reported on a model card is now part of the marketing. Most of them are saturated, gameable, or not predictive of anything you care about. The list below is what we actually look at when we choose a model, in priority order.

SWE-bench Verified is the most predictive single benchmark for AI coding tool performance. It measures whether the model can resolve real GitHub issues end-to-end, not whether it can write a function from a docstring. Anything above 70 is frontier; anything above 50 is shippable for most AI-coding workloads; anything below 35 will frustrate you on real code.

GPQA Diamond is the best public proxy for novel reasoning rather than retrieved knowledge. The questions are verified hard for non-experts and resistant to web-scrape contamination. Reasoning-focused models (o3, R1) score significantly higher here than non-reasoning peers, and the gap is informative — it tells you which models are doing chain-of-thought versus pattern-matching.

Aider Polyglot measures multi-language code editing — the same thing AI coding tools do all day. Harder to game than HumanEval, and the polyglot constraint (Python, Go, Rust, TS, JS, C++) means a model has to actually understand multiple languages rather than memorise Python idioms.

tau-bench is the best published agent-tools benchmark. It measures whether a model can use tools correctly in a multi-turn agentic conversation. Imperfect, but the only public benchmark of its kind that has not been completely gamed yet.

MMMU matters only if you use vision. For text-only agents, ignore it.

MMLU-Pro is fine as a sanity check but no longer differentiates frontier models — everyone is above 80, and the remaining variation is mostly noise.

HumanEval is saturated. Treat anything above 85 as baseline competency, not as a differentiator.

MATH-500 is saturated for reasoning models. Useful as a check on whether a model has chain-of-thought as a capability; less useful as a fine-grained comparator.

A model that scores 88 on MMLU-Pro and 35 on SWE-bench Verified is not a good agent model, regardless of marketing. A model that scores 80 on MMLU-Pro and 70 on SWE-bench Verified is.

The 5 cheapest models that do not suck

Sorted by blended $/M tokens (3:1 input:output), filtered to models with composite ≥ 55:

#	Model	Blended $/M	Composite	Why it’s on this list
1	Gemini 2.5 Flash-Lite	$0.13	55–60	Cheapest credible model. Great for high-volume extraction.
2	Qwen3 7B	$0.06	~50	Open-weights, self-hostable, surprisingly capable for the size.
3	Llama 4 Scout	$0.17	~52	10M context window at this price is unique.
4	GPT-5 nano	$0.70	~55	OpenAI ecosystem, useful when GPT-5 is overkill.
5	Gemma 3 12B	$0.06	~50	Self-host friendly, decent multimodal.

For most production agent workloads, Claude Haiku 4.5 ($2.00 blended, composite ~65) is the sweet spot: capable enough to ship most agent flows, cheap enough that token cost stops being the bottleneck. We use it as the default model on roughly half of our internal pipelines.

Best open-weights model right now

DeepSeek V3.1 wins for general-purpose agent workloads. Composite is competitive with Claude Sonnet 4.6 on most benchmarks, pricing through hosted providers is roughly an order of magnitude cheaper than the closed frontier, and the context window (128K) is sufficient for any realistic agent loop. The weakness is throughput — about 60 tokens/sec on the major hosts versus 95 on Claude Sonnet — which matters for latency-sensitive applications.

Kimi K2 is the dark-horse pick. The benchmarks are competitive with DeepSeek V3.1 and the 200K context window is larger. Available through Moonshot directly and via the usual hosts.

Qwen3 235B is the right pick if you have specific needs around multilingual or Chinese-language workloads where Qwen’s training data is strongest.

For self-hosting, Llama 4 Maverick is the easiest model on this list to deploy on a single 8x H100 node and the benchmark numbers are competitive at that scale. Llama 4 Scout is the right pick if you genuinely need the 10M context window — no closed model offers that today.

Pair any of the above with Together AI or Fireworks AI if you want hosted access, or OpenRouter for unified routing across providers.

Best closed model per task

Coding (autonomous): Claude Opus 4.7. SWE-bench leadership, longest-tested agent capability.
Coding (interactive): Claude Sonnet 4.6 or GPT-5 mini. Faster, cheaper, comparable quality on shorter tasks.
Multi-step agent loops: Claude Sonnet 4.6 or Opus 4.7. Tool-use reliability and prompt caching matter more than raw benchmark score.
Reasoning-heavy (math, theory): o3 or DeepSeek R1. The reasoning-trained models open a real gap here.
Multimodal: Gemini 2.5 Pro. MMMU lead, 1M context, native video.
Long-context (>500K): Gemini 2.5 Pro or Llama 4 Scout. Both genuinely handle >500K; everyone else degrades.
Cost-sensitive production: Claude Haiku 4.5 or Gemini 2.5 Flash. 80% of the capability at 10% of the cost.

The Pareto frontier

The price-vs-performance scatter on the Leaderboard highlights the Pareto-optimal models — the ones that are not dominated by any cheaper alternative at equal-or-higher score. As of May 2026, the frontier consists of:

Free-tier compute (open-weights, self-hosted): Qwen3 7B → Gemma 3 27B → Llama 4 Maverick → DeepSeek V3.1.
Hosted, mid-range: Gemini 2.5 Flash-Lite → Gemini 2.5 Flash → Claude Haiku 4.5 → Claude Sonnet 4.6.
Hosted, frontier: Gemini 2.5 Pro → Claude Opus 4.7 / GPT-5.

The interesting cluster is mid-range — Gemini 2.5 Flash, Claude Haiku 4.5, and DeepSeek V3.1 trade places depending on workload shape, and we have shipped agents on all three. The choice usually comes down to which provider you have institutional trust with, not which has a 2-point benchmark edge.

How to read this leaderboard if you’re choosing a model for production

Three rules we apply every time we pick a model for a real workload, in order:

1. Define your eval set first. The leaderboard helps you narrow from 50 models to 5 candidates. Your eval set narrows from 5 to 1. We run every candidate on 30-50 examples of the real task before committing. The eval set is the most valuable artifact a serious AI engineering team owns.

2. Optimise for the cached-read price column, not the headline price. Most production workloads have substantial repeated context — system prompts, retrieval contexts, conversation history. Anthropic’s prompt caching drops effective input cost by ~80% on those workloads; OpenAI’s automatic caching is similar; Google’s flash models have implicit caching too. If your workload looks anything like this, the cached-read price is your real cost.

3. Latency matters more than benchmarks for interactive workloads. A 90-percentile latency of 3 seconds versus 8 seconds is the difference between a usable product and an unusable one. Throughput numbers on the leaderboard are a starting point; measure the latency of the actual model on your actual task before you decide.

Methodology

Benchmark numbers come from vendor model cards, the LMSYS Chatbot Arena, Artificial Analysis, HuggingFace’s Open LLM Leaderboard, the Aider leaderboard, and SWE-bench Verified’s official leaderboard. Where a model’s vendor has not published a particular benchmark, we leave the cell empty rather than guess; this means models with thinner benchmark coverage have a fuzzier composite score, so use composite as a starting point rather than an oracle.

Pricing is taken from each provider’s first-party API. Open-weights pricing reflects the going rate at OpenRouter / Together / Fireworks. We update the dataset monthly; the last-updated stamp at the top of the Leaderboard is the source of truth.

For observability on whichever model you end up choosing, Langfuse is first-rate and has a generous free tier. Helicone, long the other default pick, has been in maintenance mode since Mintlify acquired it in March 2026 — existing deployments keep working, but start new projects on Langfuse.

FAQ

Why don’t you weight the composite score? Because different workloads weight benchmarks differently. A coding-tool team weights SWE-bench heavily; a research-agent team weights GPQA. A single weighted composite would hide that variation.

Why is HumanEval saturated? Because every frontier model now scores >90, and the remaining variation is below the noise floor of the benchmark itself. The model that scores 92.5 is not meaningfully better than the model that scores 91.0.

What about MMLU (the original)? Saturated. We track MMLU-Pro instead, which is a harder version released to break the saturation.

Why aren’t models like Claude 3.5 Sonnet on the list? We dropped models superseded by a newer same-vendor model in the same tier. Claude 3.5 Sonnet was replaced by Claude Sonnet 4.6. If you specifically need pricing for legacy models, vendor docs are the source of truth.

Where do I learn how to choose between these for an actual agent system? Start at AI Agents for the conceptual foundation, then Tokens & Context to understand the resource constraints that shape model choice. The Best AI Agent Frameworks 2026 post covers the framework half of the same decision.

How often does the leaderboard update? Monthly, with the timestamp shown at the top of the Leaderboard page. New model releases get added within seven days of public availability.

Continue reading

AI Models Leaderboard — interactive table behind this post, with sort/filter and a cost calculator.
Best AI Agent Frameworks 2026 — the framework half of the same decision.
Tokens & Context — how context window sizes translate into real cost and capability.
CrewAI vs LangGraph vs AutoGen — pair a model choice with a framework choice.
All Best-Of Lists — index of every tested shortlist on the site.