How is the composite LLM score calculated?

The composite is the simple mean of every published benchmark for a model (MMLU-Pro, GPQA Diamond, HumanEval, SWE-bench Verified, MATH-500, MMMU, Aider Polyglot, tau-bench). Where a vendor has not reported a benchmark, the cell is empty and the composite excludes it.

What does the blended price column mean?

A 3:1 input-to-output weighted average price per million tokens, approximating the cost shape of typical production agent workloads (heavy input from context, smaller output per call).

Which benchmark is the most predictive for AI coding tools?

SWE-bench Verified. It measures whether a model can resolve real GitHub issues end-to-end. A score above 70% is frontier; above 50% is shippable for most coding workloads.

How often is the leaderboard updated?

Monthly. The last-updated timestamp at the top of the page is the source of truth. New model releases get added within seven days of public availability.

Why does pricing differ from a vendor's headline price?

Headline prices are usually input or output alone. The blended price reflects realistic agent traffic, and Anthropic/OpenAI prompt caching can drop the effective input cost by 60-80% on workloads with repeated context.

AI Models Leaderboard — Benchmarks, Pricing, and Comparison (2026-07-12)

How to read this leaderboard

The composite score is the simple mean of every published benchmark for a model. We do not normalise across benchmarks because they are already roughly comparable in their published units (most are 0-100 accuracy). Where a vendor has not reported a benchmark, the cell is empty and the composite excludes it. This keeps a model with fewer reported benchmarks on the same scale as one with more, but the two composites are averaged over different benchmark sets, so tied or near-tied scores should be read carefully.
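As a minimal sketch of the calculation described above (the function name and the scores are hypothetical and are not leaderboard data), the composite is a mean over only the cells that have a value:

```python
from statistics import mean
from typing import Optional


def composite_score(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Simple mean of reported benchmarks; unreported (None) cells are skipped."""
    reported = [value for value in scores.values() if value is not None]
    return mean(reported) if reported else None


# Hypothetical scores for illustration only.
model_a = {"MMLU-Pro": 80.0, "GPQA Diamond": 70.0, "SWE-bench Verified": 60.0}
model_b = {"MMLU-Pro": 80.0, "GPQA Diamond": None, "SWE-bench Verified": 60.0}

print(composite_score(model_a))  # 70.0
print(composite_score(model_b))  # 70.0
```

Both hypothetical models land on the same composite even though the second one is averaged over two benchmarks instead of three, which is why ties deserve a second look at the individual cells.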

The blended price column is a 3:1 input-to-output weighted average, which approximates the cost shape of most production agent workloads (heavy input from context and history, smaller output per call). Prompt caching changes this calculus substantially: Anthropic and OpenAI both ship aggressive caching that can drop the effective input price by 60-80% on repeated-context workloads. If your application is conversational or RAG-heavy, the cached-read price is the one you should optimise against, not the headline input price.
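The arithmetic can be sketched as follows. This assumes the 3:1 weighting means three parts input price to one part output price, and it models caching as a fraction of input tokens billed at a cached-read rate. The function names, prices, and cached fraction are all hypothetical and are not vendor figures.

```python
def blended_price(input_price: float, output_price: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per million tokens (3:1 input-to-output by default)."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


def effective_input_price(input_price: float, cached_read_price: float,
                          cached_fraction: float) -> float:
    """Input price when cached_fraction of input tokens is billed at the cached-read rate."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return cached_fraction * cached_read_price + (1.0 - cached_fraction) * input_price


# Hypothetical prices in USD per million tokens.
input_price, output_price, cached_read_price = 4.0, 16.0, 0.5

headline = blended_price(input_price, output_price)
cached_input = effective_input_price(input_price, cached_read_price, cached_fraction=0.75)
with_cache = blended_price(cached_input, output_price)

print(headline)      # 7.0
print(cached_input)  # 1.375
print(with_cache)    # 5.03125
```

With these made-up numbers the effective input price falls from 4.0 to 1.375, a drop of about 66%, while the blended price falls less because output tokens are not cached.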

Which benchmark actually matters

MMLU-Pro measures broad knowledge across 14 disciplines. Useful as a general capability indicator; weak as a predictor of agent performance.

GPQA Diamond measures graduate-level science reasoning on questions verified to be hard for non-experts. A reliable indicator of how a model handles novel reasoning rather than retrieved facts.

HumanEval is the original "synthesise a Python function from a docstring" benchmark. Saturated for frontier models; treat scores >85% as a baseline competency check, not a differentiator.

SWE-bench Verified measures end-to-end issue resolution on real GitHub repositories — closing tickets, not solving toy problems. The single most predictive benchmark for AI coding tool performance in our experience.

MATH-500 measures competition mathematics. Saturated for reasoning-focused models (o3, R1); below saturation it is a strong indicator of chain-of-thought capability.

MMMU measures multimodal reasoning across images and text. Only matters if your application uses vision.

Aider Polyglot measures multi-language code editing across Python, Go, Rust, TypeScript, JavaScript, and C++. Practical and harder to game than HumanEval.

tau-bench measures tool-use in conversational agent settings (retail and airline scenarios). The best published proxy for "how good is this model as an agent" until a better benchmark replaces it.

Picking a model for production

Three rules of thumb from the past 12 months of testing most models on this list in production-shaped workloads:

1. Start cheaper, upgrade only on observed regression. Most agent flows do not need a frontier model. We default to Claude Sonnet 4.6 or DeepSeek V3.1 and only move to Opus / GPT-5 / o3 when we see a quality cliff. The cost difference compounds fast at production volume.

2. Optimise for cached-read price, not headline price. If your workload has any repeated context (system prompts, conversation history, RAG corpora), the cached-read price column dominates monthly spend. Anthropic and OpenAI both offer caching; Google's Flash variants have implicit caching as well.

3. Benchmark your own task before committing. Run the cheapest plausible candidate on 30 examples of your real task. The leaderboard tells you which models to consider; only your eval set tells you which model to ship.
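Rules 1 and 3 can be combined into a small evaluation loop: score candidates from cheapest to most expensive on your own examples and stop at the first one that clears a quality bar. The sketch below is one way to do that, not a description of our internal tooling. The stub models, prices, exact-match grader, and 0.9 threshold are all hypothetical; in practice each model would wrap a real API call and the grader would fit your task.

```python
from typing import Callable, Optional, Sequence

Example = tuple[str, str]      # (prompt, expected answer)
Model = Callable[[str], str]   # hypothetical wrapper around a model API call


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible grader; real tasks usually need something richer."""
    return output.strip() == expected.strip()


def pass_rate(model: Model, examples: Sequence[Example],
              is_correct: Callable[[str, str], bool] = exact_match) -> float:
    """Fraction of examples the model answers correctly."""
    if not examples:
        raise ValueError("need at least one example")
    passed = sum(is_correct(model(prompt), expected) for prompt, expected in examples)
    return passed / len(examples)


def pick_cheapest(candidates: Sequence[tuple[str, float, Model]],
                  examples: Sequence[Example], threshold: float) -> Optional[str]:
    """Name of the cheapest candidate whose pass rate meets the threshold, else None."""
    for name, _price, model in sorted(candidates, key=lambda c: c[1]):
        if pass_rate(model, examples) >= threshold:
            return name
    return None


# 30 toy examples standing in for your real task.
examples = [(str(i), str(i * 2)) for i in range(30)]

# Stub models: the cheap one fails on the last 10 prompts.
cheap_stub = lambda p: str(int(p) * 2) if int(p) < 20 else "?"
strong_stub = lambda p: str(int(p) * 2)

candidates = [("strong-stub", 10.0, strong_stub), ("cheap-stub", 1.0, cheap_stub)]
print(pick_cheapest(candidates, examples, threshold=0.9))  # strong-stub
```

Here the cheap stub passes 20 of 30 examples, misses the bar, and the loop moves up to the more expensive candidate, which mirrors the "upgrade only on observed regression" rule.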

Methodology and sources

Benchmark numbers are aggregated from vendor model cards, the LMSYS Chatbot Arena leaderboard, Artificial Analysis, the HuggingFace Open LLM Leaderboard, the Aider leaderboard, and SWE-bench Verified. Pricing is taken from each provider's first-party API on 2026-07-12; open-weights pricing reflects the typical OpenRouter / Together / Fireworks rate. We update the data monthly and bump the timestamp in this page's header on every refresh.

For the long-form analysis behind these numbers — which models we use day-to-day, which we ruled out, and how the landscape shifted in the last six months — read the companion post LLM Benchmark Comparison 2026.
