Claude Opus 4.8 Review — Benchmarks, Effort Controls, and Dynamic Workflows
Anthropic shipped Claude Opus 4.8 on May 28, 2026 — API model ID claude-opus-4-8 — and called it “a modest but tangible improvement” over Opus 4.7. That framing is honest, and it is the right lens for this review. Opus 4.8 is not a generational leap. It is a point release that moves the needle on agentic coding, math, and honesty, ships two genuinely useful product features (an effort control and Dynamic Workflows), and does it all without raising the price. This is a review of what actually changed and whether it is worth switching.
TL;DR verdict
| Claude Opus 4.8 | |
|---|---|
| Released | May 28, 2026 |
| Price | $5 / 1M input · $25 / 1M output (unchanged from 4.7) |
| Context window | 1,000,000 tokens · 128k max output |
| Headline gain | Agentic coding — SWE-bench Pro 69.2% vs 4.7’s 64.3% |
| New features | effort parameter (high/extra/max) · Dynamic Workflows in Claude Code |
| Availability | Claude API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry |
| Best for | Long-horizon agentic coding, math-heavy work, code review |
| Skip if | You only use it for chat — the delta there is small |
If you do not read past this: upgrade for agentic coding and math, re-test your scaffolding first, and the unchanged price means there is little downside to making claude-opus-4-8 your default.
What actually changed
Three things matter in this release: the benchmark deltas, the new effort control, and Dynamic Workflows. Everything else is a rounding error.
Benchmarks — the gains are concentrated in coding and math
Anthropic positioned 4.8 as a coding and agentic upgrade, and the published numbers back that up:
| Benchmark | Opus 4.8 | Opus 4.7 | Notes |
|---|---|---|---|
| SWE-bench Verified (500 problems) | 88.6% | 87.6% | +1.0 pt; leads Gemini 3.1 Pro (80.6%) |
| SWE-bench Pro (contamination-resistant) | 69.2% | 64.3% | +4.9 pt; ahead of GPT-5.5 (58.6%) and Gemini 3.1 Pro (54.2%) |
| USAMO 2026 (math olympiad) | 96.7% | 69.3% | +27.4 pt — the largest single-release jump Anthropic published |
| Humanity’s Last Exam (no tools) | 49.8% | 46.9% | Leads GPT-5.5 (41.4%) and Gemini 3.1 Pro (44.4%) |
The SWE-bench Pro number is the one to weight most. The original SWE-bench Verified set is increasingly contaminated — OpenAI stopped reporting Verified scores in early 2026 and now points to Pro — so the ~5-point Pro gain is a better signal of real-world coding improvement than the 1-point Verified bump. The USAMO jump is eye-catching but narrow: it tells you 4.8 is much stronger at competition math, not that every task improved by 27 points.
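One way to put the table in perspective is to compute both the absolute gain and the share of remaining failures that each release closed. The sketch below uses only the scores quoted in the table; the "failure cut" metric is an illustrative framing added here, not something Anthropic reports.

```python
# Scores copied from the table above: (Opus 4.8, Opus 4.7), in percent.
SCORES = {
    "SWE-bench Verified": (88.6, 87.6),
    "SWE-bench Pro": (69.2, 64.3),
    "USAMO 2026": (96.7, 69.3),
    "Humanity's Last Exam": (49.8, 46.9),
}


def deltas(new: float, old: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative cut in failure rate in percent)."""
    abs_gain = round(new - old, 1)
    # Failure rate is 100 - score; the relative cut shows how much of the
    # remaining headroom the newer model closed.
    failure_cut = round(100 * (new - old) / (100 - old), 1)
    return abs_gain, failure_cut


for name, (new, old) in SCORES.items():
    gain, cut = deltas(new, old)
    print(f"{name}: +{gain} pt, {cut}% fewer failures")
```

Read this way, the 1-point Verified bump and the roughly 5-point Pro gain are easier to compare, because each is scaled by how much room the benchmark had left.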
For the full cross-vendor picture — GPQA, MMLU-Pro, MMMU, Aider Polyglot, tau-bench, and pricing side by side — see the AI Models Leaderboard, where Opus 4.8 currently sits at the top of the composite ranking.
The new effort parameter
Opus 4.8 introduces an effort control that governs how many tokens the model will spend reasoning before it answers. It defaults to high, which Anthropic describes as its best balance of token spend and output quality. You can raise it to extra (surfaced as xhigh in the Claude Code effort menu) or max for harder problems, or drop it lower to save tokens on routine work.
This is the practical lever most teams will reach for. The interesting second-order effect, flagged by Cursor’s Michael Truell, is that on CursorBench, Opus 4.8 reaches the same result in fewer steps than 4.7 — so even at the same effort setting, the token-per-task cost tends to drop. In other words, the headline price is unchanged but the effective cost per finished task can come down on agentic workloads.
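As a rough illustration of pinning effort per task, the helper below builds request arguments without calling the API. The `difficulty` labels and the `build_request` name are hypothetical, invented for this sketch; passing effort through `extra_body` follows the snippet in the "How to switch" section and should be treated as an assumption about the request shape.

```python
from typing import Any

# Effort levels named in this review; "high" is the stated default.
EFFORT_LEVELS = ("high", "extra", "max")


def build_request(prompt: str, difficulty: str = "routine") -> dict[str, Any]:
    """Build keyword arguments for client.messages.create().

    `difficulty` is a hypothetical label for this sketch:
    "routine", "hard", or "hardest".
    """
    effort_by_difficulty = {"routine": "high", "hard": "extra", "hardest": "max"}
    if difficulty not in effort_by_difficulty:
        raise ValueError(f"unknown difficulty: {difficulty!r}")
    return {
        "model": "claude-opus-4-8",
        "max_tokens": 4096,
        # Assumption: effort is passed via extra_body, as in the
        # "How to switch" snippet of this review.
        "extra_body": {"effort": effort_by_difficulty[difficulty]},
        "messages": [{"role": "user", "content": prompt}],
    }
```

A call site would then look like `client.messages.create(**build_request("Fix the flaky test", "hard"))`, which keeps the effort decision in one place when you re-test scaffolding.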
Dynamic Workflows in Claude Code
The biggest product change ships in Claude Code, not the model card. Dynamic Workflows lets Claude write a JavaScript orchestration script that decomposes a large task and delegates the pieces across up to 1,000 parallel subagents in a background runtime. Intermediate results live in script variables rather than the main context window, which is what makes it viable at scale.
The intended jobs are codebase-scale migrations, repository-wide bug sweeps, and multi-service refactors: the kind of work that previously blew past a single context window or required you to hand-roll a fan-out harness. If you have built multi-agent pipelines by hand, this is Anthropic moving that pattern into the product. It is powerful, and it is also the feature most likely to surprise you on a bill, so trial it on one real task with its scope capped before turning it loose on a monorepo.
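For readers who have not hand-rolled a fan-out harness, the pattern looks roughly like the sketch below. It is a Python analogy for the idea, not the Dynamic Workflows API (which the release describes as a JavaScript script written by Claude). `fan_out`, `run_subagent`, and `fake_subagent` are hypothetical names, and the caps are illustrative defaults.

```python
import asyncio
from typing import Awaitable, Callable


async def fan_out(
    pieces: list[str],
    run_subagent: Callable[[str], Awaitable[str]],
    max_parallel: int = 8,
    max_pieces: int = 50,
) -> dict[str, str]:
    """Run one subagent per piece, at most `max_parallel` at a time."""
    if len(pieces) > max_pieces:
        # Hard cap: refuse oversized jobs instead of discovering them on the bill.
        raise ValueError(f"{len(pieces)} pieces exceeds cap of {max_pieces}")

    gate = asyncio.Semaphore(max_parallel)

    async def guarded(piece: str) -> str:
        async with gate:
            return await run_subagent(piece)

    outputs = await asyncio.gather(*(guarded(p) for p in pieces))
    # Intermediate results stay in this dict, not in any model's context.
    return dict(zip(pieces, outputs))


async def fake_subagent(piece: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    await asyncio.sleep(0)
    return f"migrated {piece}"


if __name__ == "__main__":
    results = asyncio.run(fan_out(["a.py", "b.py", "c.py"], fake_subagent))
    print(results)
```

The two caps are the point of the sketch: one bounds concurrency, the other bounds total work, which is the kind of limit worth setting before a first run on a large repository.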
Honesty and code review
Anthropic put real weight on reliability this cycle. The claim worth repeating: Opus 4.8 is four times less likely than Opus 4.7 to let a code flaw pass without flagging it. In practice that means fewer “looks good to me” reviews on code that quietly contains a bug — the failure mode that makes an agentic reviewer untrustworthy. It lines up with the SWE-bench Pro gain: a model that catches more of its own mistakes is a model that closes more issues correctly.
For code-review-heavy workflows this is arguably more valuable than the raw benchmark delta. A reviewer you can trust to flag the subtle stuff changes how much you can delegate.
Pricing — what it costs
Pricing is unchanged from Opus 4.7, which is the quiet headline:
- Input: $5.00 per 1M tokens
- Output: $25.00 per 1M tokens
- Prompt caching: $6.25 write / $0.50 read per 1M tokens
- Context window: 1,000,000 tokens · max output: 128,000 tokens
At the same price as the model it replaces, with measurable gains and a steps-per-task reduction on agentic work, the cost story is straightforwardly positive. If you are price-sensitive and your work is not agentic, a cheaper model on the leaderboard may still be the rational pick — Opus-class pricing only pays for itself when you are using the autonomy.
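To make the steps-per-task point concrete, here is a small cost calculator built on the prices listed above. The token counts and the 12-step versus 9-step comparison are hypothetical numbers chosen for illustration, not measurements.

```python
# Prices from the list above, in USD per 1M tokens.
PRICE_PER_MTOK = {
    "input": 5.00,
    "output": 25.00,
    "cache_write": 6.25,
    "cache_read": 0.50,
}


def task_cost(tokens: dict[str, int]) -> float:
    """Return the USD cost of one task given token counts per price category."""
    total = sum(PRICE_PER_MTOK[kind] * count for kind, count in tokens.items())
    return round(total / 1_000_000, 4)


def per_step(steps: int) -> dict[str, int]:
    """Hypothetical agentic step: 40k cached reads, 5k fresh input, 2k output."""
    return {
        "cache_read": 40_000 * steps,
        "input": 5_000 * steps,
        "output": 2_000 * steps,
    }


# Same price list, fewer steps: compare a 12-step run with a 9-step run.
print(task_cost(per_step(12)), task_cost(per_step(9)))
```

Under these assumed token counts each step costs about $0.095, so the saving scales directly with the number of steps the model no longer needs.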
Who should upgrade
- Agentic coding teams: Yes. The SWE-bench Pro gain and the honesty improvement both target the exact failure modes that matter for autonomous, multi-file work. Switch your default to claude-opus-4-8.
- Math / research: Yes, if competition-grade reasoning matters to you; the USAMO jump is real.
- Code reviewers: Yes. The “4× less likely to miss a flaw” improvement is the most reviewer-relevant change in the release.
- Chat-only users: Optional. The general-conversation delta is small; there is no penalty to upgrading, but do not expect a night-and-day difference.
- Anyone with hard-coded scaffolding: Re-test first. If your prompts assume 4.7’s verbosity or pin a specific effort behavior, validate before moving production traffic.
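For the re-test step, a minimal regression check might look like the sketch below. It assumes you supply two things yourself: `call_model`, a function that sends a prompt to a given model ID and returns text, and `passes`, whatever check your scaffolding relies on (for example, "output parses as JSON"). Both names are hypothetical.

```python
from typing import Callable


def regression_check(
    prompts: list[str],
    call_model: Callable[[str, str], str],
    passes: Callable[[str], bool],
    old_model: str = "claude-opus-4-7",
    new_model: str = "claude-opus-4-8",
) -> list[str]:
    """Return the prompts that pass on the old model but fail on the new one."""
    regressions = []
    for prompt in prompts:
        old_ok = passes(call_model(old_model, prompt))
        new_ok = passes(call_model(new_model, prompt))
        if old_ok and not new_ok:
            regressions.append(prompt)
    return regressions
```

An empty result on a representative prompt set is a reasonable bar to clear before moving production traffic.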
How to switch
Point your client at the new model ID and, if you use the SDK, decide whether to pin an effort level:
    from anthropic import Anthropic

    client = Anthropic()

    resp = client.messages.create(
        model="claude-opus-4-8",  # was claude-opus-4-7
        max_tokens=4096,
        # effort defaults to "high"; raise for harder agentic tasks
        extra_body={"effort": "high"},  # "high" | "extra" | "max"
        messages=[{"role": "user", "content": "Refactor this module and run the tests."}],
    )
    print(resp.content[0].text)

In Claude Code, the effort menu exposes the same levels (with xhigh as the label for extra), and Dynamic Workflows is available on long-running tasks without any config change. See Agent Instructions for how to scope a CLAUDE.md so the upgraded model stays inside your conventions.
FAQ
Is Opus 4.8 more expensive than 4.7? No — same $5 / $25 per million tokens, same 1M context, same 128k output ceiling.
What does the effort parameter do? It sets how many tokens the model spends reasoning before answering. Default is high; extra (xhigh) and max trade more tokens for depth on hard tasks.
What are Dynamic Workflows? A Claude Code feature where Claude writes a JavaScript orchestration script to fan a large task out across many parallel subagents in the background, keeping intermediate state out of the main context window.
Should I upgrade from 4.7? For coding, math, and review: yes. For chat: optional. Before switching, re-test any scaffolding that hard-codes effort or depends on 4.7’s output style.
How does it compare to GPT-5.5 and Gemini 3.1 Pro? On the contamination-resistant SWE-bench Pro, Opus 4.8 (69.2%) leads both GPT-5.5 (58.6%) and Gemini 3.1 Pro (54.2%). The full table is on the leaderboard.
Continue reading
- AI Models Leaderboard — Opus 4.8 versus 50+ models on benchmarks, pricing, and context window.
- Claude Code vs Cursor vs Codex — where Opus 4.8 actually runs for most coding work.
- LLM Benchmark Comparison 2026 — how to read SWE-bench, GPQA, and the rest without getting fooled.
- Multi-Agent Pipelines — the hand-rolled version of what Dynamic Workflows automates.
- All Reviews — index of every head-to-head review on the site.