Anthropic launches Claude Opus 4.6 with 1M token context, agent teams, and top benchmark scores

Anthropic has released Claude Opus 4.6, its new flagship AI model, and the upgrade is substantial across almost every dimension that matters for real work. Better coding, longer context, multi-agent coordination, stronger reasoning benchmarks, and new tools for developers to control cost and speed. Here is what has changed, and what it means practically.

A bigger context window, finally

The headline feature for Opus users is a 1 million token context window, currently available in beta on the Claude Platform. To put that in perspective: 1 million tokens is roughly 750,000 words, or somewhere in the region of several large novels, a full codebase, or months of meeting transcripts fed in at once.

Previous Opus models did not offer this. And the benchmark numbers suggest Claude actually uses that context well. On MRCR v2, which tests a model’s ability to retrieve specific information buried inside large documents, Opus 4.6 scores 76%. Claude Sonnet 4.5, for comparison, scores 18.5%. That is a meaningful difference if you are working with long contracts, research reports, or codebases where the relevant detail could be anywhere.

Benchmark performance worth paying attention to

Anthropic is claiming top scores on several evaluations, and unlike some benchmark announcements, a few of these are worth highlighting specifically.

On Terminal-Bench 2.0, an evaluation focused on agentic coding in real terminal environments, Opus 4.6 scores highest of any frontier model currently available. On Humanity’s Last Exam, a notoriously difficult multidisciplinary reasoning test, it also leads the field.

The more commercially interesting result is on GDPval-AA, which measures performance on economically valuable knowledge work in finance, legal, and similar domains. Opus 4.6 outperforms OpenAI’s GPT-5.2 by around 144 Elo points on this evaluation, and beats its own predecessor, Claude Opus 4.5, by 190 points. For knowledge workers using AI to support real business tasks, that gap is meaningful.

On legal reasoning specifically, Opus 4.6 scored 90.2% on BigLaw Bench, the highest of any Claude model to date, with 40% of answers rated as perfect.

Agent teams: the most practically significant addition

The feature with the most day-to-day implications, particularly for developers and enterprise teams, is what Anthropic is calling agent teams inside Claude Code.

The idea is straightforward: instead of one AI agent working through a complex task step by step, you can now split that task across multiple agents running in parallel, each handling its own piece and coordinating with the others. Anthropic describes it as agents that can own subtasks independently rather than handing work back to a single coordinator at every step.

For large software projects, data pipelines, or research tasks with genuinely separable components, this can reduce wall-clock time significantly. Anthropic tested this on cybersecurity investigations specifically: across 40 blind evaluations, Opus 4.6 using up to 9 subagents and over 100 tool calls produced the best results 38 out of 40 times compared to Claude 4.5 models.

Adaptive thinking and effort controls

Anthropic is introducing a new thinking mode called adaptive thinking. Rather than forcing the model to think deeply on every prompt (which adds latency and cost), or disabling extended thinking entirely, adaptive thinking lets the model read contextual signals and decide how much reasoning to apply.

At the default effort level (high), the model almost always thinks. At lower effort levels, it will skip extended reasoning on simpler questions. Developers can control this directly using the /effort parameter, and Anthropic recommends dialling down from the default if the model is overthinking routine tasks.

This is a practical acknowledgement that not every query needs the same compute budget, and it gives teams a straightforward lever to manage cost without sacrificing quality on the tasks that need it.

There is also a fast mode for Opus models, which delivers up to 2.5x faster output token generation at premium pricing ($30/$150 per million tokens). The model itself is unchanged; it is the same intelligence running on faster inference infrastructure.

What this means for different users

For enterprise teams already using Claude, Opus 4.6 is a direct upgrade on the tasks where Claude already performs well: long-document analysis, legal and financial reasoning, and complex research. The 1M context window and the improved long-context retrieval scores are the most tangible improvements here.

For developers building on the Claude API, the agent teams capability and the new adaptive thinking controls are the two things to evaluate first. Context compaction, which automatically summarises earlier parts of a conversation when the context window fills up, is also available on the API now, enabling longer-running agentic tasks without manual context management.

For legal and cybersecurity teams, the BigLaw Bench score and the cybersecurity investigation results are specific, verifiable signals that the model has improved on domain-relevant reasoning.

For everyday Claude users on Pro, Max, Team, or Enterprise plans, Opus 4.6 is available now. Pricing is unchanged at $5/$25 per million input/output tokens.

Availability

Claude Opus 4.6 is available on the Claude Platform, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry on Azure. It is also now generally available in GitHub Copilot.

Anthropic has also released Claude in PowerPoint as a research preview and made upgrades to Claude in Excel. The full technical details, including the system card and API documentation, are on Anthropic’s site.