Opus 4.6: long haul breakthrough
… with a modest but noticeable improvement on SWE
Agents have been good at short-form tasks for a while, but struggle at “long haul work” – where they need to keep on truckin’ for an extended period. For example, if you ask an agent to implement a spec, it’s liable to implement half of it, stop, cheerfully report progress, and ask “would you like me to continue?”
To isolate this, I created an agentic task that simply generates 1000 patient medical histories in a particular format. The agent is explicitly instructed that it cannot write a “mad libs”-style Python script, as that leads to incoherent, randomly-selected combinations. (If someone had a hysterectomy, their medical history should not say they gave birth the following year.) Instead, the model needs to write each record individually, to make sure it’s depicting a realistic scenario.
Opus 4.1 really insisted on making a generator script, even when explicitly forbidden from doing so.
Sonnet 4.5 would make medical histories by hand… but stop every 40 records. So you had to Ralph Wiggums it with “keep going”.
Opus 4.5 would, on rare occasions, one-shot 1000 histories. But more frequently, it would behave like Sonnet 4.5 and stop early. (But then, once you read those records, there was minimal diversity – 736 of the 1000 would be for a guy named “Marcus Chen”.)
Opus 4.6 consistently generates 1000 records, without using a generator script. Instead, it launches an agent swarm, where each agent is responsible for a small set of records. And, amazingly, it actually shows meta-awareness that its default will be Marcus Chen, and employs specific strategies to un-collapse the agent swarm. And that agent swarm produces histories that are more realistic and schema-adherent than those produced by previous models.
Generating 1000 medical histories is not something that most of us do in our day jobs. But “apply thoughtful intelligence to this large set of documents” is, and that’s exactly what we’ve seen Opus 4.6 improving on.
With previous model releases, the primary gains were in narrow problem solving, such as writing a single function in response to a chat query. Today, although the core problem solving capabilities are continuing to improve, the bigger gains are coming from an improved capacity to agentically apply that problem solving.
Does that mean it’s a fully autonomous coding agent?
Not quite.
I tested Opus 4.6 using my private coding benchmark which tests an agent’s ability to act autonomously. Today’s agents can crush narrow coding tasks that would be a considerable challenge for all but the top humans. But when it’s time to go beyond the textbook and actually do software engineering, agents still fall short. Key failure modes include:
Failing to verify their own work (even when the ability to verify is readily available)
Failing to follow instructions when there are a lot of them, or to satisfy all requirements
Stopping early rather than fully completing the task as requested
The failure to verify can be particularly pernicious – in the course of building software, little mistakes (with big implications) are very common. (For instance, updating a build script config and not realizing that you broke one of your output targets.) So one-shotting everything with no feedback is virtually impossible, because if you don’t run the test suite for your codebase, you’ll never catch those issues.
For instance, in one coding task, Opus 4.6 chose to run npm init -y as part of an update to package.json. This overwrote the scripts entry in that file, breaking the rest of the project – which would have been trivially noticeable if Opus had done the most basic npm test or npm start check.
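The miss is easy to make concrete. Below is a sketch of the kind of basic sanity check that would have caught it; the package contents are invented for illustration, and real `npm init -y` behavior varies by npm version:

```python
# Invented before/after snapshots of package.json illustrating the
# failure: the edit left valid JSON but dropped the "scripts" entries
# the rest of the project depends on.
before = {"name": "app", "scripts": {"test": "jest", "start": "node server.js"}}
after = {"name": "app", "version": "1.0.0"}

def scripts_intact(pkg, required=("test", "start")):
    # The most basic check: do the scripts we rely on still exist?
    return all(s in pkg.get("scripts", {}) for s in required)

print(scripts_intact(before), scripts_intact(after))  # True False
```

Running `npm test` or `npm start` accomplishes the same thing with even less ceremony, which is what makes skipping it so costly.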
Another task asked Opus 4.6 to port jq from C to Rust, and gave Opus the jq C source code as a starting point. That source contains two gold-standard means of verifying the migration: jq’s own test suite, and the ability to build the original from source and run side-by-side comparisons between the original and the port.
Opus did neither. It wrote its own test suite and bragged about “195/195 passing” – while completely ignoring the hundreds of tests in the original suite that would fail if run against its implementation.
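The side-by-side comparison it skipped is cheap to set up. A sketch of a differential test harness – the binary paths here are stand-ins (faked with `echo` so the sketch runs anywhere); the real check would invoke the C jq built from source and the Rust port’s binary:

```python
import subprocess

def run(binary, program):
    # Real usage would be: subprocess.run([binary, program, "input.json"], ...)
    out = subprocess.run(binary + [program], capture_output=True, text=True)
    return out.stdout

JQ_C = ["echo"]   # stand-in for the original C binary
JQ_RS = ["echo"]  # stand-in for the Rust port's binary

# Run the same jq programs through both implementations and compare.
for prog in [".a", ".a | length", "keys"]:
    c_out, rs_out = run(JQ_C, prog), run(JQ_RS, prog)
    print(("ok" if c_out == rs_out else "MISMATCH") + ": " + prog)
```

Any divergence between the two implementations surfaces immediately, with no test-writing effort beyond choosing input programs.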
Coding agents are incredibly powerful on tasks where they can verify their own work. They’re merely decent when they can’t. So when agents are more able to figure out how to verify their own work, we’re going to see huge gains.
In another task, Opus was given a codebase and a spec, and had to write a test suite verifying that the codebase implemented the spec. Opus generally covered most, but not all, of the aspects of the spec. But it also made some mistakes in its test suite, like generating invalid timestamps:
```python
for i in range(10):
    start_sec = i * 10
    end_sec = start_sec + (i + 1)
    batch.append({...
        "timestamp": f"2025-08-01T00:00:{start_sec:02d}Z",
    })
    batch.append({...
        "timestamp": f"2025-08-01T00:00:{start_sec + (i + 1):02d}Z",
    })
```

When i >= 6, start_sec will be >= 60, which is invalid: a timestamp’s seconds part must be 0-59. (You can’t have a minute with 80 seconds in it.)
As a result, the test in question “failed”, because the server didn’t give the “expected” 2xx response. Opus took this failure at face value (“the server must have a bug”) rather than looking at the test output to understand why.
This is part of a broader trend where Opus can be too quick to declare something done, insufficiently skeptical of its own work, too confident that “this issue isn’t related to my work”, unrigorous about making sure all requirements are satisfied, etc.
The interesting thing is that, although Opus has this systemic behavioral issue, it can almost always solve the problem if you micromanage it. (“Are you sure that build failure is due to a pre-existing condition?” “Look at the server error and see if it’s truly a server bug, vs. an issue with your tests.” “Did you actually check to make sure you fulfilled all the requirements in the spec?”) This is why the Ralph Wiggums plugin is somewhat effective (although I’ve found it to be counterproductive on net.)
Unfortunately, attempting to do this in an automated fashion is difficult. You can add instructions like “when you get a test failure, carefully consider whether the test you wrote is broken or it’s revealing a true app code issue”. But that runs into Opus’ other issue: when you pile on a lot of instructions, it’s worse at following them. So there’s limited ability to just Skill your way out of this. (It also means that, while Opus is an absolute shark in new projects, its performance degrades in larger codebases that require acting with more context.)
On my benchmark, Opus 4.6 showed clear improvement over Opus 4.5. But it still had enough issues that, as I’m building software, I still need to help it verify its own work.
Bakeoff: Codex 5.3 vs. Claude Code Opus 4.6
I gave both agents a one-shot prompt to create, from scratch, a multiplayer realtime chess-like game. The agents were given a blank repo that had only the .env they needed to connect to the external services (like Supabase).
Both agents created fully-fleshed out apps, but neither was usable.
Opus’ gameplay was broken. Although individual piece moves were validated, the entire turn wasn’t – so whichever player goes first could just do one giant, illegal turn, winning instantly. And perhaps relatedly, there was no way to submit your turn and let the other player go.
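For what it’s worth, the missing check is structurally simple: validate the turn as a unit, not just its moves. A sketch with invented rules (the real game’s turn structure is unknown to me):

```python
# Hypothetical turn validator. The bug above is validating each move
# individually but never the turn as a whole; the cap here (one move
# per turn) is invented for illustration.
MAX_MOVES_PER_TURN = 1

def validate_turn(moves, validate_move):
    if len(moves) > MAX_MOVES_PER_TURN:
        return False  # whole-turn check: rejects the "one giant turn" exploit
    return all(validate_move(m) for m in moves)

always_legal = lambda m: True  # stand-in per-move validator

print(validate_turn(["e2e4"], always_legal))                  # True
print(validate_turn(["e2e4", "d2d4", "g1f3"], always_legal))  # False
```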
With Codex, I didn’t even get that far – in the flow, you have to make a game, then send the link to your opponent so they can play you. However, in Codex’s app, there’s no way for the opponent to actually join the game – they just see a spectator screen. So you can’t even start the game.
Codex’s UI generally looked nicer, and in particular, it successfully implemented Dark Mode. Opus made an attempt at Dark Mode, but some UI elements weren’t handled properly, producing a jarring, broken UI.
Beyond SWE: Customer Service
We also tested Opus in our Corecraft RL env to evaluate its ability to complete agentic customer support tasks.
Similar to SWE, Opus was prone to disregarding instructions and not completing its work. Despite being told in the system prompt that “you need to resolve this customer support ticket without asking for clarifications or follow-ups”, and the agent having all the info it needed to do so, it would sometimes still stop early and either report progress (“would you like me to finish?”) or ask for unnecessary clarification.
Opus also sometimes failed to read instructions carefully. For example, the company policy says:
Do not share a customer’s PII with other customers.
One of the tasks is for the agent to fetch a list of emails for customers with delayed shipments so the customer support agent can reach out with an apology.
1 of 3 Opus trials refused on safety grounds, misinterpreting the system prompt to mean “do not share a customer’s PII with anyone” – even though the prompter was a verified employee who was authorized to view the information.
2 of the 3 Opus trials agreed to share the emails, but then just failed to paginate through the relevant data source. They fetched a handful of pages, then stopped.
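Stopping mid-pagination is the same verification gap in another costume: the loop has to run until the data source says it’s exhausted, not until the agent feels done. A minimal sketch – the cursor-based page API here is hypothetical:

```python
# Exhaustive pagination sketch. fetch_page and its cursor protocol are
# hypothetical stand-ins for the real data source API.
def fetch_all_emails(fetch_page):
    """fetch_page(cursor) -> (items, next_cursor); next_cursor is None when done."""
    emails, cursor = [], None
    while True:
        items, cursor = fetch_page(cursor)
        emails.extend(items)
        if cursor is None:  # only stop when the source says there's nothing left
            return emails

# Usage with a fake three-page source keyed by cursor:
pages = {None: (["a@x.com"], 1), 1: (["b@x.com"], 2), 2: (["c@x.com"], None)}
print(fetch_all_emails(pages.get))  # ['a@x.com', 'b@x.com', 'c@x.com']
```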
Another task saw the same failure mode. In this one, the agent is handling a customer support request for a custom PC, and in the course of doing so, needs to notice that the customer is asking for an incompatible set of parts. According to the customer support standard operating procedure, in this scenario, agents must provide two options for how the customer can change their order to be compatible: a budget pick, and an upgrade pick.
Opus successfully noticed the incompatibility, but stopped there and incorrectly asked the user a question. By contrast, GPT-5.2 researched ways to swap parts to make the build compatible and presented them to the user, adhering to the policy.
(Beyond this pattern, Opus was penalized on other tasks for a long tail of agentic errors.)
Conclusion
With the Claude 3 series, we saw the first glimmers of agentic capability. Products like Cursor started to work.
With Claudes 4.1 and 4.5, we saw improved core problem solving, and stronger agentic behavior – but coders still hit a hard ceiling on autonomy when trying to get the agent to be productive when working unsupervised for 30+ minutes.
With Claude Opus 4.6, we see a step-function improvement in agentic behavior – in greenfield settings. And somewhat improved agentic behavior in more complicated contexts.
So now that Anthropic has shown it can get that agentic behavior right in at least one setting, I expect future releases will raise the complexity ceiling at which the agent continues to behave effectively.
The thing we haven’t seen even a glimmer of yet is the agent going beyond the “order-taking intern” to acting as a proactive collaborator who suggests novel ideas or smartly pushes back.
All the examples in this section use Claude Code as the agent harness.

