Opus 4.8 Aggressively Pursues the Wrong Path

But it's really good at that pursuit

Jun 11, 2026

Increasingly, the most useful lens on frontier coding capability is “model + harness = agent”, rather than the model in isolation. So this review is really about “Opus 4.8 Max plus Claude Code as it exists today.”

Late last year, with Opus 4.1 and 4.5, we saw big leaps in coding capability. But their ability to act like useful contributors was severely curtailed by their laziness. If you gave them a narrow task, they’d crush it. If you asked them to do something that required repeated iteration or work, they’d give up after about 20 minutes.

Opus 4.6 made substantial improvements here, in that if you asked it to do something simple but tedious, it would actually do it. But it was still insufficiently rigorous in checking its own work – it would report tasks as complete without running the tests, for instance.

With Opus 4.8, this is largely solved:

With sufficient skills/prompting, it will generally pursue an objective autonomously and indefinitely.
Claude Code now has a Ralph loop equivalent: /goal
With Dynamic Workflows (“Ultracode”), Claude is happy to crank through thousands of menial tasks in parallel – the exact use case that Opus 4.1 and 4.5 struggled with.

Demonstrating this: Opus 4.8 is the first agent to decently solve my “full repo rewrite task”. In this task, the agent is given a real, multi-year codebase that’s been licensed from a dead startup. It already had a robust black-box HTTP-level backend test suite, which I expanded. The task for the agent is to do a full port of the backend to a different stack, with the expectation that the black-box test suite still passes.

Before Opus 4.8, most agents were trainwrecks on this – tons of failing tests, or refusing to even attempt the task. Opus 4.8 gets 83% of the test suite passing. Obviously, there’s still room for improvement, but this is a big leap.

The next big frontier for improving Opus 4.8’s problem-solving capability is strategic reasoning. Once a path is defined, it’s good at barreling down that path. It’s often weak at figuring out what the best path is, however – and that leads to overall worse performance, because it builds an “80% decent” solution on a fundamentally unsound approach while overlooking a much simpler fix.

Examples of picking the wrong problem-solving strategy

“Why isn’t my dev env working?”

You’re an engineer. You open your laptop one morning, and your local dev env doesn’t work due to cryptic errors. You check Slack, and it seems like everyone else is proceeding normally. Do you:

(a) Go on a crazy rabbit hole trying to debug the errors?
(b) First, just clone a fresh instance of the repo + reinstall deps?

The obvious choice is (b). It’s very easy to check, and given that no one else on your team seems to be having the issue, your prior should be that it’s something messed up with your local env.

Opus 4.8 chose (a). It’s connected to my Slack, and it could also see from the work log markdown files of every previous agent run in the repo that things generally seemed to be working fine. The git history would also have shown it that there were no recent changes that could plausibly have caused a break. So I think it’s fair to expect it to have chosen (b) given the context available to it.

Fixing a missing dep

Opus was helping me set up software running in a container. The software crashed on boot, because it expected to find a dependency at a given filepath, but the container installs it elsewhere.

Opus’s solution was to add a new plugin to the software that cloned existing functionality but hardcoded a different expected path.

The actual solution was a one-line bash command added to the container build script: just symlink the dep so it shows up where it’s expected to be.
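To make the symlink fix concrete, here is a minimal sketch of the idea. It's written in Python rather than bash, and both paths are made up for illustration (the real ones aren't given here). The shell equivalent is a single `ln -s <actual> <expected>`.

```python
import os
import tempfile


def link_dependency(actual_path: str, expected_path: str) -> None:
    """Make `expected_path` resolve to `actual_path` via a symlink."""
    os.makedirs(os.path.dirname(expected_path), exist_ok=True)
    if not os.path.lexists(expected_path):
        os.symlink(actual_path, expected_path)


# Demo in a throwaway directory. Both locations are hypothetical.
root = tempfile.mkdtemp()
actual = os.path.join(root, "opt", "vendor", "libdep.so")  # where the container installs it
expected = os.path.join(root, "usr", "lib", "libdep.so")   # where the software looks

os.makedirs(os.path.dirname(actual))
with open(actual, "w") as f:
    f.write("fake dependency contents")

link_dependency(actual, expected)

# The software can now open the dep at the path it expects.
with open(expected) as f:
    contents = f.read()
```

No new plugin and no duplicated functionality: the software is untouched, and the fix lives entirely in the environment setup.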

Parallelizing a big task

Opus was helping me run a big parallel task across a sandbox fleet. It was going slowly, so it tried all sorts of micro-optimizations.

The correct solution was… just bump up the parallelism from 4x to 100x.

Opus also did a bunch of optimizations around making the sandboxes build faster, when it should have just asked “why do we rebuild the sandbox for every invocation in the first place?”. The correct solution was just to cache image builds.
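As a rough sketch of what caching image builds means: key each build on a hash of its spec, and only build when that key hasn't been seen. `build_image`, `get_image`, and the spec string are all hypothetical stand-ins, not the actual sandbox tooling.

```python
import hashlib

build_count = 0  # how many expensive builds actually ran
_image_cache: dict[str, str] = {}


def build_image(spec: str) -> str:
    """Stand-in for a slow sandbox image build (hypothetical)."""
    global build_count
    build_count += 1
    return "image-" + hashlib.sha256(spec.encode()).hexdigest()[:12]


def get_image(spec: str) -> str:
    """Return a cached image for this spec, building only on a cache miss."""
    key = hashlib.sha256(spec.encode()).hexdigest()
    if key not in _image_cache:
        _image_cache[key] = build_image(spec)
    return _image_cache[key]


# Made-up build spec; every invocation uses the same one.
spec = "base=python:3.12\npip install -r requirements.txt"
images = [get_image(spec) for _ in range(100)]
```

In this toy version, a hundred invocations with an identical spec trigger exactly one build. Making a single build faster matters much less once you stop repeating it.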

Finally, the queue was inefficient because Opus’s orchestration script was artificially subdividing it. It would wait for an entire batch to complete before starting the next one. But because every task was independent, there was no reason to do this – all tasks should just have been put in a single homogeneous queue. Opus didn’t take a step back to realize this.
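Here's a small simulation of why the batch barrier hurts. The task durations and worker count are invented for illustration, and this is a sketch of the scheduling logic only, not the actual orchestration script.

```python
import heapq


def batched_makespan(durations, workers):
    """Run tasks in fixed batches of size `workers`, waiting for each whole batch."""
    total = 0
    for i in range(0, len(durations), workers):
        # A batch only finishes when its slowest task finishes.
        total += max(durations[i:i + workers])
    return total


def queue_makespan(durations, workers):
    """Single shared queue: a worker takes the next task as soon as it is free."""
    free_at = [0] * workers  # min-heap of times at which each worker goes idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest available worker
        heapq.heappush(free_at, start + d)  # busy until start + d
    return max(free_at)


durations = [10, 1, 1, 1, 10, 1, 1, 1]  # made-up task times, in minutes
batched = batched_makespan(durations, workers=4)  # 20
queued = queue_makespan(durations, workers=4)     # 11
```

With batching, three workers sit idle for nine minutes in each batch while the slow task finishes. With one shared queue, the idle workers pick up the remaining tasks immediately, so the second slow task starts at minute 1 instead of minute 10.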

Other issues

Opus 4.8 sometimes demonstrates a lack of general judgment. I worked with it to debug an issue where writes to a datastore weren’t reflected in subsequent reads. Opus thought that the datastore itself was to blame, so I asked it to write a minimal repro script. It did so, and proudly declared that it had proven the issue – but used a different read API than the actual one our prod system used, so the conclusion was invalid.
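To illustrate why the choice of read API matters, here is a purely hypothetical toy. `ToyStore` and its two read paths are invented and say nothing about how the real datastore behaves. The only point is that the same write-then-read check can give different answers depending on which read path it exercises.

```python
class ToyStore:
    """Invented datastore with two read paths that behave differently."""

    def __init__(self):
        self._primary = {}
        self._replica = {}  # in this toy, never updated by write()

    def write(self, key, value):
        self._primary[key] = value

    def read_used_in_prod(self, key):
        # Assumption for the toy: prod reads go to the primary.
        return self._primary.get(key)

    def read_used_in_repro(self, key):
        # Assumption for the toy: the repro script reads from a lagging replica.
        return self._replica.get(key)


def write_then_read_ok(read):
    """Write a value, read it back through `read`, and report whether it matched."""
    store = ToyStore()
    store.write("k", "v1")
    return read(store, "k") == "v1"


prod_path_ok = write_then_read_ok(ToyStore.read_used_in_prod)
repro_path_ok = write_then_read_ok(ToyStore.read_used_in_repro)
```

In this toy, the repro path "proves" stale reads while the prod path is fine. A minimal repro only supports a conclusion about the code path it actually exercises.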

Like past models, Opus 4.8 sometimes under-uses third-party deps. Instead of pulling in a standard parsing library for a format like YAML, for instance, it’ll just try to use regex – a brittle and unnecessarily complicated approach.
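A quick sketch of how the regex approach breaks. A YAML parser in Python would need a third-party package such as PyYAML, so to keep this runnable with only the standard library I've swapped in CSV as a stand-in format. The input data is made up.

```python
import csv
import io
import re

# One header row and one data row whose first field contains a comma.
raw = 'name,note\n"Smith, J",ok\n'

# Brittle: splitting on commas with a regex ignores the format's quoting rules.
regex_rows = [re.split(r",", line) for line in raw.splitlines()]

# Robust: the dedicated parser already knows about quoted fields.
parsed_rows = list(csv.reader(io.StringIO(raw)))

regex_field_count = len(regex_rows[1])    # quoted field was split in two
parsed_field_count = len(parsed_rows[1])  # matches the two header columns
```

The regex version silently produces an extra field for the quoted name, and patching it to handle quotes, escapes, and embedded newlines means reimplementing the parser badly. The same dynamic applies to YAML, only more so.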

Previous models improved core coding competence and work ethic. Those things still matter, but the next important frontier is commonsense reasoning.

Nick Heiner's Substack
