Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult

24th November 2025

Anthropic released Claude Opus 4.5 this morning, which they describe as the “best model in the world for coding, agents, and computer use”. This is their attempt to retake the crown for best coding model after significant challenges from OpenAI’s GPT-5.1-Codex-Max and Google’s Gemini 3, both released within the past week!

The core characteristics of Opus 4.5 are a 200,000 token context (same as Sonnet), 64,000 token output limit (also the same as Sonnet), and a March 2025 “reliable knowledge cutoff” (Sonnet 4.5 is January, Haiku 4.5 is February).

The pricing is a big relief: $5/million for input and $25/million for output. This is a lot cheaper than the previous Opus at $15/$75 and keeps it a little more competitive with the GPT-5.1 family ($1.25/$10) and Gemini 3 Pro ($2/$12, or $4/$18 for >200,000 tokens). For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $1/$5.
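
To put those per-token rates in perspective, here’s a quick back-of-the-envelope comparison using the prices listed above. The 50,000 input / 5,000 output token request is just an illustrative size I’ve picked for the sake of the arithmetic, not anything from the announcements:

# Cost of a single hypothetical request (50,000 input + 5,000 output tokens)
# at the published per-million-token rates quoted above.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "Claude Opus 4.5": (5.00, 25.00),
    "Claude Opus 4.1": (15.00, 75.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-5.1": (1.25, 10.00),
    "Gemini 3 Pro (<=200k tokens)": (2.00, 12.00),
}

input_tokens, output_tokens = 50_000, 5_000

for model, (input_price, output_price) in PRICES.items():
    cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    print(f"{model:<30} ${cost:.3f}")

That hypothetical request comes out at roughly $0.375 on Opus 4.5, compared to $1.125 at the old Opus pricing.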

The “Key improvements in Opus 4.5 over Opus 4.1” document has a few more interesting details.

I had access to a preview of Anthropic’s new model over the weekend. I spent a bunch of time with it in Claude Code, resulting in a new alpha release of sqlite-utils that included several large-scale refactorings—Opus 4.5 was responsible for most of the work across 20 commits, 39 files changed, 2,022 additions and 1,173 deletions in a two day period.

It’s clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in the milestone for the alpha. I switched back to Claude Sonnet 4.5 and... kept on working at the same pace I’d been achieving with the new model.

With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected.

I’m not saying the new model isn’t an improvement on Sonnet 4.5—but I can’t say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two.

This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn’t possible before. In the past these have felt a lot more obvious, but today it’s often very difficult to find concrete examples that differentiate the new generation of models from their predecessors.

Google’s Nano Banana Pro image generation model was notable in that its ability to render usable infographics really does represent a task at which previous models had been laughably incapable.

The frontier LLMs are a lot harder to differentiate between. Benchmarks like SWE-bench Verified show models beating each other by single-digit percentage point margins, but what does that actually mean for the real-world problems I need to solve on a daily basis?

And honestly, this is mainly on me. I’ve fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these, but they’ve fallen one-by-one and now I’m embarrassingly lacking in suitable challenges to help evaluate new models.

I frequently advise people to stash away tasks that models fail at in their notes so they can try them against newer models later on—a tip I picked up from Ethan Mollick. I need to double-down on that advice myself!
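
If you want to do this systematically, the cheapest version is a plain file of stashed prompts plus a few lines of script to replay them against whatever model just came out. Here’s a minimal sketch using the Anthropic Python SDK - the stash file name and format are my own invention, and the model ID is an example that may not match the current API alias exactly:

import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# stashed-failures.json is a hypothetical file of prompts that stumped an older model,
# e.g. [{"prompt": "...", "failed_on": "claude-sonnet-4-5"}, ...]
with open("stashed-failures.json") as f:
    tasks = json.load(f)

for task in tasks:
    response = client.messages.create(
        model="claude-opus-4-5",  # example model ID - swap in whatever you want to test
        max_tokens=4096,
        messages=[{"role": "user", "content": task["prompt"]}],
    )
    print(f"Previously failed on {task['failed_on']}:")
    print(response.content[0].text)
    print("-" * 40)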

I’d love to see AI labs like Anthropic help address this challenge directly. I’d like to see new model releases accompanied by concrete examples of tasks they can solve that the previous generation of models from the same provider were unable to handle.

“Here’s an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5” would excite me a lot more than a single-digit percentage improvement on a benchmark with a name like MMLU or GPQA Diamond.

In the meantime, I’m just going to have to keep on getting them to draw pelicans riding bicycles. Here’s Opus 4.5 (on its default “high” effort level):

The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards while the handlebars appear to point forwards. There is also something that looks a bit like an egg on the handlebars.

It did significantly better on the new, more detailed prompt:

The pelican has feathers and a red pouch - a close enough version of breeding plumage. The bicycle is a much better shape.
