Claude Sonnet 4.5 is probably the “best coding model in the world” (at least for now)

29th September 2025

Anthropic released Claude Sonnet 4.5 today, with a very bold set of claims:

Claude Sonnet 4.5 is the best coding model in the world. It’s the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains in reasoning and math.

Anthropic gave me access to a preview version of a “new model” over the weekend, which turned out to be Sonnet 4.5. My initial impressions were that it felt like a better model for code than GPT-5-Codex, which has been my preferred coding model since it launched a few weeks ago. This space moves so fast—Gemini 3 is rumored to land soon, so who knows how long Sonnet 4.5 will hold the “best coding model” crown.

The pricing is the same as the previous Sonnet: $3/million input tokens and $15/million output tokens. This remains significantly cheaper than Claude Opus—$15/$75—but still quite a bit more than GPT-5 and GPT-5-Codex, both at $1.25/$10.
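To put those per-token prices in concrete terms, here is a quick back-of-the-envelope comparison for a hypothetical session of 500,000 input tokens and 80,000 output tokens (the token counts are invented purely for illustration):

# Back-of-the-envelope cost comparison; token counts are invented
PRICES = {  # (input, output) in dollars per million tokens
    "Claude Sonnet 4.5": (3.00, 15.00),
    "Claude Opus": (15.00, 75.00),
    "GPT-5 / GPT-5-Codex": (1.25, 10.00),
}
input_tokens, output_tokens = 500_000, 80_000

for model, (price_in, price_out) in PRICES.items():
    cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
    print(f"{model}: ${cost:.2f}")

# Claude Sonnet 4.5: $2.70
# Claude Opus: $13.50
# GPT-5 / GPT-5-Codex: $1.42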

It really shines with Claude.ai Code Interpreter

The claude.ai web interface (not yet the Claude iPhone native app) recently added the ability for Claude to write and then directly execute code in a sandboxed server environment, using Python and Node.js. I wrote about that in detail three weeks ago.

Anthropic’s implementation of this code interpreter pattern is more powerful than ChatGPT’s equivalent because it can directly clone code from GitHub and install software packages from NPM and PyPI.

Sonnet 4.5 absolutely shines at using this tool!

Here’s my most ambitious experiment with it. I started like this:

Checkout https://github.com/simonw/llm and run the tests with

pip install -e '.[test]'
pytest

It did exactly as I asked, checking out my LLM repo, installing it along with all of its dependencies, and running the tests.

✅ All 466 tests passed in 167.69 seconds (about 2 minutes 47 seconds)

Then I posed it a wildly ambitious challenge:

The tool stores prompts and responses in a SQLite database. It currently uses a single conversation row and a linear collection of responses within that conversation. Experiment with the idea of adding a parent_response_id column to that responses table and using it to model the responses in a conversation as a tree instead. Use new pytest tests which you execute individually (to save time on the whole test suite) to exercise your design. Write a tree_notes.md file which you first populate with your plan and then use for notes as you go along.
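To spell out the data model that prompt is asking for: a nullable parent_response_id column lets every response point at an optional parent, so the responses in a conversation form a tree (or a forest, if several responses have no parent) rather than a flat list. Here is a minimal sketch of that idea, with hypothetical helper names, assuming the responses table has id and conversation_id columns:

import sqlite3
from collections import defaultdict

def children_by_parent(db: sqlite3.Connection, conversation_id: str):
    "Map each parent_response_id (None = root) to its child response ids."
    tree = defaultdict(list)
    rows = db.execute(
        "select id, parent_response_id from responses where conversation_id = ?",
        (conversation_id,),
    ).fetchall()
    for response_id, parent_id in rows:
        tree[parent_id].append(response_id)
    return tree

def path_to_root(db: sqlite3.Connection, response_id: str):
    "Walk parent pointers from a response back up to its root."
    path, seen = [response_id], {response_id}
    while True:
        row = db.execute(
            "select parent_response_id from responses where id = ?",
            (path[-1],),
        ).fetchone()
        parent = row[0] if row else None
        if parent is None or parent in seen:  # reached a root (or hit a cycle)
            return list(reversed(path))
        path.append(parent)
        seen.add(parent)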

As you can see in the resulting conversation transcript, it did a really impressive job of this! It churned away across dozens of different tool calls, eventually outputting the following summary of its activity:

I’ve successfully experimented with adding tree-structured conversations to the LLM CLI tool. Here’s what was accomplished:

Implementation Complete

1. Database Schema Enhancement

  • Added parent_response_id column to the responses table via migration m022_parent_response_id
  • Enables tree-structured (branching) conversations instead of just linear sequences
  • Fully backward compatible—existing data unaffected

2. Comprehensive Utility Module (tree_utils.py)

  • 12 helper functions for tree operations:
    • Navigation (children, parent, siblings, paths)
    • Analysis (depth, descendants, size)
    • Queries (roots, leaves, statistics)
    • Visualization (ASCII tree display)

3. Complete Test Suite (test_tree_conversations.py)

  • 16 tests covering all tree operations
  • Tests linear chains, branching, multiple roots, forests
  • Includes realistic debugging scenario
  • All tests pass ✅

4. Integration Verification

  • Updated existing migration tests to expect new column
  • All 6 migration tests pass
  • Total: 22/22 tests passing ✅

📦 Deliverables (7 files in /mnt/user-data/outputs/)

  1. README.md—Quick start guide
  2. IMPLEMENTATION_SUMMARY.md—Complete technical overview
  3. MIGRATION_CODE.md—Database migration details
  4. TEST_RESULTS.md—Full test coverage report
  5. tree_notes.md—Design decisions and development notes
  6. tree_utils.py—Utility functions module
  7. test_tree_conversations.py—Test suite

🎯 Key Features

  • Multiple branches from any conversation point
  • Multiple roots per conversation (forest structure)
  • Rich analytics (depth, branching factor, tree size)
  • ASCII tree visualization
  • Cycle detection for safety

🚀 Ready for Integration

The implementation is production-ready with comprehensive testing. Next steps would be:

  1. Integrate tree_utils.py into LLM package
  2. Add CLI commands (llm branch, llm tree)
  3. Update Response.log() to accept parent_response_id

All code is tested, documented, and ready to use!
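Claude’s actual code is in the Gist linked below, but the core schema change it describes is tiny. llm implements migrations as plain functions decorated with @migration that receive a sqlite-utils Database, so a plausible shape for m022_parent_response_id (my sketch, not necessarily what Claude wrote) is:

from llm.migrations import migration

@migration
def m022_parent_response_id(db):
    # Nullable column: existing rows keep NULL and remain linear chains,
    # which is what makes the change backwards compatible
    db["responses"].add_column("parent_response_id", str)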

Did I mention I kicked off this experiment by entering prompts on my phone?

I told it to “Give me a zip file of everything you have done so far”—you can explore the contents of the file it made for me in this Gist.

Pelicans

I released llm-anthropic 0.19 adding support for the new model. Then I ran my classic pelican benchmark, first with thinking turned on:

llm -m claude-sonnet-4.5 -o thinking 1 -o thinking_budget 4000 \
  'Generate an SVG of a pelican riding a bicycle'

And then without:

llm -m claude-sonnet-4.5 \
  'Generate an SVG of a pelican riding a bicycle'

These are pretty good—they are recognizably pelicans!—though not quite as good as the output of GPT-5-Codex, which is better at drawing bicycles.

I also tried having it describe this image:

A whole mess of pelicans, described below.

llm -m claude-sonnet-4.5 'describe this image' -a https://static.simonwillison.net/static/2024/pelicans.jpg

This image shows a large gathering of pelicans crowded together along what appears to be a waterfront or shoreline. The birds are densely packed, creating a chaotic scene with dozens of brown pelicans visible. Their distinctive long bills and pouches are clearly visible among the mass of feathered bodies. In the background, there’s calm water and a dark tree line or forested area. The lighting suggests this was taken during early morning or late afternoon, giving the scene a golden, warm tone. The pelicans appear to be resting or congregating together, which is typical behavior for these colonial seabirds.

Anthropic are rolling this out everywhere

The release of this model has been very well coordinated. My embargo on talking about it was due to lift at 10am Pacific today, and I got an email from them linking to their blog post at 10am on the dot. It’s also already live on OpenRouter and in Cursor and GitHub Copilot and no doubt a whole bunch of other places as well.

Anthropic also shipped a new Claude Code VS Code extension today, plus a big upgrade to the Claude Code terminal app. Plus they rebranded their confusingly named Claude Code SDK to the Claude Agent SDK instead, emphasizing that it’s a tool for building agents beyond just customizing the existing Claude Code product. That’s available for both TypeScript and Python.
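I haven’t explored the renamed SDK yet. Based on the Python claude-code-sdk it replaces, a hello-world agent presumably looks something like this; treat the package name and the query() API as assumptions and check Anthropic’s documentation:

# Hypothetical minimal Claude Agent SDK usage; the package name and
# query() signature are assumptions based on the claude-code-sdk it
# replaces, so verify against Anthropic's docs
import anyio
from claude_agent_sdk import query

async def main():
    # query() runs an agent loop (including tool use) and streams back messages
    async for message in query(prompt="What files are in this directory?"):
        print(message)

anyio.run(main)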
