25th February 2025 - Link Blog
Aider Polyglot leaderboard results for Claude 3.7 Sonnet
Paul Gauthier's Aider Polyglot benchmark is one of my favourite independent benchmarks for LLMs, partly because it focuses on code and partly because Paul is very responsive at evaluating new models.
The brand new Claude 3.7 Sonnet just took the top place, when run with an increased 32,000 thinking token limit.
It's interesting comparing the benchmark costs - 3.7 Sonnet spent $36.83 running the whole thing, significantly more than the previously leading DeepSeek R1 + Claude 3.5 combo, but a whole lot less than third place o1-high:
| Model | % completed | Total cost |
|---|---|---|
| claude-3-7-sonnet-20250219 (32k thinking tokens) | 64.9% | $36.83 |
| DeepSeek R1 + claude-3-5-sonnet-20241022 | 64.0% | $13.29 |
| o1-2024-12-17 (high) | 61.7% | $186.50 |
| claude-3-7-sonnet-20250219 (no thinking) | 60.4% | $17.72 |
| o3-mini (high) | 60.4% | $18.16 |
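
To put numbers on that cost comparison, here is a minimal Python sketch that uses only the figures from the table above. The helper names (`cost_ratio`, `cost_per_point`) are my own for illustration and are not part of Aider or the benchmark.

```python
# Figures copied from the table above: model -> (% completed, total cost in USD).
RESULTS = {
    "claude-3-7-sonnet-20250219 (32k thinking tokens)": (64.9, 36.83),
    "DeepSeek R1 + claude-3-5-sonnet-20241022": (64.0, 13.29),
    "o1-2024-12-17 (high)": (61.7, 186.5),
    "claude-3-7-sonnet-20250219 (no thinking)": (60.4, 17.72),
    "o3-mini (high)": (60.4, 18.16),
}


def cost_ratio(model_a: str, model_b: str) -> float:
    """How many times more model_a cost than model_b for the full run."""
    return RESULTS[model_a][1] / RESULTS[model_b][1]


def cost_per_point(model: str) -> float:
    """Dollars spent per percentage point completed (a crude, illustrative metric)."""
    pct_completed, total_cost = RESULTS[model]
    return total_cost / pct_completed


if __name__ == "__main__":
    sonnet = "claude-3-7-sonnet-20250219 (32k thinking tokens)"
    combo = "DeepSeek R1 + claude-3-5-sonnet-20241022"
    o1 = "o1-2024-12-17 (high)"

    print(f"3.7 Sonnet vs R1 combo: {cost_ratio(sonnet, combo):.2f}x the cost")
    print(f"o1-high vs 3.7 Sonnet:  {cost_ratio(o1, sonnet):.2f}x the cost")
    for name in RESULTS:
        print(f"{name}: ${cost_per_point(name):.2f} per point")
```

By this rough measure, 3.7 Sonnet with thinking cost a little under three times as much as the DeepSeek R1 + Claude 3.5 combo, while o1-high cost about five times as much as 3.7 Sonnet.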
No results yet for Claude 3.7 Sonnet on the LM Arena leaderboard, which has recently been dominated by Gemini 2.0 and Grok 3.