Simon Willison on benchmarks

13 posts tagged “benchmarks”

2025

LLM SVG Generation Benchmark (via) Here's a delightful project by Tom Gally, inspired by my pelican SVG benchmark. He asked Claude to help create more prompts of the form Generate an SVG of [A] [doing] [B] and then ran 30 creative prompts against 9 frontier models - prompts like "an octopus operating a pipe organ" or "a starfish driving a bulldozer".

Here are some for "butterfly inspecting a steam engine":

Gemini 3.0 Pro Preview drew the best steam engine with nice gradients and a butterfly hovering near the chimney. DeepSeek V3.2-Exp drew a floating brown pill with a hint of a chimney and a butterfly possibly on fire. GLM-4.6 did the second best steam engine with a butterfly nearby. Qwen3-VL-235B-A22B-Thinking did a steam engine that looks a bit like a chests on wheels and a weird purple circle.

And for "sloth steering an excavator":

Claude Sonnet 4.5 drew the best excavator with a blobby sloth driving it. Claude Opus 4.5 did quite a blocky excavator with a sloth that isn't quite recognizable as a sloth. Grok Code Fast 1 drew a green alien standing on a set of grey blocks. Gemini 2.5 Pro did a good excavator with another blobby sloth.

It's worth browsing the whole collection, which gives a really good overall indication of which models are the best at SVG art.

# 25th November 2025, 4:02 am / benchmarks, svg, ai, generative-ai, llms, evals, pelican-riding-a-bicycle, tom-gally

python-build-standalone now has Python 3.14.0a5. Exciting news from Charlie Marsh:

We just shipped the latest Python 3.14 alpha (3.14.0a5) to uv and python-build-standalone. This is the first release that includes the tail-calling interpreter.

Our initial benchmarks show a ~20-30% performance improvement across CPython.

This is an optimization that was first discussed in faster-cpython in January 2024, then landed earlier this month by Ken Jin and included in the 3.14a05 release. The alpha release notes say:

A new type of interpreter based on tail calls has been added to CPython. For certain newer compilers, this interpreter provides significantly better performance. Preliminary numbers on our machines suggest anywhere from -3% to 30% faster Python code, and a geometric mean of 9-15% faster on pyperformance depending on platform and architecture. The baseline is Python 3.14 built with Clang 19 without this new interpreter.

This interpreter currently only works with Clang 19 and newer on x86-64 and AArch64 architectures. However, we expect that a future release of GCC will support this as well.

Including this in python-build-standalone means it's now trivial to try out via uv. I upgraded to the latest uv like this:

pip install -U uv

Then ran uv python list to see the available versions:

cpython-3.14.0a5+freethreaded-macos-aarch64-none    <download available>
cpython-3.14.0a5-macos-aarch64-none                 <download available>
cpython-3.13.2+freethreaded-macos-aarch64-none      <download available>
cpython-3.13.2-macos-aarch64-none                   <download available>
cpython-3.13.1-macos-aarch64-none                   /opt/homebrew/opt/python@3.13/bin/python3.13 -> ../Frameworks/Python.framework/Versions/3.13/bin/python3.13

I downloaded the new alpha like this:

uv python install cpython-3.14.0a5

And tried it out like so:

uv run --python 3.14.0a5 python

The Astral team have been using Ken's bm_pystones.py benchmarks script. I grabbed a copy like this:

wget 'https://gist.githubusercontent.com/Fidget-Spinner/e7bf204bf605680b0fc1540fe3777acf/raw/fa85c0f3464021a683245f075505860db5e8ba6b/bm_pystones.py'

And ran it with uv:

uv run --python 3.14.0a5 bm_pystones.py

Giving:

Pystone(1.1) time for 50000 passes = 0.0511138
This machine benchmarks at 978209 pystones/second

Inspired by Charlie's example I decided to try the hyperfine benchmarking tool, which can run multiple commands to statistically compare their performance. I came up with this recipe:

brew install hyperfine
hyperfine \                            
  "uv run --python 3.14.0a5 bm_pystones.py" \
  "uv run --python 3.13 bm_pystones.py" \
  -n tail-calling \
  -n baseline \
  --warmup 10

Running that command produced: Benchmark 1: tail-calling Time (mean ± σ): 71.5 ms ± 0.9 ms [User: 65.3 ms, System: 5.0 ms] Range (min … max): 69.7 ms … 73.1 ms 40 runs Benchmark 2: baseline Time (mean ± σ): 79.7 ms ± 0.9 ms [User: 73.9 ms, System: 4.5 ms] Range (min … max): 78.5 ms … 82.3 ms 36 runs Summary tail-calling ran 1.12 ± 0.02 times faster than baseline

So 3.14.0a5 scored 1.12 times faster than 3.13 on the benchmark (on my extremely overloaded M2 MacBook Pro).

# 13th February 2025, 6:25 am / benchmarks, python, uv, astral

2024

Speedometer 3.0: The Best Way Yet to Measure Browser Performance. The new browser performance testing suite, released as a collaboration between Blink, Gecko, and WebKit. It’s fun to run this in your browser and watch it rattle through 580 tests written using a wide variety of modern JavaScript frameworks and visualization libraries.

# 12th March 2024, 4:26 am / benchmarks, javascript, web-performance

2010

The Web Server Benchmarking We Need. Ian Bicking asks for a WSGI benchmark which emphasises error handling over raw performance—can the server keep serving requests if some of them are CPU bound, I/O bound, wedged or cause a segfault?

# 17th March 2010, 10:05 am / benchmarks, ian-bicking, python, wsgi

Dojo 1.4.1 vs jQuery 1.4.2pre on Taskspeed. John Resig’s reponse. When JavaScript libraries compete on performance, everybody wins.

# 29th January 2010, 2:19 pm / benchmarks, dojo, javascript, john-resig, jquery, performance

Dojo: Still Twice As Fast When It Matters Most. Alex Russell shows how Dojo out-performs jQuery on the TaskSpeed benchmark, which attempts to represent common tasks in real-world applications and has had code that have been optimised by the development teams behind each of the libraries.

# 28th January 2010, 10:40 pm / alex-russell, benchmarks, dojo, javascript, jquery, performance, taskspeed

2009

Socket Benchmark of Asynchronous Servers in Python. A comparison of eight different asynchronous networking frameworks in Python. Tornado comes out on top in most of the benchmarks, but the post is most interesting for the direct comparison of simple code examples for each of the frameworks.

# 22nd December 2009, 10:34 pm / async, benchmarks, dieselweb, eventio, eventlet, gevent, orbited, python, stackless, tornado, twisted

JSLitmus. “A lightweight tool for creating ad-hoc JavaScript benchmark tests”. Includes an ingenious hack for graphing the results—it generates a Google Chart, then provides a TinyURL for viewing that chart in the future. The TinyURL is generated by pointing an inconspicuous iframe at the TinyURL API and letting the user copy-and-paste the resulting shortened URL directly out of the iframe.

# 28th October 2009, 5:11 pm / benchmarks, google-charts, iframes, javascript, jslitmus, tinyurl

2008

Dromaeo: JavaScript Performance Testing (via) This is one classy benchmark. Run it in as many browsers as you like (each run is saved to the server and assigned a run ID), then compare the results by appending ?id=[run1],[run2]... to the URL.

# 11th September 2008, 4:06 pm / benchmarks, dromaeo, javascript, john-resig, performance

20,000 Reasons Why Comet Scales. Greg Wilkins coaxes Jetty and Bayeux in to supporting 20,000 simultaneous users per server while maintaining sub-second latency, using Amazon EC2 to run the benchmark.

# 7th January 2008, 8:32 am / bayeux, benchmarks, comet, ec2, greg-wilkins, java, javascript, jetty, performance

2007

PostgreSQL 8.3 vs. 8.2—a simple benchmark. Stefan Kaltenbrunner reports a 2.2x speed increase for PostgreSQL 8.3 compared to 8.2 for a relatively simple benchmark.

# 12th December 2007, 12:42 am / benchmarks, postgresql, stefankaltenbrunner

ErlyWeb vs. Ruby on Rails EC2 Performance Showdown. ErlyWeb’s peak response rate beats Rails by 47x, albeit with a hugely simplified benchmark. More interesting than the results is the idea of using EC2 for benchmarking on identical simulated hardware.

# 10th December 2007, 3:27 pm / amazon, benchmarks, ec2, erlang, erlyweb, performance, rails, virtualisation, yarivsadan

Some Notes on Tim Bray’s Wide Finder Benchmark. Fredrik Lundh demonstrates some Python ninja techniques for parsing log files using multiple cores (and eventually memory mapping).

# 7th October 2007, 1:06 am / benchmarks, effbot, fredrik-lundh, mmap, multicore, python, tim-bray

Simon Willison’s Weblog