As we've noted many times since March, these benchmarks aren't necessarily scientifically sound and don't convey the subjective experience of interacting with AI language models. [...] We've instead found that measuring the subjective experience of using a conversational AI model (through what might be called "vibemarking") on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs.
Recent articles
- Video + notes on upgrading a Datasette plugin for the latest 1.0 alpha, with help from uv and OpenAI Codex CLI - 6th November 2025
- Code research projects with async coding agents like Claude Code and Codex - 6th November 2025
- A new SQL-powered permissions system in Datasette 1.0a20 - 4th November 2025