23rd July 2024
As we've noted many times since March, these benchmarks aren't necessarily scientifically sound and don't convey the subjective experience of interacting with AI language models. [...] We've instead found that measuring the subjective experience of using a conversational AI model (through what might be called "vibemarking") on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs.
Recent articles
- Claude Fable is relentlessly proactive - 11th June 2026
- Initial impressions of Claude Fable 5 - 9th June 2026
- Running Python code in a sandbox with MicroPython and WASM - 6th June 2026