18th April 2025
To me, a successful eval meets the following criteria. Say, we currently have system A, and we might tweak it to get a system B:
- If A works significantly better than B according to a skilled human judge, the eval should give A a significantly higher score than B.
- If A and B have similar performance, their eval scores should be similar.
Whenever a pair of systems A and B contradicts these criteria, that is a sign the eval is in “error” and we should tweak it to make it rank A and B correctly.
Recent articles
- Claude Opus 4.8: "a modest but tangible improvement" - 28th May 2026
- I think Anthropic and OpenAI have found product-market fit - 27th May 2026
- Notes on Pope Leo XIV's encyclical on AI - 25th May 2026