To me, a successful eval meets the following criteria. Say we currently have system A, and we tweak it to get system B:
- If A works significantly better than B according to a skilled human judge, the eval should give A a significantly higher score than B.
- If A and B have similar performance, their eval scores should be similar.
Whenever a pair of systems A and B contradicts these criteria, that is a sign the eval is in “error” and we should tweak it to make it rank A and B correctly.
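As a rough illustration, the criteria above can be turned into a check that flags pairs where the eval and the human judge disagree. This is only a sketch: the `Comparison` record, the verdict labels (`"A"`, `"B"`, `"similar"`), the example names and scores, and the `margin` used to decide what counts as a "significantly higher" score are all assumptions made for the example, not part of any particular eval framework.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    """One pair of systems, judged by a human and scored by the eval (hypothetical record)."""
    name: str
    human_verdict: str  # "A", "B", or "similar" (assumed labels)
    score_a: float
    score_b: float


def eval_disagrees(c: Comparison, margin: float = 0.05) -> bool:
    """Return True when the eval scores contradict the human verdict.

    `margin` is an assumed threshold for a "significant" score gap;
    choose it to match the noise level of your own eval.
    """
    diff = c.score_a - c.score_b
    if c.human_verdict == "A":
        # Human says A is clearly better: the eval should favor A by at least the margin.
        return diff < margin
    if c.human_verdict == "B":
        # Human says B is clearly better: the eval should favor B by at least the margin.
        return -diff < margin
    if c.human_verdict == "similar":
        # Human says they perform alike: the scores should stay within the margin.
        return abs(diff) >= margin
    raise ValueError(f"unknown verdict: {c.human_verdict!r}")


def find_eval_errors(comparisons: list[Comparison], margin: float = 0.05) -> list[str]:
    """Names of the pairs where the eval is in "error" and needs tweaking."""
    return [c.name for c in comparisons if eval_disagrees(c, margin)]


# Made-up pairs, purely for illustration.
comparisons = [
    Comparison("prompt-tweak", "A", score_a=0.80, score_b=0.60),         # eval agrees
    Comparison("retriever-swap", "similar", score_a=0.70, score_b=0.90),  # eval sees a gap the human does not
    Comparison("model-upgrade", "B", score_a=0.75, score_b=0.74),         # eval misses a real improvement
]

errors = find_eval_errors(comparisons)
print(errors)
```

Each flagged pair is a concrete case to study when adjusting the eval. Once it ranks those pairs correctly, they can be kept as regression cases for the eval itself.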