Pydantic Evals (via) Brand new package from David Montague and the Pydantic AI team which directly tackles what I consider to be the single hardest problem in AI engineering: building evals to determine if your LLM-based system is working correctly and getting better over time.
The package is described as "in beta" and comes with this very realistic warning:
> Unlike unit tests, evals are an emerging art/science; anyone who claims to know for sure exactly how your evals should be defined can safely be ignored.
This code example from their documentation illustrates the relationship between the two key nouns - Cases and Datasets:
```python
from pydantic_evals import Case, Dataset

case1 = Case(
    name="simple_case",
    inputs="What is the capital of France?",
    expected_output="Paris",
    metadata={"difficulty": "easy"},
)

dataset = Dataset(cases=[case1])
```
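To actually run an evaluation you pass a task function to the dataset. Here's a minimal sketch based on my reading of the documentation - `guess_city` is my own stand-in task, and I believe `evaluate_sync()` and `report.print()` work roughly like this:

```python
# Stand-in task function - in a real project this would call your LLM system
async def guess_city(question: str) -> str:
    return "Paris"

# Run every case in the dataset against the task and print a summary table
report = dataset.evaluate_sync(guess_city)
report.print(include_input=True, include_output=True)
```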
The library also supports custom evaluators, including LLM-as-a-judge:
```python
Case(
    name="vegetarian_recipe",
    inputs=CustomerOrder(
        dish_name="Spaghetti Bolognese",
        dietary_restriction="vegetarian",
    ),
    expected_output=None,
    metadata={"focus": "vegetarian"},
    evaluators=(
        LLMJudge(
            rubric="Recipe should not contain meat or animal products",
        ),
    ),
)
```
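The documentation also shows how to write your own evaluator by subclassing `Evaluator`. Here's my sketch of that pattern - `MatchAnswer` and its scoring rule are hypothetical, only `Evaluator` and `EvaluatorContext` come from the library:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class MatchAnswer(Evaluator[str, str]):
    """Hypothetical evaluator: full marks for an exact match, zero otherwise."""

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        return 1.0 if ctx.output == ctx.expected_output else 0.0
```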
Cases and datasets can also be serialized to YAML.
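I haven't tried that yet, but based on the documentation I believe the round-trip looks something like this (the filename and the `dict` metadata type are my own choices):

```python
# Write the dataset from the earlier example out as YAML...
dataset.to_file("capitals_dataset.yaml")

# ...and load it back, parameterized by input, output and metadata types
loaded = Dataset[str, str, dict].from_file("capitals_dataset.yaml")
```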
My first impressions are that this looks like a solid implementation of a sensible design. I'm looking forward to trying it out against a real project.