Claude: Building evals and test cases. More documentation updates from Anthropic: this section on writing evals for Claude is new today and includes Python code examples for a number of different evaluation techniques.
Included are several examples of the LLM-as-judge pattern, plus an example using cosine similarity and another that uses the new-to-me Rouge Python library that implements the ROUGE metric for evaluating the quality of summarized text.
Recent articles
- Nano Banana Pro aka gemini-3-pro-image-preview is the best available image generation model - 20th November 2025
- How I automate my Substack newsletter with content from my blog - 19th November 2025
- Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark - 18th November 2025