Simon Willison’s Weblog

Subscribe
Atom feed for hamel-husain

3 items tagged “hamel-husain”

2024

Creating a LLM-as-a-Judge that drives business results (via) Hamel Husain's sequel to Your AI product needs evals. This is packed with hard-won actionable advice.

Hamel warns against using scores on a 1-5 scale, instead promoting an alternative he calls "Critique Shadowing". Find a domain expert (one is better than many, because you want to keep their scores consistent) and have them answer the yes/no question "Did the AI achieve the desired outcome?" - providing a critique explaining their reasoning for each of their answers.

This gives you a reliable score to optimize against, and the critiques mean you can capture nuance and improve the system based on that captured knowledge.

Most importantly, the critique should be detailed enough so that you can use it in a few-shot prompt for a LLM judge. In other words, it should be detailed enough that a new employee could understand it.

Once you've gathered this expert data system you can switch to using an LLM-as-a-judge. You can then iterate on the prompt you use for it in order to converge its "opinions" with those of your domain expert.

Hamel concludes:

The real value of this process is looking at your data and doing careful analysis. Even though an AI judge can be a helpful tool, going through this process is what drives results. I would go as far as saying that creating a LLM judge is a nice “hack” I use to trick people into carefully looking at their data!

# 30th October 2024, 6:08 pm / ai, generative-ai, llms, hamel-husain, evals

LLM bullshit knife, to cut through bs

RAG ->              Provide relevant context
Agentic ->          Function calls that work
CoT ->              Prompt model to think/plan
FewShot ->          Add examples
PromptEng ->        Someone w/good written comm skills.
Prompt Optimizer -> For loop to find best examples.

Hamel Husain

# 7th June 2024, 6:02 pm / ai, generative-ai, llms, rag, hamel-husain

Your AI Product Needs Evals (via) Hamel Husain: “I’ve seen many successful and unsuccessful approaches to building LLM products. I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.”

I’ve been frustrated about this for a while: I know I need to move beyond “vibe checks” for the systems I have started to build on top of LLMs, but I was lacking a thorough guide about how to build automated (and manual) evals in a productive way.

Hamel has provided exactly the tutorial I was needing for this, with a really thorough example case-study.

Using GPT-4 to create test cases is an interesting approach: “Write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM. The contact details can include name, phone, email, partner name, birthday, tags, company, address and job.”

Also important: “... unlike traditional unit tests, you don’t necessarily need a 100% pass rate. Your pass rate is a product decision.”

Hamel’s guide then covers the importance of traces for evaluating real-world performance of your deployed application, plus the pros and cons of leaning on automated evaluation using LLMs themselves.

Plus some wisdom from a footnote: “A reasonable heuristic is to keep reading logs until you feel like you aren’t learning anything new.”

# 31st March 2024, 9:53 pm / testing, ai, generative-ai, llms, hamel-husain, evals