Simon Willison’s Weblog


Wednesday, 30th October 2024

Bringing developer choice to Copilot with Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1-preview. The big announcement from GitHub Universe: Copilot is growing support for alternative models.

GitHub Copilot predated the release of ChatGPT by more than a year, and was the first widely used LLM-powered developer tool. This announcement includes a brief history lesson:

The first public version of Copilot was launched using Codex, an early version of OpenAI GPT-3, specifically fine-tuned for coding tasks. Copilot Chat was launched in 2023 with GPT-3.5 and later GPT-4. Since then, we have updated the base model versions multiple times, using a range from GPT 3.5-turbo to GPT 4o and 4o-mini models for different latency and quality requirements.

It's increasingly clear that any strategy that ties you exclusively to one provider's models is short-sighted. The best available model for a task can change every few months, and for something like AI code assistance, model quality matters a lot. Getting stuck with a model that's no longer best in class could be a serious competitive disadvantage.

The other big announcement from the keynote was GitHub Spark, described like this:

Sparks are fully functional micro apps that can integrate AI features and external data sources without requiring any management of cloud resources.

I got to play with this at the event. It's effectively a cross between Claude Artifacts and GitHub Gists, with some very neat UI details. The features that really differentiate it from Artifacts are that Spark apps gain access to a server-side key/value store which they can use to persist JSON, and that they can also access an API against which they can execute their own prompts.

The prompt integration is particularly neat because prompts used by the Spark apps are extracted into a separate UI so users can view and modify them without having to dig into the (editable) React JavaScript code.
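To make that concrete, here's a purely hypothetical sketch of what a Spark micro app might do with those two capabilities. None of these names come from GitHub's documentation - `kv` and `runPrompt` just stand in for whatever the real key/value store and prompt APIs look like:

```typescript
// Hypothetical sketch only - these APIs are stand-ins, not GitHub Spark's real interface.
// Assume `kv` is the server-side key/value store (persisting JSON) and
// `runPrompt` executes one of the app's prompts against the model API.
declare const kv: {
  get<T>(key: string): Promise<T | null>;
  set<T>(key: string, value: T): Promise<void>;
};
declare function runPrompt(prompt: string): Promise<string>;

// A Spark-style micro app action: run a prompt, then persist the result as JSON.
async function summariseAndStore(note: string): Promise<string[]> {
  const summary = await runPrompt(`Summarise this note in one sentence: ${note}`);
  const existing = (await kv.get<string[]>("summaries")) ?? [];
  const updated = [...existing, summary];
  await kv.set("summaries", updated);
  return updated;
}
```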

# 1:23 am / github, javascript, ai, react, openai, github-copilot, llms, ai-assisted-programming, anthropic, gemini, claude-artifacts

W̶e̶e̶k̶n̶o̶t̶e̶s̶ Monthnotes for October

I try to publish weeknotes at least once every two weeks. It’s been four weeks since the last entry, so I guess this one counts as monthnotes instead.

[... 797 words]

docs.jina.ai—the Jina meta-prompt. From Jina AI on Twitter:

curl docs.jina.ai - This is our Meta-Prompt. It allows LLMs to understand our Reader, Embeddings, Reranker, and Classifier APIs for improved codegen. Using the meta-prompt is straightforward. Just copy the prompt into your preferred LLM interface like ChatGPT, Claude, or whatever works for you, add your instructions, and you're set.

The page is served using content negotiation. If you hit it with curl, you get plain text, but a browser with text/html in the accept: header gets an explanation along with a convenient copy-to-clipboard button.
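Here's a minimal sketch of exercising that negotiation from code, assuming the server keys off the Accept header exactly as described:

```typescript
// Fetch docs.jina.ai twice to see the content negotiation in action.
const url = "https://docs.jina.ai";

// Without text/html in Accept you get the plain-text meta-prompt, like curl does.
const plain = await fetch(url, { headers: { Accept: "text/plain" } });
console.log((await plain.text()).slice(0, 200));

// With text/html in Accept you get the human-facing explanation page instead.
const html = await fetch(url, { headers: { Accept: "text/html" } });
console.log(html.headers.get("content-type"));
```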

Screenshot of an API documentation page for Jina AI with warning message, access instructions, and code sample. Contains text: Note: This content is specifically designed for LLMs and not intended for human reading. For human-readable content, please visit Jina AI. For LLMs/programmatic access, you can fetch this content directly: curl docs.jina.ai/v2 # or wget docs.jina.ai/v2 # or fetch docs.jina.ai/v2 You only see this as a HTML when you access docs.jina.ai via browser. If you access it via code/program, you will get a text/plain response as below. You are an AI engineer designed to help users use Jina AI Search Foundation API's for their specific use case. # Core principles...

# 5:07 pm / documentation, ai, generative-ai, llms, llm, jina

Creating a LLM-as-a-Judge that drives business results (via) Hamel Husain's sequel to Your AI product needs evals. This is packed with hard-won actionable advice.

Hamel warns against using scores on a 1-5 scale, instead promoting an alternative he calls "Critique Shadowing". Find a domain expert (one is better than many, because you want to keep their scores consistent) and have them answer the yes/no question "Did the AI achieve the desired outcome?" - providing a critique explaining their reasoning for each of their answers.

This gives you a reliable score to optimize against, and the critiques mean you can capture nuance and improve the system based on that captured knowledge.

Most importantly, the critique should be detailed enough so that you can use it in a few-shot prompt for an LLM judge. In other words, it should be detailed enough that a new employee could understand it.

Once you've gathered this expert data, you can switch to using an LLM-as-a-judge. You can then iterate on the prompt you use for it in order to converge its "opinions" with those of your domain expert.
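Here's a minimal sketch of that few-shot judge step. The data shape and prompt wording are my own illustration rather than anything from Hamel's post, and `callLLM()` is a placeholder for whatever model API you use:

```typescript
// One expert-labelled example: the yes/no outcome plus the expert's critique.
interface ExpertExample {
  input: string;            // what the user asked the AI product
  output: string;           // what the AI product produced
  achievedOutcome: boolean; // the expert's answer to "Did the AI achieve the desired outcome?"
  critique: string;         // reasoning detailed enough for a new employee to follow
}

// Build a few-shot judge prompt from the expert examples plus one new case to judge.
function buildJudgePrompt(
  examples: ExpertExample[],
  candidateInput: string,
  candidateOutput: string
): string {
  const shots = examples
    .map(
      (ex) =>
        `Input: ${ex.input}\nOutput: ${ex.output}\n` +
        `Did the AI achieve the desired outcome? ${ex.achievedOutcome ? "Yes" : "No"}\n` +
        `Critique: ${ex.critique}`
    )
    .join("\n\n");

  return (
    "You are judging whether an AI assistant achieved the desired outcome.\n" +
    "Answer Yes or No, then give a critique explaining your reasoning.\n\n" +
    `${shots}\n\n` +
    `Input: ${candidateInput}\nOutput: ${candidateOutput}\n` +
    "Did the AI achieve the desired outcome?"
  );
}

// Usage, with callLLM standing in for your model API:
// const verdict = await callLLM(buildJudgePrompt(expertExamples, newInput, newOutput));
```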

Hamel concludes:

The real value of this process is looking at your data and doing careful analysis. Even though an AI judge can be a helpful tool, going through this process is what drives results. I would go as far as saying that creating a LLM judge is a nice “hack” I use to trick people into carefully looking at their data!

# 6:08 pm / ai, generative-ai, llms, hamel-husain, evals