Simon Willison on llm

599 posts tagged “llm”

LLM is my command-line tool for running prompts against Large Language Models.

2025

Release llm-gemini 0.19.1 — LLM plugin to access Google's Gemini family of models

8th May 2025, 5:42 am · llm, gemini

Release llm-mistral 0.12 — LLM plugin providing access to Mistral models using the Mistral API

7th May 2025, 9:27 pm · mistral, llm

Medium is the new large. New model release from Mistral - this time closed source/proprietary. Mistral Medium claims strong benchmark scores similar to GPT-4o and Claude 3.7 Sonnet, but is priced at $0.40/million input and $2/million output - about the same price as GPT 4.1 Mini. For comparison, GPT-4o is $2.50/$10 and Claude 3.7 Sonnet is $3/$15.

The model is a vision LLM, accepting both images and text.

More interesting than the price is the deployment model. Mistral Medium may not be open weights but it is very much available for self-hosting:

Mistral Medium 3 can also be deployed on any cloud, including self-hosted environments of four GPUs and above.

Mistral's other announcement today is Le Chat Enterprise. This is a suite of tools that can integrate with your company's internal data and provide "agents" (these look similar to Claude Projects or OpenAI GPTs), again with the option to self-host.

Is there a new open weights model coming soon? This note tucked away at the bottom of the Mistral Medium 3 announcement seems to hint at that:

With the launches of Mistral Small in March and Mistral Medium today, it's no secret that we're working on something 'large' over the next few weeks. With even our medium-sized model being resoundingly better than flagship open source models such as Llama 4 Maverick, we're excited to 'open' up what's to come :)

I released llm-mistral 0.12 adding support for the new model.

# 7th May 2025, 9:14 pm / llm-release, mistral, generative-ai, ai, llms, llm-pricing, llm, vision-llms

Saying “hi” to Microsoft’s Phi-4-reasoning

Microsoft released a new sub-family of models a few days ago: Phi-4 reasoning. They introduced them in this blog post celebrating a year since the release of Phi-3:

[... 1,498 words]

6:25 pm / 6th May 2025 / ollama, phi, microsoft, local-llms, llm-release, llm, generative-ai, llm-reasoning, qwen, llms, ai-in-china

Release llm-gemini 0.19 — LLM plugin to access Google's Gemini family of models

6th May 2025, 4:25 pm · gemini, llm

Feed a video to a vision LLM as a sequence of JPEG frames on the CLI (also LLM 0.25)

The new llm-video-frames plugin can turn a video file into a sequence of JPEG frames and feed them directly into a long context vision LLM such as GPT-4.1, even when that LLM doesn’t directly support video input. It depends on a plugin feature I added to LLM 0.25, which I released last night.

[... 1,600 words]

5:38 pm / 5th May 2025 / vision-llms, llm, plugins, ai, llms, generative-ai, projects, ffmpeg, ai-assisted-programming, cli

Release llm-hacker-news 0.1.1 — LLM plugin for pulling content from Hacker News

5th May 2025, 5:10 pm · llm, hacker-news

Release llm-docs 0.2.1 — LLM plugin for asking questions of LLM's own documentation, and related packages

5th May 2025, 5:04 pm · llm

Release llm-fragments-github 0.3 — Load GitHub repository contents as LLM fragments

5th May 2025, 5:16 am · llm

Release llm-video-frames 0.1 — LLM fragment plugin to turn a video into images of different frames

5th May 2025, 3:39 am · llm

Release llm 0.25 — Access large language models from the command-line

5th May 2025, 3:30 am · llm

Release llm-echo 0.2 — Debug plugin for LLM providing an echo model

4th May 2025, 4:15 pm · llm

Having tried a few of the Qwen 3 models now my favorite is a bit of a surprise to me: I'm really enjoying Qwen3-8B.

I've been running prompts through the MLX 4bit quantized version, mlx-community/Qwen3-8B-4bit. I'm using llm-mlx like this:

llm install llm-mlx
llm mlx download-model mlx-community/Qwen3-8B-4bit

This pulls 4.3GB of data and saves it to ~/.cache/huggingface/hub/models--mlx-community--Qwen3-8B-4bit.

I assigned it a default alias:

llm aliases set q3 mlx-community/Qwen3-8B-4bit

I also added a default option for that model - this saves me from adding -o unlimited 1 to every prompt which disables the default output token limit:

llm models options set q3 unlimited 1

And now I can run prompts:

llm -m q3 'brainstorm questions I can ask my friend who I think is secretly from Atlantis that will not tip her off to my suspicions'

Qwen3 is a "reasoning" model, so it starts each prompt with a <think> block containing its chain of thought. Reading these is always really fun. Here's the full response I got for the above question.

I'm finding Qwen3-8B to be surprisingly capable for useful things too. It can summarize short articles. It can write simple SQL queries given a question and a schema. It can figure out what a simple web app does by reading the HTML and JavaScript. It can write Python code to meet a paragraph long spec - for that one it "reasoned" for an unreasonably long time but it did eventually get to a useful answer.

All this while consuming between 4 and 5GB of memory, depending on the length of the prompt.

I think it's pretty extraordinary that a few GBs of floating point numbers can usefully achieve these various tasks, especially using so little memory that it's not an imposition on the rest of the things I want to run on my laptop at the same time.

# 2nd May 2025, 11:41 pm / llm, models, qwen, mlx, generative-ai, ai, local-llms, llm-reasoning, ai-in-china

Qwen 3 offers a case study in how to effectively release a model

Alibaba’s Qwen team released the hotly anticipated Qwen 3 model family today. The Qwen models are already some of the best open weight models—Apache 2.0 licensed and with a variety of different capabilities (including vision and audio input/output).

[... 1,462 words]

12:37 am / 29th April 2025 / llm, mlx, ai, qwen, llms, ollama, llm-release, generative-ai, llm-reasoning, pelican-riding-a-bicycle, local-llms, model-context-protocol, llm-tool-use, ai-in-china

Release llm-openrouter 0.4.1 — LLM plugin for models hosted by OpenRouter

23rd Apr 2025, 9:37 pm · llm

Diane, I wrote a lecture by talking about it. Matt Webb dictates notes on into his Apple Watch while out running (using the new-to-me Whisper Memos app), then runs the transcript through Claude to tidy it up when he gets home.

His Claude 3.7 Sonnet prompt for this is:

you are Diane, my secretary. please take this raw verbal transcript and clean it up. do not add any of your own material. because you are Diane, also follow any instructions addressed to you in the transcript and perform those instructions

(Diane is a Twin Peaks reference.)

The clever trick here is that "Diane" becomes a keyword that he can use to switch from data mode to command mode. He can say "Diane I meant to include that point in the last section. Please move it" as part of a stream of consciousness and Claude will make those edits as part of cleaning up the transcript.

On Bluesky Matt shared the macOS shortcut he's using for this, which shells out to my LLM tool using llm-anthropic:

# 23rd April 2025, 7:58 pm / matt-webb, prompt-engineering, llm, claude, generative-ai, ai, llms, text-to-speech

Release llm-mlx 0.4 — Support for MLX models in LLM

23rd Apr 2025, 5:49 pm · llm

Release llm-sentence-transformers 0.3.2 — LLM plugin for embeddings using sentence-transformers

23rd Apr 2025, 5:09 pm · llm

llm-fragment-symbex. I released a new LLM fragment loader plugin that builds on top of my Symbex project.

Symbex is a CLI tool I wrote that can run against a folder full of Python code and output functions, classes, methods or just their docstrings and signatures, using the Python AST module to parse the code.

llm-fragments-symbex brings that ability directly to LLM. It lets you do things like this:

llm install llm-fragments-symbex
llm -f symbex:path/to/project -s 'Describe this codebase'

I just ran that against my LLM project itself like this:

cd llm
llm -f symbex:. -s 'guess what this code does'

Here's the full output, which starts like this:

This code listing appears to be an index or dump of Python functions, classes, and methods primarily belonging to a codebase related to large language models (LLMs). It covers a broad functionality set related to managing LLMs, embeddings, templates, plugins, logging, and command-line interface (CLI) utilities for interaction with language models. [...]

That page also shows the input generated by the fragment - here's a representative extract:

# from llm.cli import resolve_attachment
def resolve_attachment(value):
    """Resolve an attachment from a string value which could be:
    - "-" for stdin
    - A URL
    - A file path

    Returns an Attachment object.
    Raises AttachmentError if the attachment cannot be resolved."""

# from llm.cli import AttachmentType
class AttachmentType:

    def convert(self, value, param, ctx):

# from llm.cli import resolve_attachment_with_type
def resolve_attachment_with_type(value: str, mimetype: str) -> Attachment:

If your Python code has good docstrings and type annotations, this should hopefully be a shortcut for providing full API documentation to a model without needing to dump in the entire codebase.

The above example used 13,471 input tokens and 781 output tokens, using openai/gpt-4.1-mini. That model is extremely cheap, so the total cost was 0.6638 cents - less than a cent.

The plugin itself was mostly written by o4-mini using the llm-fragments-github plugin to load the simonw/symbex and simonw/llm-hacker-news repositories as example code:

llm \
  -f github:simonw/symbex \
  -f github:simonw/llm-hacker-news \
  -s "Write a new plugin as a single llm_fragments_symbex.py file which
   provides a custom loader which can be used like this:
   llm -f symbex:path/to/folder - it then loads in all of the python
   function signatures with their docstrings from that folder using
   the same trick that symbex uses, effectively the same as running
   symbex . '*' '*.*' --docs --imports -n" \
   -m openai/o4-mini -o reasoning_effort high

Here's the response. 27,819 input, 2,918 output = 4.344 cents.

In working on this project I identified and fixed a minor cosmetic defect in Symbex itself. Technically this is a breaking change (it changes the output) so I shipped that as Symbex 2.0.

# 23rd April 2025, 2:25 pm / symbex, llm, ai-assisted-programming, generative-ai, projects, ai, llms, cli

Release llm-fragments-symbex 0.1 — LLM fragment loader for Python symbols

23rd Apr 2025, 5:43 am · llm

Release llm-logging-debug 0.1 — Set logging.DEBUG while running LLM

23rd Apr 2025, 4 am · llm

Release llm-fragments-github 0.2.1 — Load GitHub repository contents as LLM fragments

20th Apr 2025, 3 pm · llm

llm-fragments-github 0.2. I upgraded my llm-fragments-github plugin to add a new fragment type called issue. It lets you pull the entire content of a GitHub issue thread into your prompt as a concatenated Markdown file.

(If you haven't seen fragments before I introduced them in Long context support in LLM 0.24 using fragments and template plugins.)

I used it just now to have Gemini 2.5 Pro provide feedback and attempt an implementation of a complex issue against my LLM project:

llm install llm-fragments-github
llm -f github:simonw/llm \
  -f issue:simonw/llm/938 \
  -m gemini-2.5-pro-exp-03-25 \
  --system 'muse on this issue, then propose a whole bunch of code to help implement it'

Here I'm loading the FULL content of the simonw/llm repo using that -f github:simonw/llm fragment (documented here), then loading all of the comments from issue 938 where I discuss quite a complex potential refactoring. I ask Gemini 2.5 Pro to "muse on this issue" and come up with some code.

This worked shockingly well. Here's the full response, which highlighted a few things I hadn't considered yet (such as the need to migrate old database records to the new tree hierarchy) and then spat out a whole bunch of code which looks like a solid start to the actual implementation work I need to do.

I ran this against Google's free Gemini 2.5 Preview, but if I'd used the paid model it would have cost me 202,680 input tokens, 10,460 output tokens and 1,859 thinking tokens for a total of 62.989 cents.

As a fun extra, the new issue: feature itself was written almost entirely by OpenAI o3, again using fragments. I ran this:

llm -m openai/o3 \
  -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
  -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
  -s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue
      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'

Here I'm using the ability to pass a URL to -f and giving it the full source of my llm_hacker_news.py plugin (which shows how a fragment can load data from an API) plus the HTML source of my github-issue-to-markdown tool (which I wrote a few months ago with Claude). I effectively asked o3 to take that HTML/JavaScript tool and port it to Python to work with my fragments plugin mechanism.

o3 provided almost the exact implementation I needed, and even included support for a GITHUB_TOKEN environment variable without me thinking to ask for it. Total cost: 19.928 cents.

On a final note of curiosity I tried running this prompt against Gemma 3 27B QAT running on my Mac via MLX and llm-mlx:

llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

llm -m mlx-community/gemma-3-27b-it-qat-4bit \
  -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
  -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
  -s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue
      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'

That worked pretty well too. It turns out a 16GB local model file is powerful enough to write me an LLM plugin now!

# 20th April 2025, 2:01 pm / gemini, llm, ai-assisted-programming, generative-ai, o3, ai, llms, plugins, github, mlx, gemma, long-context, local-llms

Release llm-fragments-github 0.2 — Load GitHub repository contents as LLM fragments

20th Apr 2025, 5:52 am · llm

Release llm-echo 0.1 — Debug plugin for LLM providing an echo model

20th Apr 2025, 5:25 am · llm

Maybe Meta’s Llama claims to be open source because of the EU AI act

I encountered a theory a while ago that one of the reasons Meta insist on using the term “open source” for their Llama models despite the Llama license not actually conforming to the terms of the Open Source Definition is that the EU’s AI act includes special rules for open source models without requiring OSI compliance.

[... 852 words]

11:58 pm / 19th April 2025 / meta, ai-ethics, open-source, generative-ai, llama, ai, llms, openrouter, long-context, gemini, llm, law

Gemma 3 QAT Models. Interesting release from Google, as a follow-up to Gemma 3 from last month:

To make Gemma 3 even more accessible, we are announcing new versions optimized with Quantization-Aware Training (QAT) that dramatically reduces memory requirements while maintaining high quality. This enables you to run powerful models like Gemma 3 27B locally on consumer-grade GPUs like the NVIDIA RTX 3090.

I wasn't previously aware of Quantization-Aware Training but it turns out to be quite an established pattern now, supported in both Tensorflow and PyTorch.

Google report model size drops from BF16 to int4 for the following models:

Gemma 3 27B: 54GB to 14.1GB
Gemma 3 12B: 24GB to 6.6GB
Gemma 3 4B: 8GB to 2.6GB
Gemma 3 1B: 2GB to 0.5GB

They partnered with Ollama, LM Studio, MLX (here's their collection) and llama.cpp for this release - I'd love to see more AI labs following their example.

The Ollama model version picker currently hides them behind "View all" option, so here are the direct links:

I fetched that largest model with:

ollama pull gemma3:27b-it-qat

And now I'm trying it out with llm-ollama:

llm -m gemma3:27b-it-qat "impress me with some physics"

I got a pretty great response!

Update: Having spent a while putting it through its paces via Open WebUI and Tailscale to access my laptop from my phone I think this may be my new favorite general-purpose local model. Ollama appears to use 22GB of RAM while the model is running, which leaves plenty on my 64GB machine for other applications.

I've also tried it via llm-mlx like this (downloading 16GB):

llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
llm chat -m mlx-community/gemma-3-27b-it-qat-4bit

It feels a little faster with MLX and uses 15GB of memory according to Activity Monitor.

# 19th April 2025, 5:20 pm / lm-studio, llm, ai, ollama, llms, gemma, llm-release, google, generative-ai, tailscale, mlx, local-llms

Release llm-gemini 0.18.1 — LLM plugin to access Google's Gemini family of models

18th Apr 2025, 12:08 am · gemini, llm

Start building with Gemini 2.5 Flash (via) Google Gemini's latest model is Gemini 2.5 Flash, available in (paid) preview as gemini-2.5-flash-preview-04-17.

Building upon the popular foundation of 2.0 Flash, this new version delivers a major upgrade in reasoning capabilities, while still prioritizing speed and cost. Gemini 2.5 Flash is our first fully hybrid reasoning model, giving developers the ability to turn thinking on or off. The model also allows developers to set thinking budgets to find the right tradeoff between quality, cost, and latency.

Gemini AI Studio product lead Logan Kilpatrick says:

This is an early version of 2.5 Flash, but it already shows huge gains over 2.0 Flash.

You can fully turn off thinking if needed and use this model as a drop in replacement for 2.0 Flash.

I added support to the new model in llm-gemini 0.18. Here's how to try it out:

llm install -U llm-gemini
llm -m gemini-2.5-flash-preview-04-17 'Generate an SVG of a pelican riding a bicycle'

Here's that first pelican, using the default setting where Gemini Flash 2.5 makes its own decision in terms of how much "thinking" effort to apply:

Described below

Here's the transcript. This one used 11 input tokens, 4,266 output tokens and 2,702 "thinking" tokens.

I asked the model to "describe" that image and it could tell it was meant to be a pelican:

A simple illustration on a white background shows a stylized pelican riding a bicycle. The pelican is predominantly grey with a black eye and a prominent pink beak pouch. It is positioned on a black line-drawn bicycle with two wheels, a frame, handlebars, and pedals.

The way the model is priced is a little complicated. If you have thinking enabled, you get charged $0.15/million tokens for input and $3.50/million for output. With thinking disabled those output tokens drop to $0.60/million. I've added these to my pricing calculator.

For comparison, Gemini 2.0 Flash is $0.10/million input and $0.40/million for output.

So my first prompt - 11 input and 4,266+2,702 =6,968 output (with thinking enabled), cost 2.439 cents.

Let's try 2.5 Flash again with thinking disabled:

llm -m gemini-2.5-flash-preview-04-17 'Generate an SVG of a pelican riding a bicycle' -o thinking_budget 0

Described below, again

11 input, 1705 output. That's 0.1025 cents. Transcript here - it still shows 25 thinking tokens even though I set the thinking budget to 0 - Logan confirms that this will still be billed at the lower rate:

In some rare cases, the model still thinks a little even with thinking budget = 0, we are hoping to fix this before we make this model stable and you won't be billed for thinking. The thinking budget = 0 is what triggers the billing switch.

Here's Gemini 2.5 Flash's self-description of that image:

A minimalist illustration shows a bright yellow bird riding a bicycle. The bird has a simple round body, small wings, a black eye, and an open orange beak. It sits atop a simple black bicycle frame with two large circular black wheels. The bicycle also has black handlebars and black and yellow pedals. The scene is set against a solid light blue background with a thick green stripe along the bottom, suggesting grass or ground.

And finally, let's ramp the thinking budget up to the maximum:

llm -m gemini-2.5-flash-preview-04-17 'Generate an SVG of a pelican riding a bicycle' -o thinking_budget 24576

Described below

I think it over-thought this one. Transcript - 5,174 output tokens and 3,023 thinking tokens. A hefty 2.8691 cents!

A simple, cartoon-style drawing shows a bird-like figure riding a bicycle. The figure has a round gray head with a black eye and a large, flat orange beak with a yellow stripe on top. Its body is represented by a curved light gray shape extending from the head to a smaller gray shape representing the torso or rear. It has simple orange stick legs with round feet or connections at the pedals. The figure is bent forward over the handlebars in a cycling position. The bicycle is drawn with thick black outlines and has two large wheels, a frame, and pedals connected to the orange legs. The background is plain white, with a dark gray line at the bottom representing the ground.

One thing I really appreciate about Gemini 2.5 Flash's approach to SVGs is that it shows very good taste in CSS, comments and general SVG class structure. Here's a truncated extract - I run a lot of these SVG tests against different models and this one has a coding style that I particularly enjoy. (Gemini 2.5 Pro does this too).

<svg width="800" height="500" viewBox="0 0 800 500" xmlns="http://www.w3.org/2000/svg">
  <style>
    .bike-frame { fill: none; stroke: #333; stroke-width: 8; stroke-linecap: round; stroke-linejoin: round; }
    .wheel-rim { fill: none; stroke: #333; stroke-width: 8; }
    .wheel-hub { fill: #333; }
    /* ... */
    .pelican-body { fill: #d3d3d3; stroke: black; stroke-width: 3; }
    .pelican-head { fill: #d3d3d3; stroke: black; stroke-width: 3; }
    /* ... */
  </style>
  <!-- Ground Line -->
  <line x1="0" y1="480" x2="800" y2="480" stroke="#555" stroke-width="5"/>
  <!-- Bicycle -->
  <g id="bicycle">
    <!-- Wheels -->
    <circle class="wheel-rim" cx="250" cy="400" r="70"/>
    <circle class="wheel-hub" cx="250" cy="400" r="10"/>
    <circle class="wheel-rim" cx="550" cy="400" r="70"/>
    <circle class="wheel-hub" cx="550" cy="400" r="10"/>
    <!-- ... -->
  </g>
  <!-- Pelican -->
  <g id="pelican">
    <!-- Body -->
    <path class="pelican-body" d="M 440 330 C 480 280 520 280 500 350 C 480 380 420 380 440 330 Z"/>
    <!-- Neck -->
    <path class="pelican-neck" d="M 460 320 Q 380 200 300 270"/>
    <!-- Head -->
    <circle class="pelican-head" cx="300" cy="270" r="35"/>
    <!-- ... -->

The LM Arena leaderboard now has Gemini 2.5 Flash in joint second place, just behind Gemini 2.5 Pro and tied with ChatGPT-4o-latest, Grok-3 and GPT-4.5 Preview.

# 17th April 2025, 8:56 pm / llm-release, gemini, llm, google, llm-reasoning, llm-pricing, llms, pelican-riding-a-bicycle, svg, logan-kilpatrick, chatbot-arena

Release llm-gemini 0.18 — LLM plugin to access Google's Gemini family of models

17th Apr 2025, 8:16 pm · llm, gemini

«« first « previous page 7 / 20 next » last »»

Simon Willison’s Weblog