You can now run prompts against images, audio and video in your terminal using LLM
29th October 2024
I released LLM 0.17 last night, the latest version of my combined CLI tool and Python library for interacting with hundreds of different Large Language Models such as GPT-4o, Llama, Claude and Gemini.
The signature feature of 0.17 is that LLM can now be used to prompt multi-modal models—which means you can now use it to send images, audio and video files to LLMs that can handle them.
- Processing an image with gpt-4o-mini
- Using a plugin to run audio and video against Gemini
- There’s a Python API too
- What can we do with this?
Processing an image with gpt-4o-mini
Here’s an example. First, install LLM—using brew install llm, pipx install llm or uv tool install llm, pick your favourite. If you have it installed already you may need to upgrade to 0.17, e.g. with brew upgrade llm.
Obtain an OpenAI key (or an alternative, see below) and provide it to the tool:
llm keys set openai
# paste key here
And now you can start running prompts against images.
llm 'describe this image' \
-a https://static.simonwillison.net/static/2024/pelican.jpg
The -a option stands for --attachment. Attachments can be specified as URLs, as paths to files on disk, or as - to read from data piped into the tool.
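The URL form is shown above; the other two look like this (a quick sketch, with pelican.jpg standing in for any local image file):

# Attachment from a file on disk
llm 'describe this image' -a pelican.jpg

# Attachment read from data piped to standard input
cat pelican.jpg | llm 'describe this image' -a -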
The above example uses the default model, gpt-4o-mini. I got back this:
The image features a brown pelican standing on rocky terrain near a body of water. The pelican has a distinct coloration, with dark feathers on its body and a lighter-colored head. Its long bill is characteristic of the species, and it appears to be looking out towards the water. In the background, there are boats, suggesting a marina or coastal area. The lighting indicates it may be a sunny day, enhancing the scene’s natural beauty.
Here’s that image: https://static.simonwillison.net/static/2024/pelican.jpg
You can run llm logs --json -c for a hint of how much that cost:
"usage": {
"completion_tokens": 89,
"prompt_tokens": 14177,
"total_tokens": 14266,
Using my LLM pricing calculator, that came to 0.218 cents—less than a quarter of a cent.
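As a sanity check, assuming GPT-4o mini’s list prices at the time ($0.15 per million input tokens, $0.60 per million output tokens), the arithmetic works out like this:

14,177 prompt tokens × $0.15 / 1M ≈ $0.00213
89 completion tokens × $0.60 / 1M ≈ $0.00005
Total ≈ $0.00218, i.e. 0.218 cents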
Let’s run that again with gpt-4o. Add -m gpt-4o to specify the model:
llm 'describe this image' \
-a https://static.simonwillison.net/static/2024/pelican.jpg \
-m gpt-4o
The image shows a pelican standing on rocks near a body of water. The bird has a large, long bill and predominantly gray feathers with a lighter head and neck. In the background, there is a docked boat, giving the impression of a marina or harbor setting. The lighting suggests it might be sunny, highlighting the pelican’s features.
That time it cost 435 prompt tokens (GPT-4o mini charges many more tokens per image than GPT-4o does) and the total was 0.1787 cents.
Using a plugin to run audio and video against Gemini
Models in LLM are defined by plugins. The application ships with a default OpenAI plugin to get people started, but there are dozens of other plugins providing access to different models, including models that can run directly on your own device.
Plugins need to be upgraded to add support for multi-modal input—here’s documentation on how to do that. I’ve shipped three plugins with support for multi-modal attachments so far: llm-gemini, llm-claude-3 and llm-mistral (for Pixtral).
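If you want to experiment with the other two, they install the same way, and llm models will then list the model IDs each plugin makes available (a sketch, assuming you have the relevant API keys configured):

llm install llm-claude-3
llm install llm-mistral
llm models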
So far these are all remote API plugins. It’s definitely possible to build a plugin that runs attachments through local models but I haven’t got one of those into good enough condition to release just yet.
The Google Gemini series are my favourite multi-modal models right now due to the size and breadth of content they support. Gemini models can handle images, audio and video!
Let’s try that out. Start by installing llm-gemini
:
llm install llm-gemini
Obtain a Gemini API key. The Gemini API has a free tier, so you can get started without needing to spend any money. Paste that key in here:
llm keys set gemini
# paste key here
The three Gemini 1.5 models are called Pro, Flash and Flash-8B. Let’s try it with Pro:
llm 'describe this image' \
-a https://static.simonwillison.net/static/2024/pelican.jpg \
-m gemini-1.5-pro-latest
A brown pelican stands on a rocky surface, likely a jetty or breakwater, with blurred boats in the background. The pelican is facing right, and its long beak curves downwards. Its plumage is primarily grayish-brown, with lighter feathers on its neck and breast. [...]
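Swapping the -m option is all it takes to point the same image at the cheaper Flash model. I’m assuming here that the plugin registers it as gemini-1.5-flash-latest, following the same naming pattern as the other two:

llm 'describe this image' \
  -a https://static.simonwillison.net/static/2024/pelican.jpg \
  -m gemini-1.5-flash-latest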
But let’s do something a bit more interesting. I shared a 7m40s MP3 of a NotebookLM podcast a few weeks ago. Let’s use Flash-8B—the cheapest Gemini model—to try and obtain a transcript.
llm 'transcript' \
-a https://static.simonwillison.net/static/2024/video-scraping-pelicans.mp3 \
-m gemini-1.5-flash-8b-latest
It worked!
Hey everyone, welcome back. You ever find yourself wading through mountains of data, trying to pluck out the juicy bits? It’s like hunting for a single shrimp in a whole kelp forest, am I right? Oh, tell me about it. I swear, sometimes I feel like I’m gonna go cross-eyed from staring at spreadsheets all day. [...]
Once again, llm logs -c --json will show us the tokens used. Here it’s 14754 prompt tokens and 1865 completion tokens. The pricing calculator says that adds up to... 0.0833 cents. Less than a tenth of a cent to transcribe a 7m40s audio clip.
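The same sanity check works here, assuming Gemini 1.5 Flash-8B’s list prices at the time ($0.0375 per million input tokens, $0.15 per million output tokens):

14,754 prompt tokens × $0.0375 / 1M ≈ $0.00055
1,865 completion tokens × $0.15 / 1M ≈ $0.00028
Total ≈ $0.00083, i.e. 0.0833 cents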
There’s a Python API too
Here’s what it looks like to execute multi-modal prompts with attachments using the LLM Python library:
import llm

model = llm.get_model("gpt-4o-mini")
response = model.prompt(
    "Describe these images",
    attachments=[
        llm.Attachment(path="pelican.jpg"),
        llm.Attachment(
            url="https://static.simonwillison.net/static/2024/pelicans.jpg"
        ),
    ]
)
You can send multiple attachments with a single prompt, and both file paths and URLs are supported—or even binary content, using llm.Attachment(content=b'binary goes here').
Any model plugin becomes available to Python with the same interface, which makes the LLM library a useful abstraction layer for trying the same prompts against many different models, both local and remote.
What can we do with this?
I’ve only had this working for a couple of days and the potential applications are somewhat dizzying. It’s trivial to spin up a Bash script that can do things like generate alt text for every image in a directory, for example. Here’s one Claude wrote just now:
#!/bin/bash
# Generate a .txt file of alt text for every JPEG in the current directory
for img in *.{jpg,jpeg}; do
    if [ -f "$img" ]; then
        output="${img%.*}.txt"
        llm -m gpt-4o-mini 'return just the alt text for this image' -a "$img" > "$output"
    fi
done
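Save that as something like alt-text.sh (the filename is arbitrary), make it executable and run it from the directory containing your images:

chmod +x alt-text.sh
./alt-text.sh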
On the #llm Discord channel Drew Breunig suggested this one-liner:
llm prompt -m gpt-4o "
tell me if it's foggy in this image, reply on a scale from
1-10 with 10 being so foggy you can't see anything and 1
being clear enough to see the hills in the distance.
Only respond with a single number." \
-a https://cameras.alertcalifornia.org/public-camera-data/Axis-Purisma1/latest-frame.jpg
That URL is to a live webcam feed, so here’s an instant GPT-4o vision-powered weather report!
We can have so much fun with this stuff.
All of the usual AI caveats apply: it can make mistakes, it can hallucinate, safety filters may kick in and refuse to transcribe audio based on the content. A lot of work is needed to evaluate how well the models perform at different tasks. There’s a lot still to explore here.
But at 1/10th of a cent for 7 minutes of audio, at least those explorations can be plentiful and inexpensive!
Update 12th November 2024: If you want to try running prompts against images using a local model that runs on your own machine you can now do so using Ollama, llm-ollama and Llama 3.2 Vision.