Simon Willison’s Weblog


Feed a video to a vision LLM as a sequence of JPEG frames on the CLI (also LLM 0.25)

5th May 2025

The new llm-video-frames plugin can turn a video file into a sequence of JPEG frames and feed them directly into a long context vision LLM such as GPT-4.1, even when that LLM doesn’t directly support video input. It depends on a plugin feature I added to LLM 0.25, which I released last night.

Here’s how to try it out:

brew install ffmpeg # or apt-get or your package manager of choice
uv tool install llm # or pipx install llm or pip install llm
llm install llm-video-frames
llm keys set openai
# Paste your OpenAI API key here

llm -f video-frames:video.mp4 \
  'describe the key scenes in this video' \
  -m gpt-4.1-mini

The video-frames:filepath.mp4 syntax is provided by the new plugin. It takes that video, extracts one JPEG frame for every second of footage and then turns those frames into attachments that can be passed to the LLM.
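
The core of a loader like that is pleasingly small. Here's a rough sketch of the idea (not the actual plugin source), assuming LLM's register_fragment_loaders plugin hook and the llm.Attachment class that shows up in the prompt later in this post:

import subprocess
import tempfile
from pathlib import Path
from urllib.parse import parse_qs, urlparse

import llm


@llm.hookimpl
def register_fragment_loaders(register):
    register("video-frames", video_frames_loader)


def video_frames_loader(argument: str):
    # argument looks like "video.mp4" or "video.mp4?fps=2"
    parsed = urlparse(argument)
    fps = parse_qs(parsed.query).get("fps", ["1"])[0]
    frames_dir = Path(tempfile.mkdtemp())
    # One JPEG per frame at the requested rate, named frame_00001.jpg, ...
    subprocess.run(
        [
            "ffmpeg", "-i", parsed.path,
            "-vf", f"fps={fps}",
            str(frames_dir / "frame_%05d.jpg"),
        ],
        check=True,
    )
    return [
        llm.Attachment(path=str(frame))
        for frame in sorted(frames_dir.glob("frame_*.jpg"))
    ]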

Here’s a demo, against this video of Cleo:

llm -f video-frames:cleo.mp4 'describe key moments' -m gpt-4.1-mini

And the output from the model (transcript here):

The sequence of images captures the key moments of a dog being offered and then enjoying a small treat:

  1. In the first image, a hand is holding a small cupcake with purple frosting close to a black dog’s face. The dog looks eager and is focused intently on the treat.
  2. The second image shows the dog beginning to take a bite of the cupcake from the person’s fingers. The dog’s mouth is open, gently nibbling on the treat.
  3. In the third image, the dog has finished or is almost done with the treat and looks content, with a slight smile and a relaxed posture. The treat is no longer in the person’s hand, indicating that the dog has consumed it.

This progression highlights the dog’s anticipation, enjoyment, and satisfaction with the treat.

Total cost: 7,072 input tokens, 156 output tokens—for GPT-4.1 mini that’s 0.3078 cents (less than a third of a cent).
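
That number checks out if you assume GPT-4.1 mini's pricing of $0.40 per million input tokens and $1.60 per million output tokens:

# Assumes GPT-4.1 mini pricing: $0.40 / million input, $1.60 / million output
input_cost = 7_072 * 0.40 / 1_000_000   # $0.0028288
output_cost = 156 * 1.60 / 1_000_000    # $0.0002496
print(f"{(input_cost + output_cost) * 100:.4f} cents")  # 0.3078 cents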

In this case the plugin converted the video into three images: frame_00001.jpg, frame_00002.jpg and frame_00003.jpg.

The plugin accepts additional arguments. You can increase the frames-per-second using ?fps=2—for example:

llm -f 'video-frames:video.mp4?fps=2' 'summarize this video'

Or you can add ?timestamps=1 to cause ffmpeg to overlay a timestamp in the bottom right corner of each frame. This gives the model a chance to return timestamps in its output.
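
ffmpeg's drawtext filter is the usual tool for that kind of overlay. A filter string along these lines produces a similar stamp, though the plugin's exact string may differ:

# Illustrative only: stamps each frame with the video's filename and an
# HH:MM:SS.mmm timestamp in the bottom-right corner.
def timestamp_filter(filename: str, fps: int = 1) -> str:
    return (
        f"fps={fps},"
        f"drawtext=text='{filename} %{{pts\\:hms}}'"
        ":x=w-tw-10:y=h-th-10"
        ":fontcolor=white:box=1:boxcolor=black@0.5"
    )

# The resulting string is passed to ffmpeg as its -vf argument.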

Let’s try that with the Cleo video:

llm -f 'video-frames:cleo.mp4?timestamps=1&fps=5' \
  'key moments, include timestamps' -m gpt-4.1-mini

Here’s the output (transcript here):

Here are the key moments from the video “cleo.mp4” with timestamps:

  • 00:00:00.000—A dog on leash looks at a small cupcake with purple frosting being held by a person.
  • 00:00:00.800—The dog closely sniffs the cupcake.
  • 00:00:01.400—The person puts a small amount of the cupcake frosting on their finger.
  • 00:00:01.600—The dog starts licking the frosting from the person’s finger.
  • 00:00:02.600—The dog continues licking enthusiastically.

Let me know if you need more details or a description of any specific part.

That one sent 14 images to the API: 32,968 input tokens and 141 output tokens, for a total cost of 1.3413 cents.

It sent 5.9MB of image data to OpenAI’s API, encoded as base64 in the JSON API call.
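
LLM handles that encoding for you, but roughly speaking each frame ends up in the request body looking something like this (a sketch of the OpenAI chat completions image format, not LLM's internal code):

import base64
import json

# Sketch: how one frame might appear in the JSON payload.
with open("frame_00001.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "key moments, include timestamps"},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
    ],
}
print(len(json.dumps(message)))  # base64 inflates the image bytes by about a third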

The GPT-4.1 model family accepts up to 1,047,576 tokens. Aside from a 20MB size limit per image, I haven’t seen any documentation of limits on the number of images. You can fit a whole lot of JPEG frames in a million tokens!
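
As a very rough back-of-envelope estimate based on the Cleo numbers above:

# Rough estimate only - real per-image token counts depend on image
# dimensions and the model's vision tokenizer.
tokens_per_frame = 32_968 / 14              # ~2,355 tokens, prompt included
print(int(1_047_576 / tokens_per_frame))    # ~444 frames, 7+ minutes at 1fps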

Here’s what one of those frames looks like with the timestamp overlaid in the corner:

Cleo taking a treat from my fingers; in the bottom right corner is an overlay that says cleo.mp4 00:00:01.600

How I built the plugin with o4-mini

This is a great example of how rapid prototyping with an LLM can help demonstrate the value of a feature.

I was considering whether it would make sense for fragment plugins to return images in issue 972 when I had the idea to use ffmpeg to split a video into frames.

I know from past experience that a good model can write an entire plugin for LLM if you feed it the right example, so I started with this (reformatted here for readability):

llm -m o4-mini -f github:simonw/llm-hacker-news -s 'write a new plugin called llm_video_frames.py which takes video:path-to-video.mp4 and creates a temporary directory which it then populates with one frame per second of that video using ffmpeg - then it returns a list of [llm.Attachment(path="path-to-frame1.jpg"), ...] - it should also support passing video:video.mp4?fps=2 to increase to two frames per second, and if you pass ?timestamps=1 or &timestamps=1 then it should add a text timestamp to the bottom right conner of each image with the mm:ss timestamp of that frame (or hh:mm:ss if more than one hour in) and the filename of the video without the path as well.' -o reasoning_effort high

Here’s the transcript.

The new attachment mechanism went from vague idea to “I should build that” as a direct result of having an LLM-built proof-of-concept that demonstrated the feasibility of the new feature.

The code it produced was about 90% of the code I shipped in the finished plugin. Total cost: 5,018 input tokens, 2,208 output tokens = 1.5235 cents.

Annotated release notes for everything else in LLM 0.25

Here are the annotated release notes for everything else:

As described above, fragment loader plugins can now return attachments as well as fragments, which is the feature that powers llm-video-frames. The inspiration for this feature came from the llm-arxiv plugin by agustif.

  • New OpenAI models: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o3, o4-mini. #945, #965, #976.

My original plan was to leave these models exclusively to the new llm-openai plugin, since that allows me to add support for new models without a full LLM release. I’m going to punt on that until I’m ready to entirely remove the OpenAI models from LLM core.

  • New environment variables: LLM_MODEL and LLM_EMBEDDING_MODEL for setting the model to use without needing to specify -m model_id every time. #932

A convenience feature for when you want to set the default model for a terminal session with LLM without using the “global default model” mechanism.

  • New command: llm fragments loaders, to list all currently available fragment loader prefixes provided by plugins. #941

Mainly for consistency with the existing llm templates loaders command. Here’s the output when I run llm fragments loaders on my machine:

docs:
  Fetch the latest documentation for the specified package from
  https://github.com/simonw/docs-for-llms

  Use '-f docs:' for the documentation of your current version of LLM.

docs-preview:
  Similar to docs: but fetches the latest docs including alpha/beta releases.

symbex:
  Walk the given directory, parse every .py file, and for every
  top-level function or class-method produce its signature and
  docstring plus an import line.

github:
  Load files from a GitHub repository as fragments

  Argument is a GitHub repository URL or username/repository

issue:
  Fetch GitHub issue/pull and comments as Markdown

  Argument is either "owner/repo/NUMBER" or URL to an issue

pr:
  Fetch GitHub pull request with comments and diff as Markdown

  Argument is either "owner/repo/NUMBER" or URL to a pull request

hn:
  Given a Hacker News article ID returns the full nested conversation.

  For example: -f hn:43875136

video-frames:
  Fragment loader "video-frames:<path>?fps=N&timestamps=1"
  - extracts frames at `fps` per second (default 1)
  - if `timestamps=1`, overlays "filename hh:mm:ss" at bottom-right

That’s from llm-docs, llm-fragments-symbex, llm-fragments-github, llm-hacker-news and llm-video-frames.

  • llm fragments command now shows fragments ordered by the date they were first used. #973

This makes it easier to quickly debug a new fragment plugin—you can run llm fragments and glance at the bottom few entries.

I’ve also been using the new llm-echo debugging plugin for this—it adds a new fake model called “echo” which simply outputs whatever the prompt, system prompt, fragments and attachments are that were passed to the model:

llm -f docs:sqlite-utils -m echo 'Show me the context'

Output here.

  • llm chat now includes a !edit command for editing a prompt using your default terminal text editor. Thanks, Benedikt Willi. #969

This is a really nice enhancement to llm chat, making it much more convenient to edit longer prompts.

And the rest:

  • Allow -t and --system to be used at the same time. #916
  • Fixed a bug where accessing a model via its alias would fail to respect any default options set for that model. #968
  • Improved documentation for extra-openai-models.yaml. Thanks, Rahim Nathwani and Dan Guido. #950, #957
  • llm -c/--continue now works correctly with the -d/--database option. llm chat now accepts that -d/--database option. Thanks, Sukhbinder Singh. #933
