<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: onnx</title><link href="http://feeds.simonwillison.net/" rel="alternate"/><link href="http://feeds.simonwillison.net/tags/onnx.atom" rel="self"/><id>http://feeds.simonwillison.net/</id><updated>2026-06-22T23:43:51+00:00</updated><author><name>Simon Willison</name></author><entry><title>Porting the Moebius 0.2B image inpainting model to run in the browser with Claude Code</title><link href="https://simonwillison.net/2026/Jun/22/porting-moebius/#atom-tag" rel="alternate"/><published>2026-06-22T23:43:51+00:00</published><updated>2026-06-22T23:43:51+00:00</updated><id>https://simonwillison.net/2026/Jun/22/porting-moebius/#atom-tag</id><summary type="html">
    &lt;p&gt;This morning &lt;a href="https://news.ycombinator.com/item?id=48630171"&gt;on Hacker News&lt;/a&gt; I saw &lt;a href="https://hustvl.github.io/Moebius/"&gt;Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance&lt;/a&gt;, describing a small but effective inpainting model - a model where you can mark regions of an image to remove and the model imagines what should fill the space. The released model &lt;a href="https://github.com/hustvl/Moebius/blob/9310b76e368f5f7a8ecdf06493231af279c9973b/requirements.txt#L1"&gt;required PyTorch and NVIDIA CUDA&lt;/a&gt;, but since it described itself as 0.2B I decided to try and get it running using WebGPU in a browser. TL;DR: I got it working, and you can try the demo at &lt;a href="https://simonw.github.io/moebius-web/"&gt;simonw.github.io/moebius-web/&lt;/a&gt;. Read on for the details.&lt;/p&gt;
&lt;h4 id="the-finished-tool"&gt;The finished tool&lt;/h4&gt;
&lt;p&gt;Here's a video demo of the finished tool:&lt;/p&gt;

&lt;video
width="1280"
height="1070"
poster="https://static.simonwillison.net/static/2026/inpainting_1280_poster.jpg"
preload="none"
controls="controls"
playsinline="playsinline"
style="max-width:100%;height:auto"&gt;
&lt;source src="https://static.simonwillison.net/static/2026/inpainting_1280.mp4" type="video/mp4" /&gt;
&lt;/video&gt;

&lt;p&gt;You can open any image in it (non-square images get letterboxed), highlight areas to remove, click the "Run inpaint" button and wait for the model to do its magic.&lt;/p&gt;
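&lt;p&gt;Letterboxing here means scaling the image to fit inside a square while keeping its aspect ratio, then padding the leftover space. A minimal Python sketch of that geometry follows. The &lt;code&gt;letterbox&lt;/code&gt; helper and the 512 pixel target are hypothetical, chosen for illustration, and are not taken from the demo's code:&lt;/p&gt;
&lt;pre&gt;
```python
def letterbox(width: int, height: int, target: int = 512) -> dict:
    """Fit a width x height image inside a target x target square.

    target=512 is a placeholder assumption, not the demo's real input size.
    """
    if width >= height:
        # Landscape or square: the width fills the square, the height shrinks.
        new_w = target
        new_h = max(1, round(height * target / width))
    else:
        # Portrait: the height fills the square, the width shrinks.
        new_h = target
        new_w = max(1, round(width * target / height))
    # Center the scaled image; the padding is the "letterbox" bars.
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return {"width": new_w, "height": new_h, "pad_x": pad_x, "pad_y": pad_y}


box = letterbox(1600, 900)
print(box)
```
&lt;/pre&gt;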
&lt;h4 id="a-parallel-agent-side-project"&gt;A parallel agent side-project&lt;/h4&gt;
&lt;p&gt;My main project for today was landing a major feature in Datasette: a UI for creating and altering tables, as a follow-up to the &lt;a href="https://simonwillison.net/2026/Jun/16/datasette/"&gt;insert and edit rows feature&lt;/a&gt; I released last week.&lt;/p&gt;
&lt;p&gt;I was working on that in Codex Desktop (here's &lt;a href="https://github.com/simonw/datasette/pull/2789"&gt;the PR&lt;/a&gt;) and often found myself spending 5-10 minutes spinning my fingers waiting for it to complete a mid-sized refactor or add the finishing touches to a change to the UI.&lt;/p&gt;
&lt;p&gt;(An amusing thing about coding agents is that the harder a problem is the &lt;em&gt;more&lt;/em&gt; time you have to get distracted while you wait for them to finish crunching!)&lt;/p&gt;
&lt;p&gt;So I decided to spin up Claude Code in a terminal window and see how far I could get at porting Moebius to the web.&lt;/p&gt;
&lt;h4 id="some-agentic-research-to-kick-off-the-project"&gt;Some agentic research to kick off the project&lt;/h4&gt;
&lt;p&gt;My first step was to ask regular Claude about the feasibility of this project. In &lt;a href="https://claude.ai/"&gt;Claude.ai&lt;/a&gt;, which has the ability to clone repos from GitHub:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Clone https://github.com/hustvl/Moebius/ and tell me if they published the code and weights to run this model anywhere&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(I hadn't spotted the link to the weights yet, that's tucked away in the "News" section.)&lt;/p&gt;
&lt;p&gt;Then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;For Moebius what are the options for running it right now - Python and NVIDIA CUDA only or other options too?&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Muse on the feasibility of porting it to Transformers.js or similar and running it in a browser&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I like telling models to "muse on X", it's the shortest way I've found of expressing that I want them to contemplate a problem for me without providing them with a concrete goal.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://claude.ai/share/551c3dc8-17ce-4a4b-a0c9-8cbded6c7bf1"&gt;that chat transcript&lt;/a&gt;. I copied out the last answer and saved it as &lt;a href="https://github.com/simonw/moebius-web/blob/main/research.md"&gt;research.md&lt;/a&gt; for Claude Code to read later.&lt;/p&gt;
&lt;p&gt;Claude suggested using &lt;strong&gt;ONNX Runtime Web on the WebGPU backend&lt;/strong&gt; - the layer &lt;em&gt;below&lt;/em&gt; the &lt;a href="https://huggingface.co/docs/transformers.js/en/index"&gt;Transformers.js&lt;/a&gt; library I had suggested.&lt;/p&gt;
&lt;p&gt;That was enough to convince me it was worth setting Claude Code loose and seeing how far it could get.&lt;/p&gt;
&lt;p&gt;I usually start projects like this by gathering as much of the information the coding agent might need as I can. Since I didn't expect this project to actually work I did everything in my &lt;code&gt;/tmp&lt;/code&gt; folder:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; /tmp
mkdir Moebius
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; Moebius
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Grab the Moebius python code&lt;/span&gt;
git clone https://github.com/hustvl/Moebius
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; And the model weights (Claude figured this out):&lt;/span&gt;
GIT_LFS_SKIP_SMUDGE=0 git clone \
  https://huggingface.co/hustvl/Moebius Moebius-weights
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Finally a couple of libraries we might use:&lt;/span&gt;
git clone https://github.com/huggingface/transformers.js
git clone https://github.com/microsoft/onnxruntime&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="setting-off-claude-code"&gt;Setting off Claude Code&lt;/h4&gt;
&lt;p&gt;I created a directory for the rest of the project and ran &lt;code&gt;git init&lt;/code&gt; in that so Claude could start committing code and notes:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;mkdir /tmp/Moebius/moebius-web
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; /tmp/Moebius/moebius-web
git init
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Copy in that research.md from earlier&lt;/span&gt;
git add research.md
git commit -m &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Initial research by Claude Opus 4.8&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I fired up a &lt;code&gt;claude&lt;/code&gt; instance in the &lt;code&gt;/tmp/Moebius&lt;/code&gt; folder, the level above all of the research materials I had prepared for it. I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Read ./moebius-web/research.md - your goal is to port this model to ONNX and WebGPU so we can run it directly in a browser, with a simple UI&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As it started to work I dropped in this follow-up (typos included):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Bulid this in /tmp/Moebius/moebius-web and commit early and often, also maintain a notes.md file in there with notes about what you figure out along the way - also start by writing out a plan.md in there and update that plan as oy work too&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I often ask agents to keep notes like this - the end result is often interesting, both for myself and for the next agent session that touches the same project. Here's what that &lt;a href="https://github.com/simonw/moebius-web/blob/main/notes.md"&gt;notes.md file&lt;/a&gt; looked like at the end of the project.&lt;/p&gt;
&lt;p&gt;I kicked it off and went back to my main project, checking in occasionally to see how Claude was doing. When it looked like it might have something that worked I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Tell me what URL I can visit in my own browser to try this&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I tried it out in Chrome and pasted some errors (and screenshots of errors) back into Claude Code.&lt;/p&gt;
&lt;p&gt;After a few rounds of this we had something that appeared to work! Time to put it on the internet so other people could use it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;How would we publish this to Hugging Face such that the model weights were on there and the HTML demo would show up in Hugging Face spaces?&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Code knows how to use the &lt;code&gt;hf&lt;/code&gt; CLI tool, so I created a model repo on &lt;a href="https://huggingface.co/"&gt;Hugging Face&lt;/a&gt;, then &lt;a href="https://huggingface.co/settings/tokens"&gt;created a token&lt;/a&gt; that could write to that repo and dropped it into a &lt;code&gt;/tmp/Moebius/token.txt&lt;/code&gt; file so Claude could use it.&lt;/p&gt;
&lt;p&gt;It published the 1.24GB of converted ONNX weights to &lt;a href="https://huggingface.co/simonw/Moebius-ONNX"&gt;huggingface.co/simonw/Moebius-ONNX&lt;/a&gt; for me.&lt;/p&gt;
&lt;p&gt;I'd seen other demos load weights into the browser from Hugging Face before, so I knew it was possible. I decided to host my own frontend code on GitHub Pages, so I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;I want to publish the moebius-web folder to GitHub, minus the large files (so maybe minus the models/ folder), such that when I turn on GitHub Pages for that repo navigating to https://simonw.github.io/moebius-web/ serves the UI&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Telling it the final URL was important in case it needed to fix the URLs in the demos that it was building so they would work when deployed to production.&lt;/p&gt;
&lt;p&gt;After a few more rounds of iteration, in between working on my main project, we got to a working, deployed version!&lt;/p&gt;
&lt;p&gt;Except... each time I reloaded the page it seemed to download ~1.3GB of model weights. Browser caching seemed pretty important for this!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;anything clever we can do with serviceworkers or similar to help cache this stuff? It seems to reload every time, I am concerned that there might be something weird about the way HF redirects work that mean we don't benefit from browser caching&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I knew that Transformers.js projects could handle this properly, so I grabbed a copy of the &lt;a href="https://huggingface.co/spaces/Xenova/whisper-web"&gt;Whisper Web&lt;/a&gt; demo, dropped it into &lt;code&gt;/tmp/Moebius/whisper-web&lt;/code&gt; and said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;look in /tmp/Moebius/whisper-web (with a subagent) and see how they do this&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That project consisted entirely of obfuscated, built JavaScript files, so I figured using a subagent would avoid spending the rest of my top-level token context deciphering those files.&lt;/p&gt;
&lt;p&gt;Claude figured out that it was using &lt;code&gt;caches.open("transformers-cache")&lt;/code&gt; - the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/CacheStorage/open"&gt;CacheStorage API&lt;/a&gt; - and &lt;a href="https://github.com/simonw/moebius-web/commit/05c1cbc4894460a70a8bc1718ac6d152219e0f28#diff-fb89c342dfa36f544a2d16a885b0f3d1d49f436a7d0eaeb80505f80a1f922603"&gt;added that to our project&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I've shared the &lt;a href="https://gisthost.github.io/?58039ba5c1ca3ed177e8659168996ee4"&gt;full Claude Code transcript&lt;/a&gt; for this project (published using my &lt;a href="https://github.com/simonw/claude-code-transcripts"&gt;claude-code-transcripts&lt;/a&gt; tool).&lt;/p&gt;
&lt;h4 id="what-did-i-learn-from-all-of-this-"&gt;What did I learn from all of this?&lt;/h4&gt;
&lt;p&gt;This definitely counts as vibe coding: I didn't look at a single line of code from the project, restricting my input to testing, suggesting small feature improvements (like a progress bar for the large file downloads) and pointing the model in the direction of examples of how I wanted things to work.&lt;/p&gt;
&lt;p&gt;Since I didn't write any code the amount I learned about the underlying technologies - WebGPU, ONNX, and the Moebius model itself - was very limited.&lt;/p&gt;
&lt;p&gt;As is usually the case with this kind of project the most important things I learned concerned what was &lt;em&gt;possible&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Claude Opus 4.8 is capable of converting a PyTorch model to ONNX, publishing the result to Hugging Face and then building out a web application and interface that can load and execute that model.&lt;/li&gt;
&lt;li&gt;Chrome, Firefox and Safari are all now capable of running this kind of model - I tried it in all three.&lt;/li&gt;
&lt;li&gt;The CacheStorage API works with ~1.3GB model files.&lt;/li&gt;
&lt;li&gt;... which means we can have inpainting as a feature of a client-only web application! (If our users can tolerate the 1.3GB download.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I felt like I should probably try and learn a little more about my project. I fired up &lt;a href="https://claude.ai/"&gt;Claude.ai&lt;/a&gt; and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Clone https://github.com/simonw/moebius-web/ and use it to teach me all about the model and ONNX and the process of converting a model to ONNX and WebGPU and basically everything I'd need to know in order to fully understand this repo&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://claude.ai/share/d11b8f2b-a52d-4ca2-be75-a710eaf18572"&gt;the transcript&lt;/a&gt; and the &lt;a href="https://github.com/simonw/moebius-web/blob/main/understanding.md"&gt;understanding.md&lt;/a&gt; Markdown file it created, which I've now added to the GitHub repo. I found the explanation of ONNX particularly enlightening:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ONNX&lt;/strong&gt; (Open Neural Network Exchange) is a portable, framework-neutral file format for neural networks. An &lt;code&gt;.onnx&lt;/code&gt; file is essentially two things bundled together:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A computation graph&lt;/strong&gt; — a directed graph of &lt;em&gt;nodes&lt;/em&gt;, where each node is an &lt;strong&gt;operator&lt;/strong&gt; (&lt;code&gt;Conv&lt;/code&gt;, &lt;code&gt;MatMul&lt;/code&gt;, &lt;code&gt;Add&lt;/code&gt;, &lt;code&gt;Einsum&lt;/code&gt;, &lt;code&gt;Softmax&lt;/code&gt;, &lt;code&gt;Gather&lt;/code&gt;, &lt;code&gt;Resize&lt;/code&gt;, …) wired together by named tensors flowing between them. This is the "recipe" for the forward pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The weights&lt;/strong&gt; — the learned parameter tensors (the convolution kernels, the embedding table, etc.), stored as initializers in that same graph.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Crucially, ONNX describes &lt;em&gt;what to compute&lt;/em&gt;, abstractly, without saying &lt;em&gt;how&lt;/em&gt; or &lt;em&gt;on what hardware&lt;/em&gt;. The operator set is versioned by an &lt;strong&gt;opset&lt;/strong&gt; number (this repo uses &lt;strong&gt;opset 18&lt;/strong&gt;), which pins down exactly which operators exist and what their semantics are.&lt;/p&gt;
&lt;/blockquote&gt;
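&lt;p&gt;To make that "graph plus weights" idea concrete, here is a toy analogy in plain Python. It is not the real ONNX file format (which is a protobuf serialization), and the &lt;code&gt;Node&lt;/code&gt;, &lt;code&gt;KERNELS&lt;/code&gt; and &lt;code&gt;run&lt;/code&gt; names are made up for this sketch. The graph only names operators; the "runtime" decides how each one is actually computed:&lt;/p&gt;
&lt;pre&gt;
```python
from dataclasses import dataclass


@dataclass
class Node:
    op: str        # operator name, e.g. "MatMul" or "Add"
    inputs: list   # names of the tensors this node reads
    output: str    # name of the tensor this node produces


# The weights: learned tensors stored by name alongside the graph.
initializers = {"W": [[1.0, 2.0], [3.0, 4.0]], "b": [0.5, -0.5]}

# The computation graph: y = x @ W + b, described abstractly.
graph = [
    Node("MatMul", ["x", "W"], "xw"),
    Node("Add", ["xw", "b"], "y"),
]


def matmul(vec, mat):
    # Vector-matrix product using plain Python lists.
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def add(a, b):
    return [p + q for p, q in zip(a, b)]


# A "runtime" maps operator names to concrete implementations (kernels).
# A different backend could swap these out without touching the graph.
KERNELS = {"MatMul": matmul, "Add": add}


def run(graph, initializers, feeds):
    tensors = {**initializers, **feeds}
    for node in graph:  # nodes are already listed in dependency order
        args = [tensors[name] for name in node.inputs]
        tensors[node.output] = KERNELS[node.op](*args)
    return tensors


result = run(graph, initializers, {"x": [1.0, 1.0]})["y"]
print(result)
```
&lt;/pre&gt;
&lt;p&gt;Swapping the entries in &lt;code&gt;KERNELS&lt;/code&gt; is the toy equivalent of choosing a different execution backend: the graph and weights stay the same.&lt;/p&gt;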
&lt;p&gt;It turns out PyTorch has built in mechanisms for exporting to ONNX, as seen &lt;a href="https://github.com/simonw/moebius-web/blob/080be6e737ec976130e260d34707d7d9b7f63d5b/python/export_onnx.py#L91"&gt;here in export_onnx.py&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;torch&lt;/span&gt;.&lt;span class="pl-c1"&gt;onnx&lt;/span&gt;.&lt;span class="pl-c1"&gt;export&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;dec&lt;/span&gt;, (&lt;span class="pl-s1"&gt;lat&lt;/span&gt;,), &lt;span class="pl-s1"&gt;dec_path&lt;/span&gt;, &lt;span class="pl-s1"&gt;opset_version&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;args&lt;/span&gt;.&lt;span class="pl-c1"&gt;opset&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;input_names&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s"&gt;"latent"&lt;/span&gt;], &lt;span class="pl-s1"&gt;output_names&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s"&gt;"image"&lt;/span&gt;],
    &lt;span class="pl-s1"&gt;dynamic_axes&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;{&lt;span class="pl-s"&gt;"latent"&lt;/span&gt;: {&lt;span class="pl-c1"&gt;0&lt;/span&gt;: &lt;span class="pl-s"&gt;"B"&lt;/span&gt;}, &lt;span class="pl-s"&gt;"image"&lt;/span&gt;: {&lt;span class="pl-c1"&gt;0&lt;/span&gt;: &lt;span class="pl-s"&gt;"B"&lt;/span&gt;}},
)&lt;/pre&gt;
&lt;p&gt;Claude also included a &lt;a href="https://github.com/simonw/moebius-web/blob/main/understanding.md#12-mini-glossary"&gt;handy glossary&lt;/a&gt; and an only-slightly-broken &lt;a href="https://github.com/simonw/moebius-web/blob/main/understanding.md#10-putting-the-whole-pipeline-in-one-picture"&gt;ASCII-art diagram&lt;/a&gt; showing how the model pipeline fits together.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/browsers"&gt;browsers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgl"&gt;webgl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/onnx"&gt;onnx&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="browsers"/><category term="transformers-js"/><category term="webgl"/><category term="vibe-coding"/><category term="coding-agents"/><category term="claude-code"/><category term="onnx"/></entry><entry><title>Load Llama-3.2 WebGPU in your browser from a local folder</title><link href="https://simonwillison.net/2025/Sep/8/webgpu-local-folder/#atom-tag" rel="alternate"/><published>2025-09-08T20:53:52+00:00</published><updated>2025-09-08T20:53:52+00:00</updated><id>https://simonwillison.net/2025/Sep/8/webgpu-local-folder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://static.simonwillison.net/static/2025/llama-3.2-webgpu/"&gt;Load Llama-3.2 WebGPU in your browser from a local folder&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Inspired by &lt;a href="https://news.ycombinator.com/item?id=45168953#45169054"&gt;a comment&lt;/a&gt; on Hacker News I decided to see if it was possible to modify the &lt;a href="https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-webgpu"&gt;transformers.js-examples/tree/main/llama-3.2-webgpu&lt;/a&gt; Llama 3.2 chat demo (&lt;a href="https://huggingface.co/spaces/webml-community/llama-3.2-webgpu"&gt;online here&lt;/a&gt;, I &lt;a href="https://simonwillison.net/2024/Sep/30/llama-32-webgpu/"&gt;wrote about it last September&lt;/a&gt;) to add an option to open a local model file directly from a folder on disk, rather than waiting for it to download over the network.&lt;/p&gt;
&lt;p&gt;I posed the problem to OpenAI's GPT-5-enabled Codex CLI like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/huggingface/transformers.js-examples
cd transformers.js-examples/llama-3.2-webgpu
codex
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Modify this application such that it offers the user a file browse button for selecting their own local copy of the model file instead of loading it over the network. Provide a "download model" option too.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex churned away for several minutes, even running commands like &lt;code&gt;curl -sL https://raw.githubusercontent.com/huggingface/transformers.js/main/src/models.js | sed -n '1,200p'&lt;/code&gt; to inspect the source code of the underlying Transformers.js library.&lt;/p&gt;
&lt;p&gt;After four prompts total (&lt;a href="https://gist.github.com/simonw/3c46c9e609f6ee77367a760b5ca01bd2?permalink_comment_id=5751814#gistcomment-5751814"&gt;shown here&lt;/a&gt;) it built something which worked!&lt;/p&gt;
&lt;p&gt;To try it out you'll need your own local copy of the Llama 3.2 ONNX model. You can get that (a ~1.2GB download) like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git lfs install
git clone https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct-q4f16
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then visit my &lt;a href="https://static.simonwillison.net/static/2025/llama-3.2-webgpu/"&gt;llama-3.2-webgpu&lt;/a&gt; page in Chrome or Firefox Nightly (since WebGPU is required), click "Browse folder", select that folder you just cloned, agree to the "Upload" confirmation (confusing since nothing is uploaded from your browser, the model file is opened locally on your machine) and click "Load local model".&lt;/p&gt;
&lt;p&gt;Here's an animated demo (recorded in real-time, I didn't speed this up):&lt;/p&gt;
&lt;p&gt;&lt;img alt="GIF. I follow the setup instructions, clicking to load a local model and browsing to the correct folder. Once loaded the model shows a chat interface, I run the example about time management which returns tokens at about 10/second." src="https://static.simonwillison.net/static/2025/webgpu-llama-demo-small.gif" /&gt;&lt;/p&gt;
&lt;p&gt;I pushed &lt;a href="https://github.com/simonw/transformers.js-examples/commit/cdebf4128c6e30414d437affd4b13b6c9c79421d"&gt;a branch with those changes here&lt;/a&gt;. The next step would be to modify this to support other models in addition to the Llama 3.2 demo, but I'm pleased to have got to this proof of concept with so little work beyond throwing some prompts at Codex to see if it could figure it out.&lt;/p&gt;
&lt;p&gt;According to the Codex &lt;code&gt;/status&lt;/code&gt; command &lt;a href="https://gist.github.com/simonw/3c46c9e609f6ee77367a760b5ca01bd2?permalink_comment_id=5751807#gistcomment-5751807"&gt;this used&lt;/a&gt; 169,818 input tokens, 17,112 output tokens and 1,176,320 cached input tokens. At current GPT-5 token pricing ($1.25/million input, $0.125/million cached input, $10/million output) that would cost 53.04 cents, but Codex CLI hooks into my existing $20/month ChatGPT Plus plan so this was bundled into that.&lt;/p&gt;
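&lt;p&gt;The arithmetic behind that kind of estimate can be sketched in a few lines of Python, using the token counts and per-million prices quoted above. This assumes uncached input, cached input and output tokens are each billed separately at their own rate; the variable and function names are just for illustration:&lt;/p&gt;
&lt;pre&gt;
```python
# Prices in US dollars per million tokens, as quoted in the text.
PRICE_PER_MILLION = {"input": 1.25, "cached_input": 0.125, "output": 10.00}

# Token counts reported by the Codex /status command, as quoted in the text.
TOKENS = {"input": 169_818, "cached_input": 1_176_320, "output": 17_112}


def cost_in_dollars(tokens, prices):
    # Assumption: each token category is billed independently at its own rate.
    return sum(tokens[kind] * prices[kind] / 1_000_000 for kind in tokens)


total_cents = cost_in_dollars(TOKENS, PRICE_PER_MILLION) * 100
print(f"{total_cents:.2f} cents")
```
&lt;/pre&gt;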

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45168953#45173297"&gt;My Hacker News comment&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex"&gt;codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/onnx"&gt;onnx&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="transformers-js"/><category term="webgpu"/><category term="llm-pricing"/><category term="vibe-coding"/><category term="gpt-5"/><category term="codex"/><category term="gpt"/><category term="onnx"/></entry><entry><title>llama-3.2-webgpu</title><link href="https://simonwillison.net/2024/Sep/30/llama-32-webgpu/#atom-tag" rel="alternate"/><published>2024-09-30T16:27:22+00:00</published><updated>2024-09-30T16:27:22+00:00</updated><id>https://simonwillison.net/2024/Sep/30/llama-32-webgpu/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/spaces/webml-community/llama-3.2-webgpu"&gt;llama-3.2-webgpu&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Llama 3.2 1B is a really interesting model, given its 128,000 token input and its tiny size (barely more than a GB).&lt;/p&gt;
&lt;p&gt;This page loads a &lt;a href="https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct-q4f16/tree/main/onnx"&gt;1.24GB q4f16 ONNX build&lt;/a&gt; of the Llama-3.2-1B-Instruct model and runs it with a React-powered chat interface directly in the browser, using &lt;a href="https://huggingface.co/docs/transformers.js/en/index"&gt;Transformers.js&lt;/a&gt; and WebGPU. &lt;a href="https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-webgpu"&gt;Source code for the demo is here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It worked for me just now in Chrome; in Firefox and Safari I got a “WebGPU is not supported by this browser” error message.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/xenovacom/status/1840767709317046460"&gt;@xenovacom&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/onnx"&gt;onnx&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="webassembly"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="transformers-js"/><category term="webgpu"/><category term="onnx"/></entry><entry><title>Transformer Explainer</title><link href="https://simonwillison.net/2024/Aug/11/transformer-explainer/#atom-tag" rel="alternate"/><published>2024-08-11T22:56:33+00:00</published><updated>2024-08-11T22:56:33+00:00</updated><id>https://simonwillison.net/2024/Aug/11/transformer-explainer/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://poloclub.github.io/transformer-explainer/"&gt;Transformer Explainer&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a very neat interactive visualization (with accompanying essay and video - scroll down for those) that explains the Transformer architecture for LLMs, using a GPT-2 model running directly in the browser using the ONNX runtime and Andrej Karpathy's nanoGPT project.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the Transformer Explainer interface, running a prompt &amp;quot;the sky is&amp;quot; which returns &amp;quot;blue&amp;quot; as the most obvious next word." src="https://static.simonwillison.net/static/2024/transformer-explainer.jpg" /&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/explorables"&gt;explorables&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/d3"&gt;d3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-2"&gt;gpt-2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/onnx"&gt;onnx&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="explorables"/><category term="d3"/><category term="generative-ai"/><category term="llms"/><category term="gpt-2"/><category term="onnx"/></entry><entry><title>Experimenting with local alt text generation in Firefox Nightly</title><link href="https://simonwillison.net/2024/Jun/2/experimenting-with-local-alt-text-generation-in-firefox-nightly/#atom-tag" rel="alternate"/><published>2024-06-02T13:12:44+00:00</published><updated>2024-06-02T13:12:44+00:00</updated><id>https://simonwillison.net/2024/Jun/2/experimenting-with-local-alt-text-generation-in-firefox-nightly/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hacks.mozilla.org/2024/05/experimenting-with-local-alt-text-generation-in-firefox-nightly/"&gt;Experimenting with local alt text generation in Firefox Nightly&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The PDF editor in Firefox (confession: I did not know Firefox ships with a PDF editor) is getting an experimental feature that can help suggest alt text for images for the human editor to then adapt and improve on.&lt;/p&gt;
&lt;p&gt;This is a great application of AI, made all the more interesting here because Firefox will run a local model on-device for this, using a custom trained model they describe as "our 182M parameters model using a Distilled version of GPT-2 alongside a Vision Transformer (ViT) image encoder".&lt;/p&gt;
&lt;p&gt;The model uses WebAssembly with ONNX running in &lt;a href="https://huggingface.co/docs/transformers.js/en/index"&gt;Transformers.js&lt;/a&gt;, and will be downloaded the first time the feature is put to use.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/mozhacks/status/1796774672639336804"&gt;@mozhacks&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/accessibility"&gt;accessibility&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alt-text"&gt;alt-text&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/firefox"&gt;firefox&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/onnx"&gt;onnx&lt;/a&gt;&lt;/p&gt;



</summary><category term="accessibility"/><category term="alt-text"/><category term="firefox"/><category term="javascript"/><category term="mozilla"/><category term="pdf"/><category term="ai"/><category term="webassembly"/><category term="llms"/><category term="transformers-js"/><category term="onnx"/></entry><entry><title>unstructured</title><link href="https://simonwillison.net/2024/Feb/2/unstructured/#atom-tag" rel="alternate"/><published>2024-02-02T02:47:15+00:00</published><updated>2024-02-02T02:47:15+00:00</updated><id>https://simonwillison.net/2024/Feb/2/unstructured/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Unstructured-IO/unstructured"&gt;unstructured&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Relatively new but impressively capable Python library (Apache 2 licensed) for extracting information from unstructured documents, such as PDFs, images, Word documents and many other formats.&lt;/p&gt;

&lt;p&gt;I got some good initial results against a PDF by running “pip install 'unstructured[pdf]'” and then using the “unstructured.partition.pdf.partition_pdf(filename)” function.&lt;/p&gt;

&lt;p&gt;There are a lot of moving parts under the hood: pytesseract, OpenCV, various PDF libraries, even an ONNX model. Even so, it installed cleanly for me on macOS and worked out of the box.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/onnx"&gt;onnx&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="pdf"/><category term="python"/><category term="onnx"/></entry><entry><title>llm-embed-onnx</title><link href="https://simonwillison.net/2024/Jan/28/llm-embed-onnx/#atom-tag" rel="alternate"/><published>2024-01-28T22:28:44+00:00</published><updated>2024-01-28T22:28:44+00:00</updated><id>https://simonwillison.net/2024/Jan/28/llm-embed-onnx/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-embed-onnx"&gt;llm-embed-onnx&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I wrote a new plugin for LLM that acts as a thin wrapper around onnx_embedding_models by Benjamin Anderson, providing access to seven embedding models that can run on the ONNX model framework.&lt;/p&gt;

&lt;p&gt;The actual plugin is around 50 lines of code, which makes for a nice example of how thin a plugin wrapper can be that adds new models to my LLM tool.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/embedding"&gt;embedding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/onnx"&gt;onnx&lt;/a&gt;&lt;/p&gt;



</summary><category term="embedding"/><category term="plugins"/><category term="projects"/><category term="ai"/><category term="llm"/><category term="onnx"/></entry><entry><title>llm-embed-onnx 0.1</title><link href="https://simonwillison.net/2024/Jan/28/llm-embed-onnx-2/#atom-tag" rel="alternate"/><published>2024-01-28T22:21:18+00:00</published><updated>2024-01-28T22:21:18+00:00</updated><id>https://simonwillison.net/2024/Jan/28/llm-embed-onnx-2/#atom-tag</id><summary type="html">
    
        &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-embed-onnx/releases/tag/0.1"&gt;llm-embed-onnx 0.1&lt;/a&gt;&lt;/p&gt;
        
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/onnx"&gt;onnx&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="onnx"/></entry><entry><title>Perplexity: interactive LLM visualization</title><link href="https://simonwillison.net/2023/Sep/6/perplexity/#atom-tag" rel="alternate"/><published>2023-09-06T03:33:05+00:00</published><updated>2023-09-06T03:33:05+00:00</updated><id>https://simonwillison.net/2023/Sep/6/perplexity/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://perplexity.vercel.app/"&gt;Perplexity: interactive LLM visualization&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I linked to a video of Linus Lee's GPT visualization tool &lt;a href="https://simonwillison.net/2023/Sep/5/a-token-wise-likelihood-visualizer-for-gpt-2/"&gt;the other day&lt;/a&gt;. Today he's released a new version of it that people can actually play with: it runs entirely in a browser, powered by a 120MB version of the GPT-2 ONNX model loaded using the brilliant Transformers.js JavaScript library.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/thesephist/status/1699190649096933474"&gt;@thesephist&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/onnx"&gt;onnx&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="ai"/><category term="webassembly"/><category term="generative-ai"/><category term="llms"/><category term="transformers-js"/><category term="onnx"/></entry><entry><title>Wikipedia search-by-vibes through millions of pages offline</title><link href="https://simonwillison.net/2023/Sep/4/wikipedia-search-by-vibes-through-millions-of-pages-offline/#atom-tag" rel="alternate"/><published>2023-09-04T21:13:50+00:00</published><updated>2023-09-04T21:13:50+00:00</updated><id>https://simonwillison.net/2023/Sep/4/wikipedia-search-by-vibes-through-millions-of-pages-offline/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.leebutterman.com/2023/06/01/offline-realtime-embedding-search.html"&gt;Wikipedia search-by-vibes through millions of pages offline&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Really cool demo by Lee Butterman, who built embeddings of 2 million Wikipedia pages and figured out how to serve them directly to the browser, where they are used to implement “vibes based” similarity search returning results in 250ms. Lots of interesting details about how he pulled this off, using Arrow as the file format and ONNX to run the model in the browser.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/leebutterman/status/1697645296963006698"&gt;@leebutterman&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/embedding"&gt;embedding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/wikipedia"&gt;wikipedia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/onnx"&gt;onnx&lt;/a&gt;&lt;/p&gt;



</summary><category term="embedding"/><category term="search"/><category term="wikipedia"/><category term="webassembly"/><category term="onnx"/></entry></feed>