Archive for December 2025

December 2025

153 posts: 12 entries, 43 links, 19 quotes, 6 notes, 73 beats

Dec. 19, 2025

Agent Skills. Anthropic have turned their skills mechanism into an "open standard", which I guess means it lives in an independent agentskills/agentskills GitHub repository now? I wouldn't be surprised to see this end up in the AAIF, recently the new home of the MCP specification.

The specification itself lives at agentskills.io/specification, published from docs/specification.mdx in the repo.

It is a deliciously tiny specification - you can read the entire thing in just a few minutes. It's also quite heavily under-specified - for example, there's a metadata field described like this:

Clients can use this to store additional properties not defined by the Agent Skills spec

We recommend making your key names reasonably unique to avoid accidental conflicts

And an allowed-skills field:

Experimental. Support for this field may vary between agent implementations

Example:
allowed-tools: Bash(git:*) Bash(jq:*) Read

The Agent Skills homepage promotes adoption by OpenCode, Cursor,Amp, Letta, goose, GitHub, and VS Code. Notably absent is OpenAI, who are quietly tinkering with skills but don't appear to have formally announced their support just yet.

Update 20th December 2025: OpenAI have added Skills to the Codex documentation and the Codex logo is now featured on the Agent Skills homepage (as of this commit.)

# 1:09 am / ai, generative-ai, llms, anthropic, ai-agents, coding-agents, skills

Introducing GPT-5.2-Codex. The latest in OpenAI's Codex family of models (not the same thing as their Codex CLI or Codex Cloud coding agent tools).

GPT‑5.2-Codex is a version of GPT‑5.2⁠ further optimized for agentic coding in Codex, including improvements on long-horizon work through context compaction, stronger performance on large code changes like refactors and migrations, improved performance in Windows environments, and significantly stronger cybersecurity capabilities.

As with some previous Codex models this one is available via their Codex coding agents now and will be coming to the API "in the coming weeks". Unlike previous models there's a new invite-only preview process for vetted cybersecurity professionals for "more permissive models".

I've been very impressed recently with GPT 5.2's ability to tackle multi-hour agentic coding challenges. 5.2 Codex scores 64% on the Terminal-Bench 2.0 benchmark that GPT-5.2 scored 62.2% on. I'm not sure how concrete that 1.8% improvement will be!

I didn't hack API access together this time (see previous attempts), instead opting to just ask Codex CLI to "Generate an SVG of a pelican riding a bicycle" while running the new model (effort medium). Here's the transcript in my new Codex CLI timeline viewer, and here's the pelican it drew:

Alt text by GPT-5.2-Codex: A minimalist illustration of a white pelican with a large orange beak riding a teal bicycle across a sandy strip of ground. The pelican leans forward as if pedaling, its wings tucked back and legs reaching toward the pedals. Simple gray motion lines trail behind it, and a pale yellow sun sits in the top‑right against a warm beige sky.

# 5:21 am / ai, openai, generative-ai, llms, pelican-riding-a-bicycle, llm-release, codex-cli, gpt-codex

Tool JSON Diff Tool — Compare JSON documents side-by-side to identify additions, removals, and modifications between two versions. The tool displays differences with color-coded highlighting and provides character-level detail for string changes, making it easy to spot exactly what has changed. Preloaded examples are available to explore common use cases like configuration updates, user profile changes, and API response comparisons.

19th Dec 2025, 4:14 pm

Sam Rose explains how LLMs work with a visual essay. Sam Rose is one of my favorite authors of explorable interactive explanations - here's his previous collection.

Sam joined ngrok in September as a developer educator. Here's his first big visual explainer for them, ostensibly about how prompt caching works but it quickly expands to cover tokenization, embeddings, and the basics of the transformer architecture.

The result is one of the clearest and most accessible introductions to LLM internals I've seen anywhere.

Animation. Starts in tokens mode with an array of 75, 305, 24, 887 - clicking embeddings animates those into a 2D array showing each one to be composed of three floating point numbers.

# 6:33 pm / ai, explorables, generative-ai, llms, sam-rose, tokenization

In 2025, Reinforcement Learning from Verifiable Rewards (RLVR) emerged as the de facto new major stage to add to this mix. By training LLMs against automatically verifiable rewards across a number of environments (e.g. think math/code puzzles), the LLMs spontaneously develop strategies that look like "reasoning" to humans - they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies for going back and forth to figure things out (see DeepSeek R1 paper for examples).

— Andrej Karpathy, 2025 LLM Year in Review

# 11:07 pm / andrej-karpathy, generative-ai, llm-reasoning, definitions, ai, llms, deepseek

Dec. 20, 2025

Tool CSS Grid Lanes Polyfill Demo — # CSS Grid Lanes Polyfill Demo

20th Dec 2025, 6:55 am

Dec. 21, 2025

Release preview-server 0.1a0 — Proxy server for previewing web app changes via uv

21st Dec 2025, 4:46 am

Every time you are inclined to use the word “teach”, replace it with “learn”. That is, instead of saying, “I teach”, say “They learn”. It’s very easy to determine what you teach; you can just fill slides with text and claim to have taught. Shift your focus to determining how you know whether they learned what you claim to have taught (or indeed anything at all!). That is much harder, but that is also the real objective of any educator.

— Shriram Krishnamurthi, Pedagogy Recommendations

# 5:26 am / teaching

Release preview-server 0.2a0 — Proxy server for previewing web app changes via uv

21st Dec 2025, 6:43 am

Release preview-server 0.2a1 — Proxy server for previewing web app changes via uv

21st Dec 2025, 6:54 am

Release git-history 0.7 — Tools for analyzing Git history using SQLite

21st Dec 2025, 8:22 pm

Dec. 22, 2025

Research environment-report — Running Claude Code on the web offers developers a versatile coding sandbox on Ubuntu 24.04, leveraging a broad toolkit that includes Python 3.11, Node.js 22, Go, Rust, and more, alongside developer utilities (Git, Make) and database clients (SQLite, PostgreSQL).

22nd Dec 2025, 12 am

Tool Photo print layout — Create custom photo print layouts for A4 pages with adjustable grid configurations, image fitting options, and portrait or landscape orientations. Users can add photos by uploading files or dragging and dropping them directly onto the page, then remove individual photos or clear the entire layout before printing. The preview displays actual print dimensions, allowing precise control over how photos will appear on the final printed output.

22nd Dec 2025, 3:22 am

Release preview-server 0.2a2 — Proxy server for previewing web app changes via uv

22nd Dec 2025, 6:19 am

Release preview-server 0.2a3 — Proxy server for previewing web app changes via uv

22nd Dec 2025, 3:19 pm

I just had my first success using a browser agent - in this case the Claude in Chrome extension - to solve an actual problem.

A while ago I set things up so anything served from the https://static.simonwillison.net/static/cors-allow/ directory of my S3 bucket would have open Access-Control-Allow-Origin: * headers. This is useful for hosting files online that can be loaded into web applications hosted on other domains.

Problem is I couldn't remember how I did it! I initially thought it was an S3 setting, but it turns out S3 lets you set CORS at the bucket-level but not for individual prefixes.

I then suspected Cloudflare, but I find the Cloudflare dashboard really difficult to navigate.

So I decided to give Claude in Chrome a go. I installed and enabled the extension (you then have to click the little puzzle icon and click "pin" next to Claude for the icon to appear, I had to ask Claude itself for help figuring that out), signed into Cloudflare, opened the Claude panel and prompted:

I'm trying to figure out how come all pages under http://static.simonwillison.net/static/cors/ have an open CORS policy, I think I set that up through Cloudflare but I can't figure out where

Off it went. It took 1m45s to find exactly what I needed.

Claude's conclusion:

Found it! Your open CORS policy for the /static/cors/ directory on static.simonwillison.net is configured through Cloudflare Transform Rules, specifically a Response Header Transform Rule named static.simonwillis.net/static/cors-allow/*

There's no "share transcript" option but I used copy and paste and two gnarly Claude Code sessions (one, two) to turn it into an HTML transcript which you can take a look at here.

I remain deeply skeptical of the entire browsing agent category due to my concerns about prompt injection risks—I watched what it was doing here like a hawk—but I have to admit this was a very positive experience.

# 4:10 pm / anthropic, claude, browser-agents, cors, ai, llms, generative-ai, chrome, cloudflare, prompt-injection, ai-agents

Dec. 23, 2025

Research mquickjs-sandbox — Exploring mquickjs, a highly minimal JavaScript engine, this project rigorously evaluates its suitability as a safe sandbox for running untrusted code. Various integration approaches are implemented, including Python FFI, C extensions, subprocess invocation, and WebAssembly runtimes—each tested for startup and execution performance, security isolation, and feature compatibility.

23rd Dec 2025, 12 am

Research url-limits-investigation — Major browser engines demonstrate significant differences in how they enforce URL length limits. Chromium sets a 2 MB cap at its inter-process communication boundary, rejecting longer URLs when crossing processes. Firefox relies on user-configurable preferences, employing a 1 MB "standard" limit but permitting up to 512 MB in absolute terms, with stricter limits (2,000 characters) for history and bookmarks.

23rd Dec 2025, 12 am

Tool Dual Recipe Cooking Timer — # Documentation

23rd Dec 2025, 2:47 am

Cooking with Claude

I’ve been having an absurd amount of fun recently using LLMs for cooking. I started out using them for basic recipes, but as I’ve grown more confident in their culinary abilities I’ve leaned into them for more advanced tasks. Today I tried something new: having Claude vibe-code up a custom application to help with the timing for a complicated meal preparation. It worked really well!

[... 1,313 words]

5:01 am / cooking, devfort, localstorage, tools, ai, generative-ai, llms, anthropic, claude, vision-llms, vibe-coding

Tool llm-lib demo — # Documentation

23rd Dec 2025, 6:18 am

Release llm-gemini 0.28.2 — LLM plugin to access Google's Gemini family of models

23rd Dec 2025, 4:20 pm · llm

MicroQuickJS. New project from programming legend Fabrice Bellard, of ffmpeg and QEMU and QuickJS and so much more fame:

MicroQuickJS (aka. MQuickJS) is a Javascript engine targetted at embedded systems. It compiles and runs Javascript programs with as low as 10 kB of RAM. The whole engine requires about 100 kB of ROM (ARM Thumb-2 code) including the C library. The speed is comparable to QuickJS.

It supports a subset of full JavaScript, though it looks like a rich and full-featured subset to me.

One of my ongoing interests is sandboxing: mechanisms for executing untrusted code - from end users or generated by LLMs - in an environment that restricts memory usage and applies a strict time limit and restricts file or network access. Could MicroQuickJS be useful in that context?

I fired up Claude Code for web (on my iPhone) and kicked off an asynchronous research project to see explore that question:

My full prompt is here. It started like this:

Clone https://github.com/bellard/mquickjs to /tmp

Investigate this code as the basis for a safe sandboxing environment for running untrusted code such that it cannot exhaust memory or CPU or access files or the network

First try building python bindings for this using FFI - write a script that builds these by checking out the code to /tmp and building against that, to avoid copying the C code in this repo permanently. Write and execute tests with pytest to exercise it as a sandbox

Then build a "real" Python extension not using FFI and experiment with that

Then try compiling the C to WebAssembly and exercising it via both node.js and Deno, with a similar suite of tests [...]

I later added to the interactive session:

Does it have a regex engine that might allow a resource exhaustion attack from an expensive regex?

(The answer was no - the regex engine calls the interrupt handler even during pathological expression backtracking, meaning that any configured time limit should still hold.)

Here's the full transcript and the final report.

Some key observations:

MicroQuickJS is very well suited to the sandbox problem. It has robust near and time limits baked in, it doesn't expose any dangerous primitive like filesystem of network access and even has a regular expression engine that protects against exhaustion attacks (provided you configure a time limit).
Claude span up and tested a Python library that calls a MicroQuickJS shared library (involving a little bit of extra C), a compiled a Python binding and a library that uses the original MicroQuickJS CLI tool. All of those approaches work well.
Compiling to WebAssembly was a little harder. It got a version working in Node.js and Deno and Pyodide, but the Python libraries wasmer and wasmtime proved harder, apparently because "mquickjs uses setjmp/longjmp for error handling". It managed to get to a working wasmtime version with a gross hack.

I'm really excited about this. MicroQuickJS is tiny, full featured, looks robust and comes from excellent pedigree. I think this makes for a very solid new entrant in the quest for a robust sandbox.

Update: I had Claude Code build tools.simonwillison.net/microquickjs, an interactive web playground for trying out the WebAssembly build of MicroQuickJS, adapted from my previous QuickJS plaground. My QuickJS page loads 2.28 MB (675 KB transferred). The MicroQuickJS one loads 303 KB (120 KB transferred).

Here are the prompts I used for that.

# 8:53 pm / c, javascript, nodejs, python, sandboxing, ai, webassembly, deno, pyodide, generative-ai, llms, claude-code, fabrice-bellard

Tool MicroQuickJS Code Executor — Execute JavaScript code in a lightweight MicroQuickJS sandbox environment running via WebAssembly, with results displayed directly on the page. The sandbox supports ES5-like JavaScript features and automatically saves your code in the URL for easy sharing and recovery. Choose between optimized and original WebAssembly versions, try built-in examples, and use Ctrl+Enter to quickly run your code.

23rd Dec 2025, 9:26 pm

If this [MicroQuickJS] had been available in 2010, Redis scripting would have been JavaScript and not Lua. Lua was chosen based on the implementation requirements, not on the language ones... (small, fast, ANSI-C). I appreciate certain ideas in Lua, and people love it, but I was never able to like Lua, because it departs from a more Algol-like syntax and semantics without good reasons, for my taste. This creates friction for newcomers. I love friction when it opens new useful ideas and abstractions that are worth it, if you learn SmallTalk or FORTH and for some time you are lost, it's part of how the languages are different. But I think for Lua this is not true enough: it feels like it departs from what people know without good reasons.

— Salvatore Sanfilippo, Hacker News comment on MicroQuickJS

# 11:03 pm / salvatore-sanfilippo, lua, redis, javascript

Dec. 24, 2025

Research microquickjs-in-redis — Expanding Redis’s scripting capabilities, the Redis JavaScript Module enables users to execute JavaScript scripts in Redis through the fast, embedded mquickjs engine, paralleling the Lua scripting features but with a JavaScript syntax. This module introduces commands like `JS.EVAL`, `JS.LOAD`, and `JS.CALL`, supporting script execution, caching, and invocation by SHA1 hash, along with native integrations for running Redis commands, logging, and error handling within scripts.

24th Dec 2025, 12 am

Research debug-failed-fix — Debugging investigation into why commit 0dcfad4's fix for cog code rendering didn't work. The fix correctly used string concatenation to avoid `-->` in Python strings, but the explanatory comment itself contained the literal `-->` sequence, which closed the HTML comment early. Solution: rewrote the comment to avoid the problematic character sequence.

24th Dec 2025, 12 am

uv-init-demos. uv has a useful uv init command for setting up new Python projects, but it comes with a bunch of different options like --app and --package and --lib and I wasn't sure how they differed.

So I created this GitHub repository which demonstrates all of those options, generated using this update-projects.sh script (thanks, Claude) which will run on a schedule via GitHub Actions to capture any changes made by future releases of uv.

# 10:05 pm / projects, python, github-actions, git-scraping, uv

Release claude-code-transcripts 0.1 — Tools for publishing transcripts for Claude Code sessions

24th Dec 2025, 11:45 pm

Dec. 25, 2025

Research vibium-python-client — Examining the Vibium browser automation project, this investigation developed a Python client library that interoperates with Vibium’s Go-powered "clicker" binary and existing Node.js tools. The Python client exposes both synchronous and asynchronous APIs, replicating advanced browser automation features such as auto-waiting, visibility checks, and custom commands (e.g., `vibium:find`, `vibium:click`) via WebDriver BiDi over WebSocket.

25th Dec 2025, 12 am

«« first « previous page 4 / 6 next » last »»