Simon Willison’s Weblog

Subscribe

Monday, 9th February 2026

Release datasette-sqlite-history-json 0.1a0 — Track changes to data using triggers and hidden JSON tables
Release showboat v0.3.0 — Create executable documents that demonstrate an agent's work
Release showboat v0.4.0 — Create executable documents that demonstrate an agent's work

AI Doesn’t Reduce Work—It Intensifies It (via) Aruna Ranganathan and Xingqi Maggie Ye from Berkeley Haas School of Business report initial findings in the HBR from their April to December 2025 study of 200 employees at a "U.S.-based technology company".

This captures an effect I've been observing in my own work with LLMs: the productivity boost these things can provide is exhausting.

AI introduced a new rhythm in which workers managed several active threads at once: manually writing code while AI generated an alternative version, running multiple agents in parallel, or reviving long-deferred tasks because AI could “handle them” in the background. They did this, in part, because they felt they had a “partner” that could help them move through their workload.

While this sense of having a “partner” enabled a feeling of momentum, the reality was a continual switching of attention, frequent checking of AI outputs, and a growing number of open tasks. This created cognitive load and a sense of always juggling, even as the work felt productive.

I'm frequently finding myself with work on two or three projects running parallel. I can get so much done, but after just an hour or two my mental energy for the day feels almost entirely depleted.

I've had conversations with people recently who are losing sleep because they're finding building yet another feature with "just one more prompt" irresistible.

The HBR piece calls for organizations to build an "AI practice" that structures how AI is used to help avoid burnout and counter effects that "make it harder for organizations to distinguish genuine productivity gains from unsustainable intensity".

I think we've just disrupted decades of existing intuition about sustainable working practices. It's going to take a while and some discipline to find a good new balance.

# 4:43 pm / careers, ai, generative-ai, llms, ai-assisted-programming, ai-ethics, cognitive-debt

Release sqlite-history-json 0.4 — SQLite table history tracking using a JSON audit log
Research Rod: Go Library for Chrome Automation -- Comprehensive API Reference — Rod is an advanced Go library designed to automate Chrome browsers using the Chrome DevTools Protocol, providing a comprehensive API for web scraping, browser control, element interaction, and robust waiting strategies. With high-level convenience methods (such as Must-prefixed methods for fast scripting) and direct protocol access, Rod enables streamlined workflows from simple scraping to complex automation scenarios, all without third-party drivers.
Research rod-cli: Chrome Automation from the Command Line — Leveraging the rod browser automation library, rod-cli provides a lightweight Go-based command-line tool for scripting persistent headless Chrome sessions. Each CLI command connects to and manipulates the same long-running Chrome instance via DevTools Protocol, enabling seamless multi-step browser automation in shell scripts or interactive use.
Release rodney v0.1.0 — CLI tool for interacting with the web

Structured Context Engineering for File-Native Agentic Systems (via) New paper by Damon McMillan exploring challenging LLM context tasks involving large SQL schemas (up to 10,000 tables) across different models and file formats:

Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables.

Unsurprisingly, the biggest impact was the models themselves - with frontier models (Opus 4.5, GPT-5.2, Gemini 2.5 Pro) beating the leading open source models (DeepSeek V3.2, Kimi K2, Llama 4).

Those frontier models benefited from filesystem based context retrieval, but the open source models had much less convincing results with those, which reinforces my feeling that the filesystem coding agent loops aren't handled as well by open weight models just yet. The Terminal Bench 2.0 leaderboard is still dominated by Anthropic, OpenAI and Gemini.

The "grep tax" result against TOON was an interesting detail. TOON is meant to represent structured data in as few tokens as possible, but it turns out the model's unfamiliarity with that format led to them spending significantly more tokens over multiple iterations trying to figure it out:

Screenshot of a figure from a research paper. Introductory text reads: "As schema size increased, TOON showed dramatically increased token consumption for Claude models despite being ~25% smaller in file size. Scale experiments used Claude models only." Below is "Figure 7: The 'Grep Tax' - TOON Token Overhead at Scale", a bar chart with a logarithmic y-axis labeled "Tokens" comparing YAML (teal) and TOON (purple) at two schema sizes: S5 (500 tables) and S9 (10,000 tables). At S5, TOON is +138% more tokens than YAML (~1,100 vs ~450). At S9, TOON is +740% more tokens (~50,000 vs ~7,000). Below the chart, explanatory text reads: "The 'grep tax' emerged as schema size scaled. At S5 (500 tables), TOON consumed 138% more tokens than YAML; at S9 (10,000 tables), this grew to 740%. Root cause: models lacked familiarity with TOON's syntax and could not construct effective refinement patterns."

# 11:56 pm / ai, prompt-engineering, generative-ai, llms, paper-review, context-engineering

Sunday, 8th February 2026
Tuesday, 10th February 2026