Simon Willison’s Weblog

Subscribe

March 2024

149 posts: 8 entries, 74 links, 12 quotes, 55 beats

March 19, 2024

Release datasette-enrichments 0.3.1 — Tools for running enrichments against data stored in Datasette

DiskCache (via) Grant Jenks built DiskCache as an alternative caching backend for Django (also usable without Django), using a SQLite database on disk. The performance numbers are impressive—it even beats memcached in microbenchmarks, due to avoiding the need to access the network.

The source code (particularly in core.py) is a great case-study in SQLite performance optimization, after five years of iteration on making it all run as fast as possible.

# 3:43 pm / django, performance, python, sqlite

People share a lot of sensitive material on Quora - controversial political views, workplace gossip and compensation, and negative opinions held of companies. Over many years, as they change jobs or change their views, it is important that they can delete or anonymize their previously-written answers.

We opt out of the wayback machine because inclusion would allow people to discover the identity of authors who had written sensitive answers publicly and later had made them anonymous, and because it would prevent authors from being able to remove their content from the internet if they change their mind about publishing it.

quora.com/robots.txt

# 11:09 pm / internet-archive, robots-txt, quora

March 20, 2024

Release datasette-paste 0.1a1 — Paste data to create tables in Datasette

Papa Parse (via) I’ve been trying out this JavaScript library for parsing CSV and TSV data today and I’m very impressed. It’s extremely fast, has all of the advanced features I want (streaming support, optional web workers, automatically detecting delimiters and column types), has zero dependencies and weighs just 19KB minified—6.8KB gzipped.

The project is 11 years old now. It was created by Matt Holt, who later went on to create the Caddy web server. Today it’s maintained by Sergi Almacellas Abellana.

# 12:53 am / csv, javascript, matt-holt

Release datasette-paste 0.1a2 — Paste data to create tables in Datasette

AI Prompt Engineering Is Dead. Long live AI prompt engineering. Ignoring the clickbait in the title, this article summarizes research around the idea of using machine learning models to optimize prompts—as seen in tools such as Stanford’s DSPy and Google’s OPRO.

The article includes possibly the biggest abuse of the term “just” I have ever seen:

“But that’s where hopefully this research will come in and say ‘don’t bother.’ Just develop a scoring metric so that the system itself can tell whether one prompt is better than another, and then just let the model optimize itself.”

Developing a scoring metric to determine which prompt works better remains one of the hardest challenges in generative AI!

Imagine if we had a discipline of engineers who could reliably solve that problem—who spent their time developing such metrics and then using them to optimize their prompts. If the term “prompt engineer” hadn’t already been reduced to basically meaning “someone who types out prompts” it would be a pretty fitting term for such experts.

# 3:22 am / ai, prompt-engineering, generative-ai, llms, dspy

Every dunder method in Python. Trey Hunner: "Python includes 103 'normal' dunder methods, 12 library-specific dunder methods, and at least 52 other dunder attributes of various types."

This cheat sheet doubles as a tour of many of the more obscure corners of the Python language and standard library.

I did not know that Python has over 100 dunder methods now! Quite a few of these were new to me, like __class_getitem__ which can be used to implement type annotations such as list[int].

# 3:45 am / python, trey-hunner

Skew protection in Vercel (via) Version skew is a name for the bug that occurs when your user loads a web application and then unintentionally keeps that browser tab open across a deployment of a new version of the app. If you’re unlucky this can lead to broken behaviour, where a client makes a call to a backend endpoint that has changed in an incompatible way.

Vercel have an ingenious solution to this problem. Their platform already makes it easy to deploy many different instances of an application. You can now turn on “skew protection” for a number of hours which will keep older versions of your backend deployed.

The application itself can then include its desired deployment ID in a x-deployment-id header, a __vdpl cookie or a ?dpl= query string parameter.

# 2:06 pm / frontend, zero-downtime, vercel

TIL Running self-hosted QuickJS in a browser — I want to try using [QuickJS](https://bellard.org/quickjs/) compiled to WebAssembly in a browser as a way of executing untrusted user-provided JavaScript in a sandbox.

Releasing Common Corpus: the largest public domain dataset for training LLMs (via) Released today. 500 billion words from "a wide diversity of cultural heritage initiatives". 180 billion words of English, 110 billion of French, 30 billion of German, then Dutch, Spanish and Italian.

Includes quite a lot of US public domain data - 21 million digitized out-of-copyright newspapers (or do they mean newspaper articles?)

This is only an initial part of what we have collected so far, in part due to the lengthy process of copyright duration verification. In the following weeks and months, we’ll continue to publish many additional datasets also coming from other open sources, such as open data or open science.

Coordinated by French AI startup Pleias and supported by the French Ministry of Culture, among others.

I can't wait to try a model that's been trained on this.

# 7:34 pm / copyright, ethics, ai, generative-ai, llms, training-data, pleias, ai-ethics

TIL Reviewing your history of public GitHub repositories using ClickHouse — There's a story going around at the moment that people have found code from their private GitHub repositories in the AI training data known as The Stack, using this search tool: https://huggingface.co/spaces/bigcode/in-the-stack

GitHub Public repo history tool (via) I built this Observable Notebook to run queries against the GH Archive (via ClickHouse) to try to answer questions about repository history—in particular, were they ever made public as opposed to private in the past.

It works by combining together PublicEvent event (moments when a private repo was made public) with the most recent PushEvent event for each of a user’s repositories.

# 9:56 pm / github, projects, observable, clickhouse

Release datasette-paste 0.1a3 — Paste data to create tables in Datasette

March 21, 2024

Talking about Django’s history and future on Django Chat (via) Django co-creator Jacob Kaplan-Moss sat down with the Django Chat podcast team to talk about Django’s history, his recent return to the Django Software Foundation board and what he hopes to achieve there.

Here’s his post about it, where he used Whisper and Claude to extract some of his own highlights from the conversation.

# 12:42 am / django, jacob-kaplan-moss, podcasts, python, dsf

I think most people have this naive idea of consensus meaning “everyone agrees”. That’s not what consensus means, as practiced by organizations that truly have a mature and well developed consensus driven process.

Consensus is not “everyone agrees”, but [a model where] people are more aligned with the process than they are with any particular outcome, and they’ve all agreed on how decisions will be made.

Jacob Kaplan-Moss

# 12:45 am / jacob-kaplan-moss, management

Redis Adopts Dual Source-Available Licensing (via) Well this sucks: after fifteen years (and contributions from more than 700 people), Redis is dropping the 3-clause BSD license going forward, instead being “dual-licensed under the Redis Source Available License (RSALv2) and Server Side Public License (SSPLv1)” from Redis 7.4 onwards.

# 2:24 am / open-source, redis

DuckDB as the New jq (via) The DuckDB CLI tool can query JSON files directly, making it a surprisingly effective replacement for jq. Paul Gross demonstrates the following query:

select license->>'key' as license, count(*) from 'repos.json' group by 1

repos.json contains an array of {"license": {"key": "apache-2.0"}..} objects. This example query shows counts for each of those licenses.

# 8:36 pm / cli, sql, jq, duckdb

At this point, I’m confident saying that 75% of what generative-AI text and image platforms can do is useless at best and, at worst, actively harmful. Which means that if AI companies want to onboard the millions of people they need as customers to fund themselves and bring about the great AI revolution, they’ll have to perpetually outrun the millions of pathetic losers hoping to use this tech to make a quick buck. Which is something crypto has never been able to do.

In fact, we may have already reached a point where AI images have become synonymous with scams and fraud.

Ryan Broderick

# 9:49 pm / ai, generative-ai

March 22, 2024

The Dropflow Playground (via) Dropflow is a “CSS layout engine” written in TypeScript and taking advantage of the HarfBuzz text shaping engine (used by Chrome, Android, Firefox and more) compiled to WebAssembly to implement glyph layout.

This linked demo is fascinating: on the left hand side you can edit HTML with inline styles, and the right hand side then updates live to show that content rendered by Dropflow in a canvas element.

Why would you want this? It lets you generate images and PDFs with excellent performance using your existing knowledge HTML and CSS. It’s also just really cool!

# 1:33 am / css, javascript, typescript, webassembly

Release files-to-prompt 0.1 — Concatenate a directory full of files into a single prompt for use with LLMs

Claude and ChatGPT for ad-hoc sidequests

Visit Claude and ChatGPT for ad-hoc sidequests

Here is a short, illustrative example of one of the ways in which I use Claude and ChatGPT on a daily basis.

[... 1,754 words]

Threads has entered the fediverse (via) Threads users with public profiles in certain countries can now turn on a setting which makes their posts available in the fediverse—so users of ActivityPub systems such as Mastodon can follow their accounts to subscribe to their posts.

It’s only a partial integration at the moment: Threads users can’t themselves follow accounts from other providers yet, and their notifications will show them likes but not boosts or replies: “For now, people who want to see replies on their posts on other fediverse servers will have to visit those servers directly.”

Depending on how you count, Mastodon has around 9m user accounts of which 1m are active. Threads claims more than 130m active monthly users. The Threads team are developing these features cautiously which is reassuring to see—a clumsy or thoughtless integration could cause all sorts of damage just from the sheer scale of their service.

# 8:15 pm / facebook, threads, mastodon, activitypub, fediverse

March 23, 2024

mapshaper.org (via) It turns out the mapshaper CLI tool for manipulating geospatial data—including converting shapefiles to GeoJSON and back again—also has a web UI that runs the conversions entirely in your browser. If you need to convert between those (and other) formats it’s hard to imagine a more convenient option.

# 3:44 am / cli, gis, javascript, shapefiles, geojson

Building and testing C extensions for SQLite with ChatGPT Code Interpreter

Visit Building and testing C extensions for SQLite with ChatGPT Code Interpreter

I wrote yesterday about how I used Claude and ChatGPT Code Interpreter for simple ad-hoc side quests—in that case, for converting a shapefile to GeoJSON and merging it into a single polygon.

[... 4,612 words]

time-machine example test for a segfault in Python (via) Here's a really neat testing trick by Adam Johnson. Someone reported a segfault bug in his time-machine library. How you you write a unit test that exercises a segfault without crashing the entire test suite?

Adam's solution is a test that does this:

subprocess.run([sys.executable, "-c", code_that_crashes_python], check=True)

sys.executable is the path to the current Python executable - ensuring the code will run in the same virtual environment as the test suite itself. The -c option can be used to have it run a (multi-line) string of Python code, and check=True causes the subprocess.run() function to raise an error if the subprocess fails to execute cleanly and returns an error code.

I'm absolutely going to be borrowing this pattern next time I need to add tests to cover a crashing bug in one of my projects.

# 7:44 pm / python, testing, adam-johnson

Strachey love letter algorithm (via) This is a beautiful piece of computer history. In 1952, Christopher Strachey—a contemporary of Alan Turing—wrote a love letter generation program for a Manchester Mark 1 computer. It produced output like this:

"Darling Sweetheart,

You are my avid fellow feeling. My affection curiously clings to your passionate wish. My liking yearns for your heart. You are my wistful sympathy: my tender liking.

Yours beautifully

M. U. C."

The algorithm simply combined a small set of predefined sentence structures, filled in with random adjectives.

Wikipedia notes that "Strachey wrote about his interest in how “a rather simple trick” can produce an illusion that the computer is thinking, and that “these tricks can lead to quite unexpected and interesting results”.

LLMs, 1952 edition!

# 9:55 pm / computer-history, history, generative-ai, llms

March 24, 2024

shelmet (via) This looks like a pleasant ergonomic alternative to Python's subprocess module, plus a whole bunch of other useful utilities. Lets you do things like this:

sh.cmd("ps", "aux").pipe("grep", "-i", check=False).run("search term")

I like the way it uses context managers as well: with sh.environ({"KEY1": "val1"}) sets new environment variables for the duration of the block, with sh.cd("path/to/dir") temporarily changes the working directory and with sh.atomicfile("file.txt") as fp lets you write to a temporary file that will be atomically renamed when the block finishes.

# 4:37 am / python

Reviving PyMiniRacer (via) PyMiniRacer is “a V8 bridge in Python”—it’s a library that lets Python code execute JavaScript code in a V8 isolate and pass values back and forth (provided they serialize to JSON) between the two environments.

It was originally released in 2016 by Sqreen, a web app security startup startup. They were acquired by Datadog in 2021 and the project lost its corporate sponsor, but in this post Ben Creech announces that he is revitalizing the project, with the approval of the original maintainers.

I’m always interested in new options for running untrusted code in a safe sandbox. PyMiniRacer has the three features I care most about: code can’t access the filesystem or network by default, you can limit the RAM available to it and you can have it raise an error if code execution exceeds a time limit.

The documentation includes a newly written architecture overview which is well worth a read. Rather than embed V8 directly in Python the authors chose to use ctypes—they build their own V8 with a thin additional C++ layer to expose a ctypes-friendly API, then the Python library code uses ctypes to call that.

I really like this. V8 is a notoriously fast moving and complex dependency, so reducing the interface to just a thin C++ wrapper via ctypes feels very sensible to me.

This blog post is fun too: it’s a good, detailed description of the process to update something like this to use modern Python and modern CI practices. The steps taken to build V8 (6.6 GB of miscellaneous source and assets!) across multiple architectures in order to create binary wheels are particularly impressive—the Linux aarch64 build takes several days to run on GitHub Actions runners (via emulation), so they use Mozilla’s Sccache to cache compilation steps so they can retry until it finally finishes.

On macOS (Apple Silicon) installing the package with “pip install mini-racer” got me a 37MB dylib and a 17KB ctypes wrapper module.

# 5 pm / ctypes, javascript, open-source, python, v8

TIL Google Chrome --headless mode — In the README for [monolith](https://github.com/Y2Z/monolith) (a new Rust CLI tool for archiving HTML pages along with their images and assets) I spotted this tip for using Chrome in headless mode to execute JavaScript and output the resulting DOM: