Simon Willison’s Weblog

September 2023

101 posts: 5 entries, 35 links, 5 quotes, 56 beats

Sept. 15, 2023

TIL Running tests against multiple versions of a Python dependency in GitHub Actions — My [datasette-export-notebook](https://github.com/simonw/datasette-export-notebook) plugin worked fine in the stable release of Datasette, currently version [0.64.3](https://docs.datasette.io/en/stable/changelog.html#v0-64-3), but failed in the Datasette 1.0 alphas. Here's the [issue describing the problem](https://github.com/simonw/datasette-export-notebook/issues/17).

Sept. 16, 2023

How CPython Implements and Uses Bloom Filters for String Processing. Fascinating dive into Python string internals by Abhinav Upadhyay. It turns out CPython uses very simple bloom filters in several parts of the core string methods, to solve problems like splitting on newlines where there are actually eight codepoints that could represent a newline. A tiny bloom filter can rule out most characters in a single operation, so the eight comparisons only need to run when that first check reports a possible match.
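The idea can be sketched in a few lines of Python. This is a minimal illustration, not CPython's actual C code: the 64-bit mask width, the function names and the particular list of eight line-break characters are all assumptions made for the example.

```python
# Sketch of a one-word bloom filter for character membership.
# BLOOM_WIDTH, the helper names and the LINE_BREAKS list are assumptions
# for illustration; they are not copied from CPython.
BLOOM_WIDTH = 64

LINE_BREAKS = ["\n", "\r", "\x1c", "\x1d", "\x1e", "\x85", "\u2028", "\u2029"]


def make_mask(chars):
    """Set one bit per character, chosen by the code point modulo the width."""
    mask = 0
    for ch in chars:
        mask |= 1 << (ord(ch) % BLOOM_WIDTH)
    return mask


LINE_BREAK_MASK = make_mask(LINE_BREAKS)


def maybe_line_break(ch):
    """Cheap check: False means definitely not a line break, True means maybe."""
    return bool(LINE_BREAK_MASK & (1 << (ord(ch) % BLOOM_WIDTH)))


def is_line_break(ch):
    """Run the full comparison only when the cheap check cannot rule it out."""
    if not maybe_line_break(ch):
        return False
    return ch in LINE_BREAKS
```

A character such as "J" shares a bit with "\n" in this sketch, so it passes the cheap check and is only rejected by the full comparison. That kind of false positive is expected from a bloom filter; false negatives are not possible.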

# 10:32 pm / bloom-filters, performance, python

Notes on using a single-person Mastodon server. Julia Evans’ experiences running a single-person Mastodon server (on masto.host, the same host I use for my own) pretty much exactly match what I’ve learned so far as well. The biggest disadvantage is the missing replies issue, where your server only shows replies to posts that come from people who you follow, so it’s easy to reply to something in a way that duplicates other replies that are invisible to you.

# 10:35 pm / julia-evans, mastodon

Sept. 17, 2023

TIL Limited JSON API for Google searches using Programmable Search Engine — I figured out how to use a JSON API to run a very limited Google search today in a legit, non-screen-scraper way.
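As a rough sketch of what a request to this kind of API looks like, the helper below only builds the request URL. The endpoint and the `key`, `cx` and `q` parameter names reflect my understanding of Google's Custom Search JSON API rather than anything stated in the TIL itself, and `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Custom Search JSON API (not quoted from the TIL).
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id):
    """Build a search URL; api_key and engine_id are values you supply yourself."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return f"{API_ENDPOINT}?{urlencode(params)}"
```

Fetching that URL with any HTTP client should return a JSON document of results, subject to the quota attached to the API key.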

Weeknotes: Embeddings, more embeddings and Datasette Cloud

Since my last weeknotes, a flurry of activity. LLM has embeddings support now, and Datasette Cloud has driven some major improvements to the wider Datasette ecosystem.

[... 2,427 words]

Sept. 18, 2023

Note that there have been no breaking changes since the [SQLite] file format was designed in 2004. The changes shown in the version history above have all been one of (1) typo fixes, (2) clarifications, or (3) filling in the "reserved for future extensions" bits with descriptions of those extensions as they occurred.

D. Richard Hipp

# 6:02 pm / sqlite, d-richard-hipp

Sept. 19, 2023

Release llm 0.11 — Access large language models from the command-line

LLM 0.11. I released LLM 0.11 with support for the new gpt-3.5-turbo-instruct completion model from OpenAI.

The most interesting feature of completion models is the option to request “log probabilities” from them, where each token returned is accompanied by up to 5 alternatives that were considered, along with their scores.
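Log probabilities are natural logarithms of each token's probability, so exponentiating them recovers an ordinary probability. Here is a minimal sketch of that conversion; the function name and the example tokens and values are made up for illustration and do not come from a real API response.

```python
import math


def logprobs_to_percentages(alternatives):
    """Convert a {token: log probability} mapping to ranked percentages."""
    ranked = sorted(alternatives.items(), key=lambda item: item[1], reverse=True)
    return [(token, round(math.exp(logprob) * 100, 2)) for token, logprob in ranked]


# Hypothetical alternatives for one position in a completion.
example = {" the": math.log(0.6), " a": math.log(0.3), " an": math.log(0.1)}
percentages = logprobs_to_percentages(example)
```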

# 3:28 pm / projects, ai, openai, generative-ai, llms, llm

The WebAssembly Go Playground (via) Jeff Lindsay has a full Go 1.21.1 compiler running entirely in the browser.

# 7:53 pm / go, jeff-lindsay, webassembly

Sept. 20, 2023

Release datasette-mask-columns 0.2.2 — Datasette plugin that masks specified database columns
Release datasette-sqlite-debug-authorizer 0.1 — Debug SQLite authorizer calls
Release datasette-upload-dbs 0.3.1 — Upload SQLite database files to Datasette

Sept. 21, 2023

Release datasette 0.64.4 — An open source multi-tool for exploring and publishing data
Release datasette 1.0a7 — An open source multi-tool for exploring and publishing data

Sept. 22, 2023

Release llm-llama-cpp 0.2b0 — LLM plugin for running models using llama.cpp

Sept. 23, 2023

TG: Polygon indexing (via) TG is a brand new geospatial library by Josh Baker, author of the Tile38 in-memory spatial server (kind of a geospatial Redis). TG is written in pure C and delivered as a single C file, reminiscent of the SQLite amalgamation.

TG looks really interesting. It implements almost the exact subset of geospatial functionality that I find most useful: point-in-polygon, intersect, WKT, WKB, and GeoJSON—all with no additional dependencies.

The most interesting thing about it is the way it handles indexing. In this documentation Josh describes two approaches he uses to speed up point-in-polygon and intersection operations, both of which go beyond the usual RTree implementation.

I think this could make the basis of a really useful SQLite extension—a lighter-weight alternative to SpatiaLite.

# 4:32 am / c, geospatial, gis, spatialite, sqlite, geojson, tg

TIL Trying out the facebook/musicgen-small sound generation model — Facebook's [musicgen](https://huggingface.co/facebook/musicgen-small) is a model that generates snippets of audio from a text description - it's effectively a Stable Diffusion for music.

Sept. 24, 2023

Should you give candidates feedback on their interview performance? Jacob provides a characteristically nuanced answer to the question of whether you should provide feedback to candidates you have interviewed. He suggests offering the candidate the option to email asking for feedback early in the interview process to avoid feeling pushy later on, and proposes the phrase “you failed to demonstrate...” as a useful framing device.

# 10:25 pm / jacob-kaplan-moss, management

Sept. 25, 2023

A Hackers’ Guide to Language Models. Jeremy Howard’s new 1.5 hour YouTube introduction to language models looks like a really useful place to catch up if you’re an experienced Python programmer looking to start experimenting with LLMs. He covers what they are and how they work, then shows how to build against the OpenAI API, build a Code Interpreter clone using OpenAI functions, run models from Hugging Face on your own machine (with NVIDIA cards or on a Mac) and finishes with a demo of fine-tuning a Llama 2 model to perform text-to-SQL using an open dataset.

# 12:24 am / python, ai, openai, generative-ai, llama, llms, jeremy-howard, fine-tuning, nvidia

We already know one major effect of AI on the skills distribution: AI acts as a skills leveler for a huge range of professional work. If you were in the bottom half of the skill distribution for writing, idea generation, analyses, or any of a number of other professional tasks, you will likely find that, with the help of AI, you have become quite good.

Ethan Mollick

# 4:37 pm / ai, generative-ai, llms, ethan-mollick

TIL Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg — [TG](https://github.com/tidwall/tg) is an exciting new project in the world of open source geospatial libraries. It's a single C file (an amalgamation, similar to that provided by SQLite) which implements the subset of geospatial operations that I most frequently find myself needing:

Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg. Alex Garcia built sqlite-tg—a SQLite extension that uses the brand new TG geospatial library to provide a whole suite of custom SQL functions for working with geospatial data.

Here are my notes on trying out his initial alpha releases. The extension already provides tools for converting between GeoJSON, WKT and WKB, plus the all-important tg_intersects() function for testing whether two geometries, such as a polygon and a point, overlap each other.

It’s pretty useful already. Without any geospatial indexing at all I was still able to get 700ms replies to a brute-force point-in-polygon query against 150MB of GeoJSON timezone boundaries stored as JSON text in a table.
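To give a feel for what a brute-force point-in-polygon query over GeoJSON text involves, here is a self-contained sketch using Python's sqlite3 module. It does not use sqlite-tg: `point_in_geojson_polygon` is a hypothetical Python function registered as a stand-in for the extension's SQL functions, the `zones` table is invented for the example, and the ray-casting test only handles plain Polygon geometries.

```python
import json
import sqlite3


def point_in_ring(x, y, ring):
    """Ray-casting test: count how many ring edges a ray to the right crosses."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def point_in_geojson_polygon(geojson_text, lon, lat):
    """Hypothetical stand-in SQL function; returns 1 if the point is inside."""
    geometry = json.loads(geojson_text)
    if geometry.get("type") != "Polygon":
        return 0
    rings = geometry["coordinates"]
    if not point_in_ring(lon, lat, rings[0]):
        return 0
    # Any further rings are holes cut out of the outer ring.
    if any(point_in_ring(lon, lat, hole) for hole in rings[1:]):
        return 0
    return 1


conn = sqlite3.connect(":memory:")
conn.create_function("point_in_geojson_polygon", 3, point_in_geojson_polygon)
conn.execute("create table zones (name text, geometry text)")
square = {"type": "Polygon", "coordinates": [[[0, 0], [4, 0], [4, 4], [0, 4], [0, 0]]]}
conn.execute("insert into zones values (?, ?)", ("square", json.dumps(square)))


def zones_containing(lon, lat):
    """Scan every row, with no index, and return the names of matching zones."""
    sql = "select name from zones where point_in_geojson_polygon(geometry, ?, ?)"
    return [row[0] for row in conn.execute(sql, (lon, lat))]
```

Every row is parsed and tested on each query, which is what makes this approach brute force. A spatial index would let most rows be skipped before the exact test runs.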

# 7:45 pm / geospatial, gis, sqlite, geojson, datasette, alex-garcia, tg

Upsert in SQL (via) Anton Zhiyanov is currently on a one-man quest to write detailed documentation for all of the fundamental SQL operations, comparing and contrasting how they work across multiple engines, generally with interactive examples.

Useful tips in here on why “insert... on conflict” is usually a better option than “insert or replace into” because the latter can perform a delete and then an insert, firing triggers that you may not have wanted to be fired.
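The difference is easy to see in SQLite with Python's sqlite3 module. This sketch uses an invented `counters` table and does not define any triggers. It shows the delete-then-insert behavior indirectly: a column left out of the replacement statement loses its value. It assumes a SQLite version recent enough to support the "on conflict" upsert clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table counters (name text primary key, hits integer default 0, note text)"
)
conn.execute("insert into counters (name, hits, note) values ('home', 5, 'keep me')")

# Upsert: the existing row is updated in place, so the note column survives.
conn.execute(
    """
    insert into counters (name, hits) values ('home', 1)
    on conflict(name) do update set hits = hits + excluded.hits
    """
)
after_upsert = conn.execute("select name, hits, note from counters").fetchone()

# Replace: the conflicting row is deleted and a fresh one is inserted,
# so the note column, which this statement does not mention, becomes NULL.
conn.execute("insert or replace into counters (name, hits) values ('home', 1)")
after_replace = conn.execute("select name, hits, note from counters").fetchone()
```

After the upsert the row is `('home', 6, 'keep me')`. After the replace it is `('home', 1, None)`.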

# 8:34 pm / databases, postgresql, sql, sqlite

Sept. 26, 2023

Release datasette-auth-tokens 0.4a4 — Datasette plugin for authenticating access using API tokens

Batch size one billion: SQLite insert speedups, from the useful to the absurd (via) Useful, detailed review of ways to maximize the performance of inserting a billion integers into a SQLite database table.

# 5:31 pm / performance, sqlite

TIL Snapshot testing with Syrupy — I'm a big fan of snapshot testing - writing tests where you compare the output of some function to a previously saved version, and can re-generate that version from scratch any time something changes.
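The underlying idea can be sketched without any library. The helper below is a hand-rolled illustration of snapshot testing, not Syrupy's API or implementation; `assert_matches_snapshot` and its `update` flag are hypothetical names invented for this example.

```python
import json
from pathlib import Path


def assert_matches_snapshot(value, snapshot_path, update=False):
    """Compare value to a saved JSON snapshot, writing it if missing or updating."""
    path = Path(snapshot_path)
    serialized = json.dumps(value, indent=2, sort_keys=True)
    if update or not path.exists():
        # First run, or an explicit request to regenerate the snapshot.
        path.write_text(serialized)
        return
    assert path.read_text() == serialized, f"snapshot mismatch: {path}"
```

The first call records the output, later calls fail if the output drifts, and passing `update=True` regenerates the saved version once a change has been reviewed and accepted.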

Rethinking the Luddites in the Age of A.I. I’ve been staying way clear of comparisons to Luddites in conversations about the potential harmful impacts of modern AI tools, because it seemed to me like an offensive, unproductive cheap shot.

This article has shown me that the comparison is actually a lot more relevant—and sympathetic—than I had realized.

In a time before labor unions, the Luddites represented an early example of a worker movement that tried to stand up for their rights in the face of transformational, negative change to their specific way of life.

“Knitting machines known as lace frames allowed one employee to do the work of many without the skill set usually required” is a really striking parallel to what’s starting to happen with a surprising array of modern professions already.

# 11:45 pm / ethics, ai, generative-ai, llms, ai-ethics

Sept. 27, 2023

The profusion of dubious A.I.-generated content resembles the badly made stockings of the nineteenth century. At the time of the Luddites, many hoped the subpar products would prove unacceptable to consumers or to the government. Instead, social norms adjusted.

Kyle Chayka

# 12:26 am / ethics, ai, generative-ai, llms, ai-ethics

Optimizing for Taste. David Cramer’s detailed explanation as to why his company Sentry mostly avoids A/B testing. David wrote this as an internal blog post originally, but is now sharing it with the world. I found myself nodding along vigorously as I read this—lots of astute observations here.

I particularly appreciated his closing note: “The strength of making a decision is making it. You can always make a new one later. Choose the obvious path forward, and if you don’t see one, find someone who does.”

# 4:34 am / ab-testing, david-cramer, sentry

Finding Bathroom Faucets with Embeddings. Absolutely the coolest thing I’ve seen someone build on top of my LLM tool so far: Drew Breunig is renovating a bathroom and needed a way to filter through literally thousands of options for faucet taps. He scraped 20,000 images of fixtures from a plumbing supply site and used LLM to embed every one of them via CLIP... and now he can ask for “faucets that look like this one”, or even run searches for faucets that match “Gawdy” or “Bond Villain” or “Nintendo 64”. Live demo included!
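Searches like "faucets that look like this one" typically come down to comparing embedding vectors. The sketch below ranks items by cosine similarity, which is one common choice and not necessarily what Drew's demo uses. The function names and the tiny two-dimensional vectors are invented for illustration; real CLIP embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, n=3):
    """Return the ids of the n stored vectors closest to the query vector."""
    scored = [(cosine_similarity(query, vector), key) for key, vector in embeddings.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:n]]


# Hypothetical stored embeddings, keyed by item id.
stored = {"faucet-a": [1.0, 0.0], "faucet-b": [0.0, 1.0], "faucet-c": [1.0, 1.0]}
closest = most_similar([1.0, 0.1], stored, n=2)
```

Because CLIP embeds text and images into the same vector space, the query vector could come from either an image or a phrase.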

# 6:18 pm / ai, generative-ai, embeddings, llm, drew-breunig, clip