Weeknotes: Parquet in Datasette Lite, various talks, more LLM hacking
4th June 2023
I’ve fallen a bit behind on my weeknotes. Here’s a catchup for the last few weeks.
Parquet in Datasette Lite
Datasette Lite is my build of Datasette (a server-side Python web application) which runs entirely in the browser using WebAssembly and Pyodide. I recently added the ability to directly load Parquet files over HTTP.
This required an upgrade to the underlying version of Pyodide, in order to use the WebAssembly compiled version of the fastparquet library. That upgrade was blocked by an AttributeError: module 'os' has no attribute 'link' error, but Roman Yurchak showed me a workaround which unblocked me.
So now the following works:
This will work with any URL to a Parquet file that is served with open CORS headers—files on GitHub (or in a GitHub Gist) get these headers automatically.
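A quick way to check whether a file is likely to qualify is to inspect its response headers. This is a minimal sketch, not part of Datasette Lite: the has_open_cors and fetch_headers helpers are hypothetical names, and it assumes "open CORS headers" means Access-Control-Allow-Origin is set to *.

```python
import urllib.request


def has_open_cors(headers):
    """Return True if the headers allow requests from any origin.

    Assumption: "open" means Access-Control-Allow-Origin is exactly "*".
    """
    # HTTP header names are case-insensitive, so normalize them first
    normalized = {key.lower(): value.strip() for key, value in headers.items()}
    return normalized.get("access-control-allow-origin") == "*"


def fetch_headers(url):
    """Fetch response headers for a URL using a HEAD request (hits the network)."""
    request = urllib.request.Request(
        url,
        method="HEAD",
        # Some servers only send CORS headers when an Origin header is present
        headers={"Origin": "https://example.com"},
    )
    with urllib.request.urlopen(request) as response:
        return dict(response.headers)
```

Calling has_open_cors(fetch_headers(url)) with the URL of a Parquet file gives a rough indication of whether a browser would be allowed to fetch it from another origin.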
Also new in Datasette Lite: the ?memory=1 query string option, which starts Datasette Lite without loading any default demo databases. I added this to help me construct this demo for my new datasette-sqlite-url-lite plugin:
datasette-sqlite-url-lite—mostly written by GPT-4
datasette-sqlite-url is a really neat plugin by Alex Garcia which adds custom SQL functions to SQLite that allow you to parse URLs and extract their components.
There’s just one catch: the extension itself is written in C, and there isn’t yet a version of it compiled for WebAssembly to work in Datasette Lite.
I wanted to use some of its functions, so I decided to see if I could get a pure Python alternative working. Since this was a very low-stakes project, I tried getting GPT-4 to do essentially all of the work for me.
I prompted it like this—copying and pasting the examples directly from Alex’s documentation:
Write Python code to register the following SQLite custom functions:
select url_valid('https://sqlite.org'); -- 1
select url_scheme('https://www.sqlite.org/vtab.html#usage'); -- 'https'
select url_host('https://www.sqlite.org/vtab.html#usage'); -- 'www.sqlite.org'
select url_path('https://www.sqlite.org/vtab.html#usage'); -- '/vtab.html'
select url_fragment('https://www.sqlite.org/vtab.html#usage'); -- 'usage'
The code it produced was almost exactly what I needed.
I wanted some tests too, so I prompted:
Write a suite of pytest tests for this
This gave me the tests I needed—with one error in the way they called SQLite, but still doing 90% of the work for me.
Here’s the full ChatGPT conversation and the resulting code I checked into the repo.
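To give a sense of the shape of the problem, here is a minimal sketch of how functions like these can be registered using Python's sqlite3 and urllib.parse modules. This is not the code GPT-4 produced (that lives in the repo) and it is not a Datasette plugin. The register_url_functions helper is a hypothetical name, and treating a URL as valid when it has both a scheme and a host is my assumption, which may not match what sqlite-url does.

```python
import sqlite3
from urllib.parse import urlparse


def _parse(url):
    # Return a parsed URL, or None for non-strings and unparseable input
    if not isinstance(url, str):
        return None
    try:
        return urlparse(url)
    except ValueError:
        return None


def url_valid(url):
    # Assumption: "valid" means both a scheme and a host are present
    parsed = _parse(url)
    return 1 if parsed and parsed.scheme and parsed.netloc else 0


def url_scheme(url):
    parsed = _parse(url)
    return parsed.scheme if parsed else None


def url_host(url):
    # hostname excludes any port or credentials found in netloc
    parsed = _parse(url)
    return parsed.hostname if parsed else None


def url_path(url):
    parsed = _parse(url)
    return parsed.path if parsed else None


def url_fragment(url):
    parsed = _parse(url)
    return parsed.fragment if parsed else None


def register_url_functions(conn):
    # Each function takes exactly one argument: the URL string
    for fn in (url_valid, url_scheme, url_host, url_path, url_fragment):
        conn.create_function(fn.__name__, 1, fn)


conn = sqlite3.connect(":memory:")
register_url_functions(conn)
row = conn.execute(
    "select url_host(?), url_path(?)",
    ["https://www.sqlite.org/vtab.html#usage"] * 2,
).fetchone()
print(row)
```

Inside a Datasette plugin the same create_function calls would go in the prepare_connection plugin hook, which receives the SQLite connection.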
Various talks
Videos for three of my recent talks are now available on YouTube:
- Big Opportunities in Small Data is the keynote I gave at Citus Con: An Event for Postgres 2023—talking about Datasette, SQLite and some tricks I would love to see the PostgreSQL community adopt from the explorations I’ve been doing around small data.
- The Data Enthusiast’s Toolkit is an hour-long interview with Rizel Scarlett about both Datasette and my career to date. Frustratingly I had about 10 minutes of terrible microphone audio in the middle, but the conversation itself was really great.
- Data analysis with SQLite and Python is a video from PyCon of the full 2hr45m tutorial I gave there last month. The handout notes for that are available online too.
I also spotted that the Changelog put up a video, Just getting in to AI for development? Start here, with an extract from our podcast episode LLMs break the internet.
Entries this week
- It’s infuriatingly hard to understand how closed models train on their input
- ChatGPT should include inline tips
- Lawyer cites fake cases invented by ChatGPT, judge is not amused
- llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs
- Delimiters won’t save you from prompt injection
Releases this week
- datasette-sqlite-url-lite 0.1—2023-05-26
  A pure Python alternative to sqlite-url ready to be used in Datasette Lite
- sqlite-utils 3.32.1—2023-05-21
  Python CLI utility and library for manipulating SQLite databases
- strip-tags 0.3—2023-05-19
  CLI tool for stripping tags from HTML
- ttok 0.1—2023-05-18
  Count and truncate text based on tokens
- llm 0.3—2023-05-17
  Access large language models from the command-line
TIL this week
- Testing the Access-Control-Max-Age CORS header—2023-05-25
- Comparing two training datasets using sqlite-utils—2023-05-23
- mlc-chat—RedPajama-INCITE-Chat-3B on macOS—2023-05-22
- hexdump and hexdump -C—2023-05-22
- Exploring Baseline with Datasette Lite—2023-05-12