Fun with binary data and SQLite
30th July 2020
This week I’ve been mainly experimenting with binary data storage in SQLite. sqlite-utils can now insert data from binary files, and datasette-media can serve content over HTTP that originated as binary BLOBs in a database file.
Paul Ford piqued my interest in this when he tweeted about loading thousands of PDF documents into a SQLite database:
I made a shell script that loads thousands of PDF docs into SQLite databases. That means I can have a web server that produces infinite randomly chosen US Military PowerPoint slides that I can scroll on my phone when I’m around the house.
— Paul Ford (@ftrain) July 26, 2020
The SQLite documentation claims that serving smaller binary files from BLOB columns can be 35% faster than the filesystem. I’ve done a little bit of work with binary files in SQLite—the datasette-render-binary and datasette-render-images both help display BLOB data—but I’d never really dug into it in much detail.
sqlite-utils insert-files
The first step was to make it easier to build database files that include binary data.
sqlite-utils is my combination Python library and CLI tool for building SQLite databases. I’ve been steadily evolving it for a couple of years now, and it’s the engine behind my Dogsheep collection of tools for personal analytics.
The new insert-files command can be used to insert content from binary files into a SQLite database, along with file metadata.
The most basic usage looks like this:
sqlite-utils insert-files gifs.db images *.gif
By default, this creates a database table like so:
CREATE TABLE [images] (
[path] TEXT PRIMARY KEY,
[content] BLOB,
[size] INTEGER
);
You can customize this table to include other file metadata using the -c
(short for --column
) option:
sqlite-utils insert-files gifs.db images *.gif \
-c path -c md5 -c last_modified:mtime -c size --pk=path
This creates a table with the following schema:
CREATE TABLE [images] (
[path] TEXT PRIMARY KEY,
[md5] TEXT,
[last_modified] FLOAT,
[size] INTEGER
);
If you pass a directory instead of a file name the command will recursively add every file in that directory.
I also improved sqlite-utils
with respect to outputting binary data. The new --raw
option dumps the binary contents of a column directly to standard out, so you can read an image back out of one of the above tables like so:
sqlite-utils photos.db \
"select content from images where path=:path" \
-p path 'myphoto.jpg' \
--raw > myphoto.jpg
This example also demonstrates the new support for :parameters
passed using the new -p
option, see #124.
sqlite-utils
usually communicates using JSON, but JSON doesn’t have the ability to represent binary values. Datasette outputs binary values like so:
"data": {
"$base64": true,
"encoded": "iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAY..."
}
I added support for the same format to sqlite-utils
—so you can now query binary columns and get out that nested object, or pipe JSON with that nested structure in to sqlite-utils insert
and have it stored as a binary BLOB in the database.
datasette-media
datasette-media is a plugin for serving binary content directly from Datasette on a special URL. I originally built it while working on Dogsheep Photos—given a SQLite file full of Apple Photos metadata I wanted to be able to serve thumbnails of the actual images via my Datasette web server.
Those photos were still stored on disk—the plugin lets you configure a SQL query like this which will cause hits to /-/media/photos/$UUID
to serve that file from disk:
{
"plugins": {
"datasette-media": {
"photo": {
"sql": "select filepath from apple_photos where uuid=:key"
}
}
}
}
Issue #14 added support for BLOB
columns as well. You can now configure the plugin like this to serve binary content that was stored in the database:
{
"plugins": {
"datasette-media": {
"thumb": {
"sql": "select content from thumbnails where uuid=:key"
}
}
}
}
This would serve content from a BLOB column in a thumbnails
table from the URL /-/media/thumb/$UUID
.
I really like this pattern of configuring plugins using SQL queries, where the returned column names have special meaning that is interpreted by the plugin. datasette-atom and datasette-ics use a similar trick.
I expanded datasette-media
with a few other related features:
- Return a
content_url
column and it will proxy content from that URL - Set
"enable_transform": true
for a media bucket to enable?w=
and?h=
and?format=
parameters for transforming the image before it is served to the user - Return a
content_filename
column to set a download file name (in acontent-disposition
HTTP header) prompting the user’s browser to download the file
See the README or release notes for more details.
Also this week
I renamed datasette-insert-api
to just datasette-insert, reflecting my plans to add non-API features to that plugin in the future.
In doing so I had to figure out how to rename a PyPI package such that dependent projects would continue to work. I ended up building a pypi-rename cookiecutter template encoding what I learned.
I enabled PostgreSQL full-text search for my blog’s Django Admin interface, and wrote a TIL on how I did it.
I added compound primary key support to db-to-sqlite
, so now it can convert PostgreSQL or MySQL databases to SQLite if they use compound primary keys.
TIL this week
- Implementing a “copy to clipboard” button
- PostgreSQL full-text search in the Django Admin
- SQLite BLOB literals
Releases this week
- sqlite-utils 2.13—2020-07-30
- datasette-media 0.5—2020-07-29
- db-to-sqlite 1.3—2020-07-27
- datasette-media 0.4—2020-07-27
- sqlite-utils 2.12—2020-07-27
- datasette-render-images 0.3.1—2020-07-27
- datasette-render-images 0.3—2020-07-27
- datasette-auth-passwords 0.3.1—2020-07-26
- datasette-insert 0.5—2020-07-25
More recent articles
- Qwen2.5-Coder-32B is an LLM that can code well that runs on my Mac - 12th November 2024
- Visualizing local election results with Datasette, Observable and MapLibre GL - 9th November 2024
- Project: VERDAD - tracking misinformation in radio broadcasts using Gemini 1.5 - 7th November 2024