sqlite-utils: a Python library and CLI tool for building SQLite databases
25th February 2019
sqlite-utils is a combination Python library and command-line tool I’ve been building over the past six months which aims to make creating new SQLite databases as quick and easy as possible.
It’s part of the ecosystem of tools I’m building around my Datasette project.
I spent the weekend adding all kinds of exciting command-line options to it, so I’m ready to describe it to the world.
A Python library for quickly creating databases
A core idea behind Datasette is that SQLite is the ideal format for publishing all kinds of interesting structured data. Datasette takes any SQLite database and adds a browsable web interface, a JSON API and the ability to export tables and queries as CSV.
The other half of the equation then is tools to create SQLite databases. csvs-to-sqlite was my first CLI attempt at this. sqlite-utils
takes a much more flexible and comprehensive approach.
I started working on sqlite-utils
last year as part of my project to Analyze US Election Russian Facebook Ads. The initial aim was to build a library that made constructing new SQLite databases inside of a Jupyter notebook as productive as possible.
The core idea behind the library is that you can give it a list of Python dictionaries (equivalent to JSON objects) and it will automatically create a SQLite table with the correct schema, then insert those items into the new table.
To illustrate, let’s create a database using this JSON file of meteorite landings released by NASA (discovered via awesome-json-datasets curated by Justin Dorfman).
Here’s the quickest way in code to turn that into a database:
import requests
import sqlite_utils
db = sqlite_utils.Database("meteorites.db")
db["meteorites"].insert_all(
requests.get(
"https://data.nasa.gov/resource/y77d-th95.json"
).json(),
pk="id"
)
This two lines of code creates a new SQLite database on disk called meteorites.db
, creates a table in that file called meteorites
, detects the necessary columns based on the incoming data, inserts all of the rows and sets the id
column up as the primary key.
To see the resulting database, run datasette meteorites.db
and browse to http://127.0.0.1:8001/
.
You can do a lot more with the library. You can create tables, insert and upsert data in bulk, configure foreign key relationships, configure SQLite full-text search and much more. I encourage you to consult the documentation for all of the details.
The sqlite-utils command-line tool
This is the new stuff built over the past few days, and I think it’s really fun.
First install the tool from PyPI, using pip3 install sqlite-utils
.
Let’s start by grabbing a copy of the russian-ads.db database I created in Analyzing US Election Russian Facebook Ads (4MB):
$ cd /tmp
$ wget https://static.simonwillison.net/static/2019/russian-ads.db
We can see a list of tables in the database and their counts using the tables
command:
$ sqlite-utils tables russian-ads.db --counts
[{"table": "ads", "count": 3498},
{"table": "targets", "count": 1665},
{"table": "ad_targets", "count": 36559},
{"table": "ads_fts", "count": 3498},
{"table": "ads_fts_segments", "count": 120},
{"table": "ads_fts_segdir", "count": 1},
{"table": "ads_fts_docsize", "count": 3498},
{"table": "ads_fts_stat", "count": 1}]
By default, sqlite-utils
outputs data as neatly formatted JSON. You can get CSV instead using the --csv
option:
$ sqlite-utils tables russian-ads.db --counts --csv
table,count
ads,3498
targets,1665
ad_targets,36559
ads_fts,3498
ads_fts_segments,120
ads_fts_segdir,1
ads_fts_docsize,3498
ads_fts_stat,1
Or if you want a pretty ASCII-art table, use --table
(or the shortcut, -t
):
$ sqlite-utils tables russian-ads.db --counts -t
table count
---------------- -------
ads 3498
targets 1665
ad_targets 36559
ads_fts 3498
ads_fts_segments 120
ads_fts_segdir 1
ads_fts_docsize 3498
ads_fts_stat 1
The table view is built on top of tabulate, which offers dozens of table variations. Run sqlite-utils tables --help
for the full list—try --table -fmt=rst
for output that can be pasted directly into a reStructuredText document (handy for writing documentation).
So far we’ve just looked at a list of tables. Lets run a SQL query:
$ sqlite-utils russian-ads.db "select category, count(*) from targets group by category"
[{"category": "accessing_facebook_on", "count(*)": 1},
{"category": "age", "count(*)": 82},
{"category": "and_must_also_match", "count(*)": 228},
{"category": "army_reserve_industry", "count(*)": 3},
{"category": "behaviors", "count(*)": 16},
...
Again, this can be output as CSV using --csv
, or a table with --table
.
The default JSON output is objects wrapped in an array. Use --arrays
to get an array of arrays instead. More interestingly: --nl
causes the data to be output as newline-delimited JSON, like this:
$ sqlite-utils russian-ads.db "select category, count(*) from targets group by category" --nl
{"category": "accessing_facebook_on", "count(*)": 1}
{"category": "age", "count(*)": 82}
{"category": "and_must_also_match", "count(*)": 228}
{"category": "army_reserve_industry", "count(*)": 3}
{"category": "behaviors", "count(*)": 16}
...
This is a really interesting format for piping to other tools.
Creating databases from JSON on the command-line
The sqlite-utils insert
command can be used to create new tables by piping JSON or CSV directly into the tool. It’s the command-line equivalent of the .insert_all()
Python function I demonstrated earlier.
Here’s how to create that meteorite database directly from the command-line:
$ curl "https://data.nasa.gov/resource/y77d-th95.json" | \
sqlite-utils insert meteorites.db meteorites - --pk=id
This will use a SQLite database file called meteorites.db
(creating one if it does not yet exist), create or use a table called meteorites
and read the data from standard in (hence the pipe). You can pass a filename instead of a -
here to read data from a file on disk.
The insert
command accepts multiple formats—it defaults to expecting a JSON array of objects, but you can use --nl
to accept newline-delimited JSON and --csv
to accept CSV.
This means you can combine the tools! Let’s create a brand new database by exporting data from the old one, using newline-delimited JSON as the intermediary format:
$ sqlite-utils russian-ads.db \
"select * from ads where text like '%veterans%'" --nl | \
sqlite-utils insert veterans.db ads - --nl
This creates a new file called veterans.db
containing an ads
table with just the ads that mentioned veterans somewhere in their body text.
Since we’re working with JSON, we can introduce other command-line tools into the mix.
jq is a neat little tool for extracting data from a JSON file using its own mini domain-specific language.
The Nobel Prize API offers a JSON file listing all of the Nobel laureates—but they are contained as an array in a top level "laureates"
key. sqlite-utils
needs a flat array—so we can use jq
to get exactly that:
$ curl "https://api.nobelprize.org/v1/laureate.json" | \
jq ".laureates" | \
sqlite-utils insert nobel.db laureates -
Now we have a file called nobel.db
containing all of the Nobel laureates.
Since Datasette recently grew the ability to export newline-delimited JSON, we can also use this ability to directly consume data from Datasette. Lets grab every episode of the Joy of Painting in which Bob Ross painted a beach:
$ curl "https://fivethirtyeight.datasettes.com/fivethirtyeight-aa93d24/bob-ross%2Felements-by-episode.json?_facet=BEACH&BEACH=1&_shape=array&_nl=on" \
| sqlite-utils insert bob.db beach_episodes - --nl
Plenty more features
As with the Python API, the sqlite-utils
CLI tool has dozens of other options and extensive documentation.
I’ve been really enjoying growing an ecosystem of tools around Datasette. sqlite-utils
is the keystone here: it’s fundamental to other tools I’m building, such as db-to-sqlite (which can export any SQLAlchemy-supported database directly to a SQLite file on disk).
I’ve found myself increasingly turning to SQLite first for all kinds of ad-hoc analysis, and I’m excited to try out these new command-line abilities of sqlite-utils
for real-world data spelunking tasks.
More recent articles
- Qwen2.5-Coder-32B is an LLM that can code well that runs on my Mac - 12th November 2024
- Visualizing local election results with Datasette, Observable and MapLibre GL - 9th November 2024
- Project: VERDAD - tracking misinformation in radio broadcasts using Gemini 1.5 - 7th November 2024