Joining CSV and JSON data with an in-memory SQLite database
19th June 2021
The new sqlite-utils memory command can import CSV and JSON data directly into an in-memory SQLite database, combine and query it using SQL and output the results as CSV, JSON or various other formats of plain text tables.
sqlite-utils memory
The new feature is part of sqlite-utils 3.10, which I released this morning. You can install it using brew install sqlite-utils or pip install sqlite-utils.
I’ve recorded this video demonstrating the new feature—with full accompanying notes below.
sqlite-utils already offers a mechanism for importing CSV and JSON data into a SQLite database file, in the form of the sqlite-utils insert command. Processing data with this involves two steps: first import it into a temp.db file, then use sqlite-utils query to run queries and output the results.
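Here’s a minimal sketch of that two-step workflow (the data.csv file and the mydata table name are placeholders for illustration):
$ sqlite-utils insert temp.db mydata data.csv --csv
$ sqlite-utils query temp.db 'select * from mydata limit 5'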
Using SQL to re-shape data is really useful—since sqlite-utils can output in multiple different formats, I frequently find myself loading in a CSV file and exporting it back out as JSON, or vice-versa.
This week I realized that I had most of the pieces in place to reduce this to a single step. The new sqlite-utils memory command (full documentation here) operates against a temporary, in-memory SQLite database. It can import data, execute SQL and output the result in a one-liner, without needing any temporary database files along the way.
Here’s an example. My Dogsheep GitHub organization has a number of repositories. GitHub make those available via an authentication-optional API endpoint at https://api.github.com/users/dogsheep/repos—which returns JSON that looks like this (simplified):
[
{
"id": 197431109,
"name": "dogsheep-beta",
"full_name": "dogsheep/dogsheep-beta",
"size": 61,
"stargazers_count": 79,
"watchers_count": 79,
"forks": 0,
"open_issues": 11
},
{
"id": 256834907,
"name": "dogsheep-photos",
"full_name": "dogsheep/dogsheep-photos",
"size": 64,
"stargazers_count": 116,
"watchers_count": 116,
"forks": 5,
"open_issues": 18
}
]
With sqlite-utils memory we can see the 3 most popular repos by number of stars like this:
$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
' -t
full_name forks_count stars
-------------------------- ------------- -------
dogsheep/twitter-to-sqlite 12 225
dogsheep/github-to-sqlite 14 139
dogsheep/dogsheep-photos 5 116
We’re using curl to fetch the JSON and pipe it into sqlite-utils memory—the - means “read from standard input”. Then we pass the following SQL query:
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
stdin is the temporary table created for the data piped in to the tool. The query selects three of the JSON properties, renames stargazers_count to stars, sorts by stars and returns the first three.
The -t option here means “output as a formatted table”—without that option we get JSON:
$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
'
[{"full_name": "dogsheep/twitter-to-sqlite", "forks_count": 12, "stars": 225},
{"full_name": "dogsheep/github-to-sqlite", "forks_count": 14, "stars": 139},
{"full_name": "dogsheep/dogsheep-photos", "forks_count": 5, "stars": 116}]
Or we can use --csv to get back CSV:
$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
' --csv
full_name,forks_count,stars
dogsheep/twitter-to-sqlite,12,225
dogsheep/github-to-sqlite,14,139
dogsheep/dogsheep-photos,5,116
The -t option supports a number of different formats, specified using --fmt. If I wanted to generate a LaTeX table of the top repos by stars I could do this:
$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
' -t --fmt=latex
\begin{tabular}{lrr}
\hline
full\_name & forks\_count & stars \\
\hline
dogsheep/twitter-to-sqlite & 12 & 225 \\
dogsheep/github-to-sqlite & 14 & 139 \\
dogsheep/dogsheep-photos & 5 & 116 \\
\hline
\end{tabular}
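The --fmt option accepts the table format names from the Python tabulate library, which is what sqlite-utils uses for its -t output. For example, --fmt=github should render the same result as a GitHub-flavored Markdown table:
$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
' -t --fmt=github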
We can run aggregate queries too—let’s add up the total size and total number of stars across all of those repositories:
$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select sum(size), sum(stargazers_count) from stdin
' -t
sum(size) sum(stargazers_count)
----------- -----------------------
843 934
(I believe size here is measured in kilobytes: the GitHub API documentation isn’t clear on this point.)
Joining across different files
All of these examples have worked with JSON data piped into the tool—but you can also pass one or more files, of different formats, in a way that lets you execute joins against them.
As an example, let’s combine two sources of data:
- The New York Times publish a us-states.csv file with Covid cases and deaths by state over time.
- The CDC have an undocumented JSON endpoint (which I’ve been archiving here) tracking the progress of vaccination across different states.
We’re going to run a join from that CSV data to that JSON data, and output a table of results.
First, we need to download the files. The CDC JSON data isn’t quite in the right shape for our purposes:
{
"runid": 2023,
"vaccination_data": [
{
"Date": "2021-06-19",
"Location": "US",
"ShortName": "USA",
...
sqlite-utils expects a flat JSON array of objects—we can use jq to re-shape the data like so:
$ curl 'https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data' \
| jq .vaccination_data > vaccination_data.json
The New York Times data is good as is:
$ wget 'https://github.com/nytimes/covid-19-data/raw/master/us-states.csv'
Now that we have the data locally, we can run a join to combine it using the following command:
$ sqlite-utils memory us-states.csv vaccination_data.json "
select
max(t1.date),
t1.state,
t1.cases,
t1.deaths,
t2.Census2019,
t2.Dist_Per_100K
from
t1
join t2 on t1.state = replace(t2.LongName, 'New York State', 'New York')
group by
t1.state
order by
Dist_Per_100K desc
" -t
max(t1.date) state cases deaths Census2019 Dist_Per_100K
-------------- ------------------------ ------- -------- ------------ ---------------
2021-06-18 District of Columbia 49243 1141 705749 149248
2021-06-18 Vermont 24360 256 623989 146257
2021-06-18 Rhode Island 152383 2724 1059361 141291
2021-06-18 Massachusetts 709263 17960 6892503 139692
2021-06-18 Maryland 461852 9703 6045680 138193
2021-06-18 Maine 68753 854 1344212 136894
2021-06-18 Hawaii 35903 507 1415872 136024
...
I’m using automatically created numeric aliases t1 and t2 for the files here, but I can also use their full table names "us-states" (quotes needed due to the hyphen) and vaccination_data instead.
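As a sketch, the from and join lines in the query above could be rewritten against those named tables, aliasing them back to t1 and t2 so the rest of the query stays unchanged:
from
"us-states" as t1
join t2 on t1.state = replace(t2.LongName, 'New York State', 'New York')
where t2 stands in for vaccination_data as t2.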
The replace() operation there is needed because the vaccination_data.json file calls New York “New York State” while the us-states.csv file just calls it “New York”.
The max(t1.date) and group by t1.state combination is a useful SQLite trick: if you perform a group by and then ask for the max() of a value, the other columns returned from that table will be the columns for the row that contains that maximum value.
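Here’s a minimal sketch of that trick against just the CSV file, pulling each state’s most recent row (output omitted; the limit is only there to keep the result short):
$ sqlite-utils memory us-states.csv '
select max(date), state, cases, deaths
from t1 group by state limit 3
' -t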
This demo is a bit of a stretch—once I reach this level of complexity I’m more likely to load the files into a SQLite database file on disk and open them up in Datasette—but it’s a fun example of a more complex join in action.
Also in sqlite-utils 3.10
The sqlite-utils memory command has another new trick up its sleeve: it automatically detects which columns in a CSV or TSV file contain integer or float values and creates the corresponding in-memory SQLite table with the correct types. This ensures max() and sum() and order by work in a predictable manner, without accidentally sorting 9 as higher than 11.
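Here’s a quick way to see the detection at work, relying on the automatic format detection for standard input described above (the tiny CSV is invented for the demonstration). With points detected as an integer column, 11 sorts above 9; as text, '9' would incorrectly sort above '11':
$ printf 'name,points\nalpha,9\nbeta,11\n' \
| sqlite-utils memory - 'select * from stdin order by points desc' -t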
I didn’t want to break backwards compatibility for existing users of the sqlite-utils insert command, so I’ve added type detection there as a new option, --detect-types or -d for short:
$ sqlite-utils insert my.db us_states us-states.csv --csv -d
[####################################] 100%
$ sqlite-utils schema my.db
CREATE TABLE "us_states" (
[date] TEXT,
[state] TEXT,
[fips] INTEGER,
[cases] INTEGER,
[deaths] INTEGER
);
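For comparison, this is my understanding of the default behavior without -d: the same import creates every column as TEXT (schema shown as I’d expect it, using a hypothetical my2.db file):
$ sqlite-utils insert my2.db us_states us-states.csv --csv
$ sqlite-utils schema my2.db
CREATE TABLE "us_states" (
[date] TEXT,
[state] TEXT,
[fips] TEXT,
[cases] TEXT,
[deaths] TEXT
);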
There’s more in the changelog.
Releases this week
- sqlite-utils: 3.10—(78 releases total)—2021-06-19
  Python CLI utility and library for manipulating SQLite databases
- dogsheep-beta: 0.10.2—(20 releases total)—2021-06-13
  Build a search index across content from multiple SQLite database tables and run faceted searches against it using Datasette
- yaml-to-sqlite: 1.0—(5 releases total)—2021-06-13
  Utility for converting YAML files to SQLite
- markdown-to-sqlite: 1.0—(2 releases total)—2021-06-13
  CLI tool for loading markdown files into a SQLite database