Joining CSV and JSON data with an in-memory SQLite database
19th June 2021
The new sqlite-utils memory command can import CSV and JSON data directly into an in-memory SQLite database, combine and query it using SQL and output the results as CSV, JSON or various other formats of plain text tables.
sqlite-utils memory
The new feature is part of sqlite-utils 3.10, which I released this morning. You can install it using brew install sqlite-utils or pip install sqlite-utils.
I’ve recorded this video demonstrating the new feature—with full accompanying notes below.
sqlite-utils already offers a mechanism for importing CSV and JSON data into a SQLite database file, in the form of the sqlite-utils insert command. Processing data with this involves two steps: first import it into a temp.db file, then use sqlite-utils query to run queries and output the results.
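Here’s a minimal sketch of that two-step workflow (the data.csv file and the mydata table name are placeholders for illustration):
$ sqlite-utils insert temp.db mydata data.csv --csv
$ sqlite-utils query temp.db 'select * from mydata limit 5'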
Using SQL to re-shape data is really useful—since sqlite-utils can output in multiple different formats, I frequently find myself loading in a CSV file and exporting it back out as JSON, or vice-versa.
This week I realized that I had most of the pieces in place to reduce this to a single step. The new sqlite-utils memory command (full documentation here) operates against a temporary, in-memory SQLite database. It can import data, execute SQL and output the result in a one-liner, without needing any temporary database files along the way.
Here’s an example. My Dogsheep GitHub organization has a number of repositories. GitHub make those available via an authentication-optional API endpoint at https://api.github.com/users/dogsheep/repos—which returns JSON that looks like this (simplified):
[
{
"id": 197431109,
"name": "dogsheep-beta",
"full_name": "dogsheep/dogsheep-beta",
"size": 61,
"stargazers_count": 79,
"watchers_count": 79,
"forks": 0,
"open_issues": 11
},
{
"id": 256834907,
"name": "dogsheep-photos",
"full_name": "dogsheep/dogsheep-photos",
"size": 64,
"stargazers_count": 116,
"watchers_count": 116,
"forks": 5,
"open_issues": 18
}
]
With sqlite-utils memory we can see the 3 most popular repos by number of stars like this:
$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
' -t
full_name forks_count stars
-------------------------- ------------- -------
dogsheep/twitter-to-sqlite 12 225
dogsheep/github-to-sqlite 14 139
dogsheep/dogsheep-photos 5 116
We’re using curl to fetch the JSON and pipe it into sqlite-utils memory—the - means “read from standard input”. Then we pass the following SQL query:
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
stdin is the temporary table created for the data piped in to the tool. The query selects three of the JSON properties, renames stargazers_count to stars, sorts by stars and returns the first three.
The -t option here means “output as a formatted table”—without that option we get JSON:
$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
'
[{"full_name": "dogsheep/twitter-to-sqlite", "forks_count": 12, "stars": 225},
{"full_name": "dogsheep/github-to-sqlite", "forks_count": 14, "stars": 139},
{"full_name": "dogsheep/dogsheep-photos", "forks_count": 5, "stars": 116}]
Or we can use --csv to get back CSV:
$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
' --csv
full_name,forks_count,stars
dogsheep/twitter-to-sqlite,12,225
dogsheep/github-to-sqlite,14,139
dogsheep/dogsheep-photos,5,116
The -t option supports a number of different formats, specified using --fmt. If I wanted to generate a LaTeX table of the top repos by stars I could do this:
$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
' -t --fmt=latex
\begin{tabular}{lrr}
\hline
full\_name & forks\_count & stars \\
\hline
dogsheep/twitter-to-sqlite & 12 & 225 \\
dogsheep/github-to-sqlite & 14 & 139 \\
dogsheep/dogsheep-photos & 5 & 116 \\
\hline
\end{tabular}
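The --fmt option accepts the table format names from the Python tabulate library, which is what sqlite-utils uses for its -t output. For example, --fmt=github should render the same result as a GitHub-flavored Markdown table:
$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3
' -t --fmt=github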
We can run aggregate queries too—let’s add up the total size and total number of stars across all of those repositories:
$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
select sum(size), sum(stargazers_count) from stdin
' -t
sum(size) sum(stargazers_count)
----------- -----------------------
843 934
(I believe size here is measured in kilobytes: the GitHub API documentation isn’t clear on this point.)
Joining across different files
All of these examples have worked with JSON data piped into the tool—but you can also pass one or more files, of different formats, in a way that lets you execute joins against them.
As an example, let’s combine two sources of data:
- The New York Times publish a us-states.csv file with Covid cases and deaths by state over time.
- The CDC have an undocumented JSON endpoint (which I’ve been archiving here) tracking the progress of vaccination across different states.
We’re going to run a join from that CSV data to that JSON data, and output a table of results.
First, we need to download the files. The CDC JSON data isn’t quite in the right shape for our purposes:
{
"runid": 2023,
"vaccination_data": [
{
"Date": "2021-06-19",
"Location": "US",
"ShortName": "USA",
...
sqlite-utils expects a flat JSON array of objects—we can use jq to re-shape the data like so:
$ curl 'https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data' \
| jq .vaccination_data > vaccination_data.json
The New York Times data is good as is:
$ wget 'https://github.com/nytimes/covid-19-data/raw/master/us-states.csv'
Now that we have the data locally, we can run a join to combine it using the following command:
$ sqlite-utils memory us-states.csv vaccination_data.json "
select
max(t1.date),
t1.state,
t1.cases,
t1.deaths,
t2.Census2019,
t2.Dist_Per_100K
from
t1
join t2 on t1.state = replace(t2.LongName, 'New York State', 'New York')
group by
t1.state
order by
Dist_Per_100K desc
" -t
max(t1.date) state cases deaths Census2019 Dist_Per_100K
-------------- ------------------------ ------- -------- ------------ ---------------
2021-06-18 District of Columbia 49243 1141 705749 149248
2021-06-18 Vermont 24360 256 623989 146257
2021-06-18 Rhode Island 152383 2724 1059361 141291
2021-06-18 Massachusetts 709263 17960 6892503 139692
2021-06-18 Maryland 461852 9703 6045680 138193
2021-06-18 Maine 68753 854 1344212 136894
2021-06-18 Hawaii 35903 507 1415872 136024
...
I’m using automatically created numeric aliases t1 and t2 for the files here, but I can also use their full table names "us-states" (quotes needed due to the hyphen) and vaccination_data instead.
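As a sketch, the from and join lines in the query above could be rewritten against those named tables, aliasing them back to t1 and t2 so the rest of the query stays unchanged:
from
"us-states" as t1
join t2 on t1.state = replace(t2.LongName, 'New York State', 'New York')
where t2 stands in for vaccination_data as t2.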
The replace() operation there is needed because the vaccination_data.json file calls New York “New York State” while the us-states.csv file just calls it “New York”.
The max(t1.date) and group by t1.state combination is a useful SQLite trick: if you perform a group by and then ask for the max() of a value, the other columns returned from that table will be the columns for the row that contains that maximum value.
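Here’s a minimal sketch of that trick against just the CSV file, pulling each state’s most recent row (output omitted; the limit is only there to keep the result short):
$ sqlite-utils memory us-states.csv '
select max(date), state, cases, deaths
from t1 group by state limit 3
' -t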
This demo is a bit of a stretch—once I reach this level of complexity I’m more likely to load the files into a SQLite database file on disk and open them up in Datasette—but it’s a fun example of a more complex join in action.
Also in sqlite-utils 3.10
The sqlite-utils memory command has another new trick up its sleeve: it automatically detects which columns in a CSV or TSV file contain integer or float values and creates the corresponding in-memory SQLite table with the correct types. This ensures max() and sum() and order by work in a predictable manner, without accidentally sorting 9 as higher than 11.
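Here’s a quick way to see the detection at work, relying on the automatic format detection for standard input described above (the tiny CSV is invented for the demonstration). With points detected as an integer column, 11 sorts above 9; as text, '9' would incorrectly sort above '11':
$ printf 'name,points\nalpha,9\nbeta,11\n' \
| sqlite-utils memory - 'select * from stdin order by points desc' -t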
I didn’t want to break backwards compatibility for existing users of the sqlite-utils insert command, so I’ve added type detection there as a new option, --detect-types or -d for short:
$ sqlite-utils insert my.db us_states us-states.csv --csv -d
[####################################] 100%
$ sqlite-utils schema my.db
CREATE TABLE "us_states" (
[date] TEXT,
[state] TEXT,
[fips] INTEGER,
[cases] INTEGER,
[deaths] INTEGER
);
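For comparison, this is my understanding of the default behavior without -d: the same import creates every column as TEXT (schema shown as I’d expect it, using a hypothetical my2.db file):
$ sqlite-utils insert my2.db us_states us-states.csv --csv
$ sqlite-utils schema my2.db
CREATE TABLE "us_states" (
[date] TEXT,
[state] TEXT,
[fips] TEXT,
[cases] TEXT,
[deaths] TEXT
);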
There’s more in the changelog.
Releases this week
- sqlite-utils: 3.10—(78 releases total)—2021-06-19
  Python CLI utility and library for manipulating SQLite databases
- dogsheep-beta: 0.10.2—(20 releases total)—2021-06-13
  Build a search index across content from multiple SQLite database tables and run faceted searches against it using Datasette
- yaml-to-sqlite: 1.0—(5 releases total)—2021-06-13
  Utility for converting YAML files to SQLite
- markdown-to-sqlite: 1.0—(2 releases total)—2021-06-13
  CLI tool for loading markdown files into a SQLite database