sqlite-comprehend: run AWS entity extraction against content in a SQLite database
11th July 2022
I built a new tool this week: sqlite-comprehend, which passes text from a SQLite database through the AWS Comprehend entity extraction service and stores the returned entities.
I created this as a complement to my s3-ocr tool, which uses the AWS Textract service to perform OCR against every PDF file in an S3 bucket.
Short version: given a database table full of text, run the following:
% pip install sqlite-comprehend
% sqlite-comprehend entities myblog.db blog_entry body --strip-tags
[###---------------------------------] 9% 00:01:02
This will churn through every piece of text in the body column of the blog_entry table in the myblog.db SQLite database, strip any HTML tags (the --strip-tags option), submit it to AWS Comprehend, and store the extracted entities in the following tables:
- comprehend_entities—the extracted entities, classified by type
- blog_entry_comprehend_entities—a table relating entities to the entries that they appear in
- comprehend_entity_types—a small lookup table of entity types
The above table names link to a live demo produced by running the tool against all of the content in my blog.
Here are 225 mentions that Comprehend classified as the organization called “Mozilla”.
The tool tracks which rows have been processed already (in the blog_entry_comprehend_entities_done table), so you can run it multiple times and it will only process newly added rows.
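To show how those tables fit together, here's a sketch of a SQL query that counts the most frequently mentioned entities. The join column names (entity on the relationship table, value on the types table) are my assumptions based on the table descriptions above, not confirmed from the tool's schema, so check the live demo before relying on them:

-- Assumed column names; see the demo's schema for the real layout
select
  comprehend_entities.name,
  comprehend_entity_types.value as entity_type,
  count(*) as num_mentions
from blog_entry_comprehend_entities
  join comprehend_entities
    on blog_entry_comprehend_entities.entity = comprehend_entities.id
  join comprehend_entity_types
    on comprehend_entities.type = comprehend_entity_types.id
group by comprehend_entities.name, entity_type
order by num_mentions desc limit 10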
AWS Comprehend pricing starts at $0.0001 per unit of 100 characters. sqlite-comprehend only submits the first 5,000 characters of each row, so each row costs at most 50 × $0.0001 = $0.005.
How the demo works
My live demo for this tool uses a new Datasette instance at datasette.simonwillison.net. It hosts a complete copy of the data from my blog—data that lives in a Django/PostgreSQL database on Heroku, but is now mirrored to a SQLite database hosted by Datasette.
The demo runs out of my simonwillisonblog-backup GitHub repository.
A couple of years ago I realized that I’m no longer happy having any content I care about not stored in a Git repository. I want to track my changes! I also want really robust backups: GitHub mirror their repos to three different regions around the world, and having data in a Git repository makes mirroring it somewhere else as easy as running a git pull.
So I created simonwillisonblog-backup using a couple of my other tools: db-to-sqlite, which converts a PostgreSQL database to a SQLite database, and sqlite-diffable, which dumps out a SQLite database as a “diffable” directory of newline-delimited JSON files.
Here’s the simplest version of that pattern:
db-to-sqlite \
'postgresql+psycopg2://user:pass@hostname:5432/dbname' \
simonwillisonblog.db --all
This connects to PostgreSQL, loops through all of the database tables and converts each of them to a SQLite table stored in simonwillisonblog.db.
sqlite-diffable dump simonwillisonblog.db simonwillisonblog --all
This converts that SQLite database into a directory of JSON files. Each table gets two files: table.metadata.json, containing the table’s name, columns and schema, and table.ndjson, containing a newline-separated list of JSON arrays representing every row in that table.
You can see these files for my blog’s database in the simonwillisonblog folder.
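To illustrate the format, here's roughly what the two files for the blog_tag table could look like; the exact column list is best confirmed from the files in that folder.

blog_tag.metadata.json describes the table:

{
    "name": "blog_tag",
    "columns": ["id", "tag"],
    "schema": "CREATE TABLE [blog_tag] ([id] INTEGER PRIMARY KEY, [tag] TEXT)"
}

blog_tag.ndjson holds one JSON array per row:

[1, "python"]
[2, "django"]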
(My actual script is a little more complex, because I back up only selected tables and then run extra code to redact some of the fields.)
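db-to-sqlite itself can also restrict the export and redact columns, via its --table and --redact options. Here's a sketch of that shape; the table and column names are illustrative, not my actual script:

# Sketch: export only selected tables, replacing a sensitive column
db-to-sqlite \
  'postgresql+psycopg2://user:pass@hostname:5432/dbname' \
  simonwillisonblog.db \
  --table blog_entry \
  --table blog_tag \
  --table auth_user \
  --redact auth_user password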
Since I set this up it’s captured over 600 changes I’ve applied to my blog’s database, all made through the regular Django admin interface.
This morning I extended the script to run sqlite-comprehend against my blog entries and deploy the resulting data using Datasette.
The concise version of the new script looks like this:
wget -q https://datasette.simonwillison.net/simonwillisonblog.db
This retrieves the previous version of the database. I do this to avoid being charged by AWS Comprehend for running entity extraction against rows I have already processed.
sqlite-diffable load simonwillisonblog.db simonwillisonblog --replace
This creates the simonwillisonblog.db database by loading in the JSON from the simonwillisonblog/ folder. I do it this way mainly to exercise the new sqlite-diffable load command I just added to that tool. The --replace option ensures that any tables that already exist are replaced by a fresh copy (while leaving my existing comprehend entity extraction data intact).
sqlite-comprehend entities simonwillisonblog.db blog_entry title body --strip-tags
This runs sqlite-comprehend against the blog entries that have not yet been processed.
set +e
sqlite-utils enable-fts simonwillisonblog.db blog_series title summary --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_tag tag --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_quotation quotation source --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_entry title body --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_blogmark link_title via_title commentary --create-triggers --tokenize porter 2>/dev/null
set -e
This configures SQLite full-text search against each of those tables, using the set +e / set -e pattern to suppress any errors that occur if the FTS tables already exist.
Setting up FTS in this way means I can use the datasette-search-all plugin to run searches like this one for aws across all of those tables at once.
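Each enable-fts call creates a tablename_fts virtual table alongside the original. As a sketch, assuming FTS5 (sqlite-utils' default where available), a direct SQL search against the entries table looks something like this:

-- Join the FTS virtual table back to blog_entry on rowid
select blog_entry.id, blog_entry.title
from blog_entry
  join blog_entry_fts on blog_entry_fts.rowid = blog_entry.rowid
where blog_entry_fts match 'aws'
order by blog_entry_fts.rank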
datasette publish cloudrun simonwillisonblog.db \
-m metadata.yml \
--service simonwillisonblog \
--install datasette-block-robots \
--install datasette-graphql \
--install datasette-search-all
This uses the datasette publish command to deploy the datasette.simonwillison.net site to Google Cloud Run.
I’m adding two more plugins here: datasette-block-robots to avoid search engine crawlers indexing a duplicate copy of my blog’s content, and datasette-graphql to enable GraphQL queries against my data.
Here’s an example GraphQL query that returns my most recent blog entries that are tagged with datasette.
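That linked query leans on datasette-graphql's relationship support. As a simpler sketch of the general query shape the plugin exposes (the blog_entry column names here are assumptions about my schema), fetching the most recent entries looks something like:

{
  blog_entry(first: 5, sort_desc: created) {
    nodes {
      id
      title
      created
    }
  }
}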
Releases this week
- sqlite-comprehend: 0.2.1—(4 releases total)—2022-07-11
  Tools for running data in a SQLite database through AWS Comprehend
- sqlite-diffable: 0.4—(5 releases total)—2022-07-11
  Tools for dumping/loading a SQLite database to diffable directory structure
- datasette-redirect-to-https: 0.2—(2 releases total)—2022-07-04
  Datasette plugin that redirects all non-https requests to https
- datasette-unsafe-expose-env: 0.1.1—(2 releases total)—2022-07-03
  Datasette plugin to expose some environment variables at /-/env for debugging
- datasette-expose-env: 0.1—2022-07-03
  Datasette plugin to expose selected environment variables at /-/env for debugging
- datasette-upload-csvs: 0.7.2—(10 releases total)—2022-07-03
  Datasette plugin for uploading CSV files and converting them to database tables
- datasette-packages: 0.2—(3 releases total)—2022-07-03
  Show a list of currently installed Python packages
- datasette-graphql: 2.1—(35 releases total)—2022-07-03
  Datasette plugin providing an automatic GraphQL API for your SQLite databases
- datasette-edit-schema: 0.5—(9 releases total)—2022-07-01
  Datasette plugin for modifying table schemas