sqlite-history: tracking changes to SQLite tables using triggers (also weeknotes)
15th April 2023
In between blogging about ChatGPT rhetoric, micro-benchmarking with ChatGPT Code Interpreter and Why prompt injection is an even bigger problem now I managed to ship the beginnings of a new project: sqlite-history.
sqlite-history
Recording changes made to a database table is a problem that has popped up consistently throughout my entire career. I’ve managed to mostly avoid it in Datasette so far because it mainly dealt with read-only data, but with the new JSON write API has made me reconsider: if people are going to build mutable databases on top of Datasette, having a way to track those changes becomes a whole lot more desirable.
I’ve written before about how working with ChatGPT makes me more ambitious. A few weeks ago I started a random brainstorming session with GPT-4 around this topic, mainly to learn more about how SQLite triggers could be used to address this sort of problem.
Here’s the resulting transcript. It turns out ChatGPT makes for a really useful brainstorming partner.
Initially I had thought that I wanted a “snapshot” system, where a user could click a button to grab a snapshot of the current state of the table, and then restore it again later if they needed to.
I quickly realized that a system for full change tracking would be easier to build, and provide more value to users.
sqlite-history 0.1 is the first usable version of this system. It’s still very early and should be treated as unstable software, but initial testing results have been very positive so far.
The key idea is that for each table that is tracked, a separate _tablename_history
table is created. For example:
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name TEXT,
age INTEGER,
weight REAL
);
Gets a history table like this:
CREATE TABLE _people_history (
_rowid INTEGER,
id INTEGER,
name TEXT,
age INTEGER,
weight REAL,
_version INTEGER,
_updated INTEGER,
_mask INTEGER
);
CREATE INDEX idx_people_history_rowid ON _people_history (_rowid);
AS you can see, the history table includes the columns from the original table, plus four extra integer columns for tracking different things:
-
_rowid
corresponds to the SQLiterowid
of the parent table—which is automatically and invisibly created for all SQLite tables. This is how history records map back to their corresponding row. -
_version
is an incrementing version number for each individal tracked row -
_updated
records a millisecond-precision timestamp for when the row was updated—see this TIL. -
_mask
is an integer bitmap recording which columns in the row were updated in a specific change.
The _mask
column is particularly important to this design.
The simplest way to implement history is to create a full copy of the previous state of a row every time it is updated.
This has a major downside: if the rows include large amounts of content—a content_html
column on a blog for example—you end up storing a full copy of that data every time you make an edit, even if it was just a tweak to a headline.
I didn’t want to duplicate that much data.
An alternative approach is to store null
for any column that didn’t change since the previous version. This saves on space, but introduces a new challenge: what if the user updated a column and set the new value to null
? That change would be indistinguishable from no change at all.
My solution then is to use this _mask
column. Every column in the table gets a power-of-two number—1, 2, 4, 8 for id
, name
, age
and weight
respectively.
The _mask
then records the sum of those numbers as a bitmask. In this way, the _history
row need only store information for columns that have changed, with an overhead of just four extra integer columns to record the metadata about that change.
Populating this history table can now be handled entirely using SQLite triggers. Here they are:
CREATE TRIGGER people_insert_history
AFTER INSERT ON people
BEGIN
INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
VALUES (new.rowid, new.id, new.name, new.age, new.weight, 1, cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer), 15);
END;
CREATE TRIGGER people_update_history
AFTER UPDATE ON people
FOR EACH ROW
BEGIN
INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
SELECT old.rowid,
CASE WHEN old.id != new.id then new.id else null end,
CASE WHEN old.name != new.name then new.name else null end,
CASE WHEN old.age != new.age then new.age else null end,
CASE WHEN old.weight != new.weight then new.weight else null end,
(SELECT MAX(_version) FROM _people_history WHERE _rowid = old.rowid) + 1,
cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer),
(CASE WHEN old.id != new.id then 1 else 0 end) + (CASE WHEN old.name != new.name then 2 else 0 end) + (CASE WHEN old.age != new.age then 4 else 0 end) + (CASE WHEN old.weight != new.weight then 8 else 0 end)
WHERE old.id != new.id or old.name != new.name or old.age != new.age or old.weight != new.weight;
END;
CREATE TRIGGER people_delete_history
AFTER DELETE ON people
BEGIN
INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
VALUES (
old.rowid,
old.id, old.name, old.age, old.weight,
(SELECT COALESCE(MAX(_version), 0) from _people_history WHERE _rowid = old.rowid) + 1,
cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer),
-1
);
END;
There are a couple of extra details here. The insert
trigger records a full copy of the row when it is first inserted, with a version number of 1.
The update
trigger is the most complicated. It includes some case
statements to populate the correct columns, and then a big case
statement at the end to add together the integers for that _mask
bitmask column.
The delete
trigger records the record that has just been deleted and sets the _mask
column to -1
as a way of marking it as a deletion. That idea was suggested by GPT-4!
Writing these triggers out by hands would be pretty arduous... so the sqlite-history
repository contains a Python library and CLI tool that can create those triggers automatically, either for specific tables:
python -m sqlite_history data.db table1 table2 table3
Or for all tables at once (excluding things like FTS tables):
python -m sqlite_history data.db --all
There are still a bunch of problems I want to solve. Open issues right now are:
- Functions for restoring tables or individual rows enhancement—recording history is a lot more interesting if you can easily restore from it! GPT-4 wrote a recursive CTE for this but I haven’t fully verified that it does the right thing yet.
-
Try saving space by not creating full duplicate history row until first edit—currently the insert trigger instantly creates a duplicate of the full row, doubling the amount of storage space needed. I’m contemplating a change where that first record would contain just
null
values, and then the first time a row was updated a record would be created containing the full original copy. -
Document how to handle alter table. Originally I had thought that altering a table would by necessity invalidate the history recorded so far, but I’ve realized that the
_mask
mechanism might actually be compatible with a subset of alterations—anything that adds a new column to the end of an existing table could work OK, since that column would get a new, incrementally larger mask value without disrupting previous records.
I’m also thinking about building a Datasette plugin on top of this library, to make it really easy to start tracking history of tables in an existing Datasette application.
Entries this week
- Prompt injection: what’s the worst that can happen?
- Running Python micro-benchmarks using the ChatGPT Code Interpreter alpha
- Thoughts on AI safety in this era of increasingly powerful open source LLMs
- The Changelog podcast: LLMs break the internet
- Path to Citus Con: Working in public
- We need to tell people ChatGPT will lie to them, not debate linguistics
Museums this week
Releases this week
-
asyncinject 0.6—2023-04-14
Run async workflows using pytest-fixtures-style dependency injection -
swarm-to-sqlite 0.3.4—2023-04-11
Create a SQLite database containing your checkin history from Foursquare Swarm -
sqlite-history 0.1—2023-04-09
Track changes to SQLite tables using triggers
TIL this week
- Running Dolly 2.0 on Paperspace—2023-04-12
- Creating desktop backgrounds using Midjourney—2023-04-10
- Unix timestamp in milliseconds in SQLite—2023-04-09
- Saving an in-memory SQLite database to a file in Python—2023-04-09
- GPT-4 for API design research—2023-04-06
More recent articles
- Qwen2.5-Coder-32B is an LLM that can code well that runs on my Mac - 12th November 2024
- Visualizing local election results with Datasette, Observable and MapLibre GL - 9th November 2024
- Project: VERDAD - tracking misinformation in radio broadcasts using Gemini 1.5 - 7th November 2024