94 items tagged “sql”
2024
Dashboard: Tools. I used Django SQL Dashboard to spin up a dashboard that shows all of the URLs to my tools.simonwillison.net site that I've shared on my blog so far. It uses this (Claude assisted) regular expression in a PostgreSQL SQL query:
select distinct on (tool_url)
unnest(regexp_matches(
body,
'(https://tools\.simonwillison\.net/[^<"\s)]+)',
'g'
)) as tool_url,
'https://simonwillison.net/' || left(type, 1) || '/' || id as blog_url,
title,
date(created) as created
from content
I've been really enjoying having a static hosting platform (it's GitHub Pages serving my simonw/tools repo) that I can use to quickly deploy little HTML+JavaScript interactive tools and demos.
PostgreSQL 17: SQL/JSON is here! (via) Hubert Lubaczewski dives into the new JSON features added in PostgreSQL 17, released a few weeks ago on the 26th of September. This is the latest in his long series of similar posts about new PostgreSQL features.
The features are based on the new SQL:2023 standard from June 2023. If you want to actually read the specification for SQL:2023 it looks like you have to buy a PDF from ISO for 194 Swiss Francs (currently $226). Here's a handy summary by Peter Eisentraut: SQL:2023 is finished: Here is what's new.
There's a lot of neat stuff in here. I'm particularly interested in the json_table()
table-valued function, which can convert a JSON string into a table with quite a lot of flexibility. You can even specify a full table schema as part of the function call:
SELECT * FROM json_table(
'[{"a":10,"b":20},{"a":30,"b":40}]'::jsonb,
'$[*]'
COLUMNS (
id FOR ORDINALITY,
column_a int4 path '$.a',
column_b int4 path '$.b',
a int4,
b int4,
c text
)
);
SQLite has solid JSON support already and often imitates PostgreSQL features, so I wonder if we'll see an update to SQLite that reflects some aspects of this new syntax.
Hybrid full-text search and vector search with SQLite. As part of Alex’s work on his sqlite-vec SQLite extension - adding fast vector lookups to SQLite - he’s been investigating hybrid search, where search results from both vector similarity and traditional full-text search are combined together.
The most promising approach looks to be Reciprocal Rank Fusion, which combines the top ranked items from both approaches. Here’s Alex’s SQL query:
-- the sqlite-vec KNN vector search results
with vec_matches as (
select
article_id,
row_number() over (order by distance) as rank_number,
distance
from vec_articles
where
headline_embedding match lembed(:query)
and k = :k
),
-- the FTS5 search results
fts_matches as (
select
rowid,
row_number() over (order by rank) as rank_number,
rank as score
from fts_articles
where headline match :query
limit :k
),
-- combine FTS5 + vector search results with RRF
final as (
select
articles.id,
articles.headline,
vec_matches.rank_number as vec_rank,
fts_matches.rank_number as fts_rank,
-- RRF algorithm
(
coalesce(1.0 / (:rrf_k + fts_matches.rank_number), 0.0) * :weight_fts +
coalesce(1.0 / (:rrf_k + vec_matches.rank_number), 0.0) * :weight_vec
) as combined_rank,
vec_matches.distance as vec_distance,
fts_matches.score as fts_score
from fts_matches
full outer join vec_matches on vec_matches.article_id = fts_matches.rowid
join articles on articles.rowid = coalesce(fts_matches.rowid, vec_matches.article_id)
order by combined_rank desc
)
select * from final;
I’ve been puzzled in the past over how to best do that because the distance scores from vector similarity and the relevance scores from FTS are meaningless in comparison to each other. RRF doesn’t even attempt to compare them - it uses them purely for row_number()
ranking within each set and combines the results based on that.
My goal is to keep SQLite relevant and viable through the year 2050. That's a long time from now. If I knew that standard SQL was not going to change any between now and then, I'd go ahead and make non-standard extensions that allowed for FROM-clause-first queries, as that seems like a useful extension. The problem is that standard SQL will not remain static. Probably some future version of "standard SQL" will support some kind of FROM-clause-first query format. I need to ensure that whatever SQLite supports will be compatible with the standard, whenever it drops. And the only way to do that is to support nothing until after the standard appears.
When will that happen? A month? A year? Ten years? Who knows.
I'll probably take my cue from PostgreSQL. If PostgreSQL adds support for FROM-clause-first queries, then I'll do the same with SQLite, copying the PostgreSQL syntax. Until then, I'm afraid you are stuck with only traditional SELECT-first queries in SQLite.
SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL (via) A new paper from Google Research describing custom syntax for analytical SQL queries that has been rolling out inside Google since February, reaching 1,600 "seven-day-active users" by August 2024.
A key idea is here is to fix one of the biggest usability problems with standard SQL: the order of the clauses in a query. Starting with SELECT
instead of FROM
has always been confusing, see SQL queries don't start with SELECT by Julia Evans.
Here's an example of the new alternative syntax, taken from the Pipe query syntax documentation that was added to Google's open source ZetaSQL project last week.
For this SQL query:
SELECT component_id, COUNT(*)
FROM ticketing_system_table
WHERE
assignee_user.email = 'username@email.com'
AND status IN ('NEW', 'ASSIGNED', 'ACCEPTED')
GROUP BY component_id
ORDER BY component_id DESC;
The Pipe query alternative would look like this:
FROM ticketing_system_table
|> WHERE
assignee_user.email = 'username@email.com'
AND status IN ('NEW', 'ASSIGNED', 'ACCEPTED')
|> AGGREGATE COUNT(*)
GROUP AND ORDER BY component_id DESC;
The Google Research paper is released as a two-column PDF. I snarked about this on Hacker News:
Google: you are a web company. Please learn to publish your research papers as web pages.
This remains a long-standing pet peeve of mine. PDFs like this are horrible to read on mobile phones, hard to copy-and-paste from, have poor accessibility (see this Mastodon conversation) and are generally just bad citizens of the web.
Having complained about this I felt compelled to see if I could address it myself. Google's own Gemini Pro 1.5 model can process PDFs, so I uploaded the PDF to Google AI Studio and prompted the gemini-1.5-pro-exp-0801
model like this:
Convert this document to neatly styled semantic HTML
This worked surprisingly well. It output HTML for about half the document and then stopped, presumably hitting the output length limit, but a follow-up prompt of "and the rest" caused it to continue from where it stopped and run until the end.
Here's the result (with a banner I added at the top explaining that it's a conversion): Pipe-Syntax-In-SQL.html
I haven't compared the two completely, so I can't guarantee there are no omissions or mistakes.
The figures from the PDF aren't present - Gemini Pro output tags like <img src="figure1.png" alt="Figure 1: SQL syntactic clause order doesn't match semantic evaluation order. (From [25].)">
but did nothing to help me create those images.
Amusingly the document ends with <p>(A long list of references, which I won't reproduce here to save space.)</p>
rather than actually including the references from the paper!
So this isn't a perfect solution, but considering it took just the first prompt I could think of it's a very promising start. I expect someone willing to spend more than the couple of minutes I invested in this could produce a very useful HTML alternative version of the paper with the assistance of Gemini Pro.
One last amusing note: I posted a link to this to Hacker News a few hours ago. Just now when I searched Google for the exact title of the paper my HTML version was already the third result!
I've now added a <meta name="robots" content="noindex, follow">
tag to the top of the HTML to keep this unverified AI slop out of their search index. This is a good reminder of how much better HTML is than PDF for sharing information on the web!
Optimizing Datasette (and other weeknotes)
I’ve been working with Alex Garcia on an experiment involving using Datasette to explore FEC contributions. We currently have a 11GB SQLite database—trivial for SQLite to handle, but at the upper end of what I’ve comfortably explored with Datasette in the past.
[... 2,069 words]How to Get or Create in PostgreSQL (via) Get or create - for example to retrieve an existing tag record from a database table if it already exists or insert it if it doesn’t - is a surprisingly difficult operation.
Haki Benita uses it to illustrate a variety of interesting PostgreSQL concepts.
New to me: a pattern that runs INSERT INTO tags (name) VALUES (tag_name) RETURNING *;
and then catches the constraint violation and returns a record instead has a disadvantage at scale: “The table contains a dead tuple for every attempt to insert a tag that already existed” - so until vacuum runs you can end up with significant table bloat!
Haki’s conclusion is that the best solution relies on an upcoming feature coming in PostgreSQL 17: the ability to combine the MERGE operation with a RETURNING clause:
WITH new_tags AS (
MERGE INTO tags
USING (VALUES ('B'), ('C')) AS t(name)
ON tags.name = t.name
WHEN NOT MATCHED THEN
INSERT (name) VALUES (t.name)
RETURNING *
)
SELECT * FROM tags WHERE name IN ('B', 'C')
UNION ALL
SELECT * FROM new_tags;
I wonder what the best pattern for this in SQLite is. Could it be as simple as this?
INSERT OR IGNORE INTO tags (name) VALUES ('B'), ('C');
The SQLite INSERT documentation doesn't currently provide extensive details for INSERT OR IGNORE
, but there are some hints in this forum thread. This post by Rob Hoelz points out that INSERT OR IGNORE
will silently ignore any constraint violation, so INSERT INTO tags (tag) VALUES ('C'), ('D') ON CONFLICT(tag) DO NOTHING
may be a better option.
Build your own SQS or Kafka with Postgres (via) Anthony Accomazzo works on Sequin, an open source "message stream" (similar to Kafka) written in Elixir and Go on top of PostgreSQL.
This detailed article describes how you can implement message queue patterns on PostgreSQL from scratch, including this neat example using a CTE, returning
and for update skip locked
to retrieve $1
messages from the messages
table and simultaneously mark them with not_visible_until
set to $2
in order to "lock" them for processing by a client:
with available_messages as (
select seq
from messages
where not_visible_until is null
or (not_visible_until <= now())
order by inserted_at
limit $1
for update skip locked
)
update messages m
set
not_visible_until = $2,
deliver_count = deliver_count + 1,
last_delivered_at = now(),
updated_at = now()
from available_messages am
where m.seq = am.seq
returning m.seq, m.data;
DuckDB 1.0 (via) Six years in the making. The most significant feature in this milestone is stability of the file format: previous releases often required files to be upgraded to work with the new version.
This release also aspires to provide stability for both the SQL dialect and the C API, though these may still change with sufficient warning in the future.
Why SQLite Uses Bytecode (via) Brand new SQLite architecture documentation by D. Richard Hipp explaining the trade-offs between a bytecode based query plan and a tree of objects.
SQLite uses the bytecode approach, which provides an important characteristic that SQLite can very easily execute queries incrementally—stopping after each row, for example. This is more useful for a local library database than for a network server where the assumption is that the entire query will be executed before results are returned over the wire.
Optimizing SQLite for servers (via) Sylvain Kerkour’s comprehensive set of lessons learned running SQLite for server-based applications.
There’s a lot of useful stuff in here, including detailed coverage of the different recommended PRAGMA settings.
There was also a tip I haven’t seen before about “BEGIN IMMEDIATE” transactions:
“By default, SQLite starts transactions in DEFERRED mode: they are considered read only. They are upgraded to a write transaction that requires a database lock in-flight, when query containing a write/update/delete statement is issued.
The problem is that by upgrading a transaction after it has started, SQLite will immediately return a SQLITE_BUSY error without respecting the busy_timeout previously mentioned, if the database is already locked by another connection.
This is why you should start your transactions with BEGIN IMMEDIATE instead of only BEGIN. If the database is locked when the transaction starts, SQLite will respect busy_timeout.”
Them: Can you just quickly pull this data for me?
Me: Sure, let me just:
SELECT * FROM some_ideal_clean_and_pristine.table_that_you_think_exists
DuckDB as the New jq (via) The DuckDB CLI tool can query JSON files directly, making it a surprisingly effective replacement for jq. Paul Gross demonstrates the following query:
select license->>'key' as license, count(*) from 'repos.json' group by 1
repos.json
contains an array of {"license": {"key": "apache-2.0"}..}
objects. This example query shows counts for each of those licenses.
Endatabas (via) Endatabas is “an open source immutable database”—also described as “SQL document database with full history”.
It uses a variant of SQL which allows you to insert data into tables that don’t exist yet (they’ll be created automatically) then run standard select queries, joins etc. It maintains a full history of every record and supports the recent SQL standard “FOR SYSTEM_TIME AS OF” clause for retrieving historical records as they existed at a specified time (it defaults to the most recent versions).
It’s written in Common Lisp plus a bit of Rust, and includes Docker images for running the server and client libraries in JavaScript and Python. The on-disk storage format is Apache Arrow, the license is AGPL and it’s been under development for just over a year.
It’s also a document database: you can insert JSON-style nested objects directly into a table, and query them with path expressions like “select users.friends[1] from users where id = 123;”
They have a WebAssembly version and a nice getting started tutorial which you can try out directly in your browser.
Their “Why?” page lists full history, time travel queries, separation of storage from compute, schemaless tables and columnar storage as the five pillars that make up their product. I think it’s a really interesting amalgamation of ideas.
Announcing DuckDB 0.10.0.
Somewhat buried in this announcement: DuckDB has Fixed-Length Arrays now, along with array_cross_product(a1, a2)
, array_cosine_similarity(a1, a2)
and array_inner_product(a1, a2)
functions.
This means you can now use DuckDB to find related content (and other tricks) using vector embeddings!
Also notable:
DuckDB can now attach MySQL, Postgres, and SQLite databases in addition to databases stored in its own format. This allows data to be read into DuckDB and moved between these systems in a convenient manner, as attached databases are fully functional, appear just as regular tables, and can be updated in a safe, transactional manner.
SQL for Data Scientists in 100 Queries. New comprehensive SQLite SQL tutorial from Greg Wilson, author of Teaching Tech Together and founder of The Carpentries.
GPT in 500 lines of SQL (via) Utterly brilliant piece of PostgreSQL hackery by Alex Bolenok, who implements a full GPT-2 style language model in SQL on top of pg_vector. The final inference query is 498 lines long!
2023
Making PostgreSQL tick: New features in pg_cron (via) pg_cron adds cron-style scheduling directly to PostgreSQL. It's a pretty mature extension at this point, and recently gained the ability to schedule repeating tasks at intervals as low as every 1s.
The examples in this post are really informative. I like this example, which cleans up the ever-growing cron.job_run_details table by using pg_cron itself to run the cleanup:
SELECT cron.schedule('delete-job-run-details', '0 12 * * *', $$DELETE FROM cron.job_run_details WHERE end_time < now() - interval '3 days'$$);
pg_cron can be used to schedule functions written in PL/pgSQL, which is a great example of the kind of DSL that I used to avoid but I'm now much happier to work with because I know GPT-4 can write basic examples for me and help me understand exactly what unfamiliar code is doing.
Upsert in SQL (via) Anton Zhiyanov is currently on a one-man quest to write detailed documentation for all of the fundamental SQL operations, comparing and contrasting how they work across multiple engines, generally with interactive examples.
Useful tips in here on why “insert... on conflict” is usually a better option than “insert or replace into” because the latter can perform a delete and then an insert, firing triggers that you may not have wanted to be fired.
Online gradient descent written in SQL (via) Max Halford trains an online gradient descent model against two years of AAPL stock data using just a single advanced SQL query. He built this against DuckDB—I tried to replicate his query in SQLite and it almost worked, but it gave me a “recursive reference in a subquery” error that I was unable to resolve.
2022
konstantint/SKompiler (via) A tool for compiling trained SKLearn models into other representations —including SQL queries and Excel formulas. I’ve been pondering the most light-weight way to package a simple machine learning model as part of a larger application without needing to bundle heavy dependencies, this set of techniques looks ideal!
[SQLite is] a database that in full-stack culture has been relegated to "unit test database mock" for about 15 years that is (1) surprisingly capable as a SQL engine, (2) the simplest SQL database to get your head around and manage, and (3) can embed directly in literally every application stack, which is especially interesting in latency-sensitive and globally-distributed applications.
Reason (3) is clearly our ulterior motive here, so we're not disinterested: our model user deploys a full-stack app (Rails, Elixir, Express, whatever) in a bunch of regions around the world, hoping for sub-100ms responses for users in most places around the world. Even within a single data center, repeated queries to SQL servers can blow that budget. Running an in-process SQL server neatly addresses it.
Efficient Pagination Using Deferred Joins (via) Surprisingly simple trick for speeding up deep OFFSET x LIMIT y pagination queries, which get progressively slower as you paginate deeper into the data. Instead of applying them directly, apply them to a “select id from ...” query to fetch just the IDs, then either use a join or run a separate “select * from table where id in (...)” query to fetch the full records for that page.
The Checkered Flag Diagram for visualizing SQL joins. I really like this alternative to Venn diagrams for showing the difference between different types of SQL join (left join, right join, cross join etc).
Joining CSV files in your browser using Datasette Lite
I added a new feature to Datasette Lite—my version of Datasette that runs entirely in your browser using WebAssembly (previously): you can now use it to load one or more CSV files by URL, and then run SQL queries against them—including joins across data from multiple files.
[... 546 words]How Materialize and other databases optimize SQL subqueries. Jamie Brandon offers a survey of the state-of-the-art in optimizing correlated subqueries, across a number of different database engines.
2021
SQLModel. A new project by FastAPI creator Sebastián Ramírez: SQLModel builds on top of both SQLAlchemy and Sebastián’s Pydantic validation library to provide a new ORM that’s designed around Python 3’s optional typing. The real brilliance here is that a SQLModel subclass is simultaneously a valid SQLAlchemy ORM model AND a valid Pydantic validation model, saving on duplicate code by allowing the same class to be used both for form/API validation and for interacting with the database.
Datasette on Codespaces, sqlite-utils API reference documentation and other weeknotes
This week I broke my streak of not sending out the Datasette newsletter, figured out how to use Sphinx for Python class documentation, worked out how to run Datasette on GitHub Codespaces, implemented Datasette column metadata and got tantalizingly close to a solution for an elusive Datasette feature.
[... 2,164 words]Bare columns in an aggregate queries. This is a really nice SQL tweak implemented in SQLite: If you run a query like “SELECT a, b, max(c) FROM tab1 GROUP BY a” SQLite will find the row with the highest value for c and use the columns of that row as the returned values for the other columns mentioned in the query.
Django SQL Dashboard 1.0
Earlier this week I released Django SQL Dashboard 1.0. I also just released 1.0.1, with a bug fix for PostgreSQL 10 contributed by Ryan Cheley.
[... 629 words]