129 items tagged “postgresql”
2024
Dashboard: Tools. I used Django SQL Dashboard to spin up a dashboard that shows all of the URLs to my tools.simonwillison.net site that I've shared on my blog so far. It uses this (Claude assisted) regular expression in a PostgreSQL SQL query:
select distinct on (tool_url)
  unnest(regexp_matches(
    body,
    '(https://tools\.simonwillison\.net/[^<"\s)]+)',
    'g'
  )) as tool_url,
  'https://simonwillison.net/' || left(type, 1) || '/' || id as blog_url,
  title,
  date(created) as created
from content
I've been really enjoying having a static hosting platform (it's GitHub Pages serving my simonw/tools repo) that I can use to quickly deploy little HTML+JavaScript interactive tools and demos.
PostgreSQL 17: SQL/JSON is here! (via) Hubert Lubaczewski dives into the new JSON features added in PostgreSQL 17, released a few weeks ago on the 26th of September. This is the latest in his long series of similar posts about new PostgreSQL features.
The features are based on the new SQL:2023 standard from June 2023. If you want to actually read the specification for SQL:2023 it looks like you have to buy a PDF from ISO for 194 Swiss Francs (currently $226). Here's a handy summary by Peter Eisentraut: SQL:2023 is finished: Here is what's new.
There's a lot of neat stuff in here. I'm particularly interested in the json_table()
table-valued function, which can convert a JSON string into a table with quite a lot of flexibility. You can even specify a full table schema as part of the function call:
SELECT * FROM json_table(
  '[{"a":10,"b":20},{"a":30,"b":40}]'::jsonb,
  '$[*]'
  COLUMNS (
    id FOR ORDINALITY,
    column_a int4 path '$.a',
    column_b int4 path '$.b',
    a int4,
    b int4,
    c text
  )
);
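Run against that sample input it produces output along these lines (my rendering, not copied from the post):

 id | column_a | column_b | a  | b  | c
----+----------+----------+----+----+---
  1 |       10 |       20 | 10 | 20 |
  2 |       30 |       40 | 30 | 40 |

The c column is null because the input objects have no "c" key.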
SQLite has solid JSON support already and often imitates PostgreSQL features, so I wonder if we'll see an update to SQLite that reflects some aspects of this new syntax.
Python Developers Survey 2023 Results (via) The seventh annual Python survey is out. Here are the things that caught my eye or that I found surprising:
25% of survey respondents had been programming in Python for less than a year, and 33% had less than a year of professional experience.
37% of Python developers reported contributing to open-source projects last year - a new question for the survey. This is delightfully high!
6% of users are still using Python 2. The survey notes:
Almost half of Python 2 holdouts are under 21 years old and a third are students. Perhaps courses are still using Python 2?
In web frameworks, Flask and Django are neck and neck at 33% each, but FastAPI is a close third at 29%! Starlette is at 6%, but that's an under-count because it's the basis for FastAPI.
The most popular library in "other framework and libraries" was BeautifulSoup with 31%, then Pillow 28%, then OpenCV-Python at 22% (wow!) and Pydantic at 22%. Tkinter had 17%. These numbers are all a surprise to me.
pytest scores 52% for unit testing, unittest from the standard library just 25%. I'm glad to see pytest so widely used - it's my favourite testing tool in any programming language.
The top cloud providers are AWS, then Google Cloud Platform, then Azure... but PythonAnywhere (11%) took fourth place just ahead of DigitalOcean (10%). And Alibaba Cloud is a new entrant in sixth place (after Heroku) with 4%. Heroku's ending of its free plan dropped them from 14% in 2021 to 7% now.
Linux and Windows are equal at 55%, macOS is at 29%. This was one of many multiple-choice questions that could add up to more than 100%.
In databases, SQLite usage was trending down - from 38% in 2021 to 34% in 2023 - but it's still in second place behind PostgreSQL, which is stable at 43%.
The survey incorporates quotes from different Python experts responding to the numbers - it's worth reading through the whole thing.
My goal is to keep SQLite relevant and viable through the year 2050. That's a long time from now. If I knew that standard SQL was not going to change any between now and then, I'd go ahead and make non-standard extensions that allowed for FROM-clause-first queries, as that seems like a useful extension. The problem is that standard SQL will not remain static. Probably some future version of "standard SQL" will support some kind of FROM-clause-first query format. I need to ensure that whatever SQLite supports will be compatible with the standard, whenever it drops. And the only way to do that is to support nothing until after the standard appears.
When will that happen? A month? A year? Ten years? Who knows.
I'll probably take my cue from PostgreSQL. If PostgreSQL adds support for FROM-clause-first queries, then I'll do the same with SQLite, copying the PostgreSQL syntax. Until then, I'm afraid you are stuck with only traditional SELECT-first queries in SQLite.
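DuckDB already ships a FROM-first form, which gives a sense of what that syntax could eventually look like (a purely illustrative example - neither SQLite nor PostgreSQL supports this today):

FROM tags SELECT name;
-- equivalent to the traditional SELECT-first version:
SELECT name FROM tags;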
New Django {% querystring %} template tag. Django 5.1 came out last week and includes a neat new template tag which solves a problem I've faced a bunch of times in the past.
{% querystring color="red" size="S" %}
Adds ?color=red&size=S
to the current URL - keeping any other existing parameters and replacing the current value for color
or size
if it's already set.
{% querystring color=None %}
Removes the ?color=
parameter if it is currently set.
If the value passed is a list it will append ?color=red&color=blue
for as many items as exist in the list.
You can access values in variables and you can also assign the result to a new template variable rather than outputting it directly to the page:
{% querystring page=page.next_page_number as next_page %}
Other things that caught my eye in Django 5.1:
- PostgreSQL connection pools.
- The new LoginRequiredMiddleware for making every page in an application require login.
- The SQLite database backend now accepts init_command for setting things like PRAGMA cache_size=2000 on new connections.
- SQLite can also be passed "transaction_mode": "IMMEDIATE" to configure the behaviour of transactions.
SQL Injection Isn’t Dead: Smuggling Queries at the Protocol Level (via) PDF slides from a presentation by Paul Gerste at DEF CON 32. It turns out some databases have vulnerabilities in their binary protocols that can be exploited by carefully crafted SQL queries.
Paul demonstrates an attack against PostgreSQL (which works in some but not all of the PostgreSQL client libraries) which uses a message size overflow, by embedding a string longer than 4GB (2**32 bytes) which overflows the maximum length of a string in the underlying protocol and writes data to the subsequent value. He then shows a similar attack against MongoDB.
The current way to protect against these attacks is to ensure a size limit on incoming requests. This can be more difficult than you may expect - Paul points out that alternative paths such as WebSockets might bypass limits that are in place for regular HTTP requests, plus some servers may apply limits before decompression, allowing an attacker to send a compressed payload that is larger than the configured limit.
How to Get or Create in PostgreSQL (via) Get or create - for example to retrieve an existing tag record from a database table if it already exists or insert it if it doesn’t - is a surprisingly difficult operation.
Haki Benita uses it to illustrate a variety of interesting PostgreSQL concepts.
New to me: a pattern that runs INSERT INTO tags (name) VALUES (tag_name) RETURNING *;
and then catches the constraint violation and returns a record instead has a disadvantage at scale: “The table contains a dead tuple for every attempt to insert a tag that already existed” - so until vacuum runs you can end up with significant table bloat!
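A rough sketch of that insert-first pattern (my own rendering, not Haki's code), assuming a unique constraint on name:

-- optimistically insert, returning the new row if it worked
INSERT INTO tags (name) VALUES ('B') RETURNING *;
-- if this raises a unique constraint violation, the application rolls
-- back (or releases a savepoint) and reads the existing row instead:
SELECT * FROM tags WHERE name = 'B';
-- every failed insert attempt still leaves a dead tuple behind until
-- VACUUM catches up, which is the bloat problem described above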
Haki’s conclusion is that the best solution relies on an upcoming PostgreSQL 17 feature: the ability to combine the MERGE operation with a RETURNING clause:
WITH new_tags AS (
  MERGE INTO tags
  USING (VALUES ('B'), ('C')) AS t(name)
  ON tags.name = t.name
  WHEN NOT MATCHED THEN
    INSERT (name) VALUES (t.name)
  RETURNING *
)
SELECT * FROM tags WHERE name IN ('B', 'C')
UNION ALL
SELECT * FROM new_tags;
I wonder what the best pattern for this in SQLite is. Could it be as simple as this?
INSERT OR IGNORE INTO tags (name) VALUES ('B'), ('C');
The SQLite INSERT documentation doesn't currently provide extensive details for INSERT OR IGNORE
, but there are some hints in this forum thread. This post by Rob Hoelz points out that INSERT OR IGNORE
will silently ignore any constraint violation, so INSERT INTO tags (name) VALUES ('C'), ('D') ON CONFLICT(name) DO NOTHING
may be a better option.
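One more variant worth knowing about (my own sketch, not from the post): combining ON CONFLICT DO UPDATE with RETURNING reports back both the newly inserted rows and the ones that already existed, and works in SQLite 3.35+ as well as PostgreSQL:

INSERT INTO tags (name) VALUES ('B'), ('C')
ON CONFLICT (name) DO UPDATE SET name = excluded.name
RETURNING *;
-- caveat: in PostgreSQL the no-op update still writes a new row
-- version for existing tags, so Haki's bloat warning applies here too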
Build your own SQS or Kafka with Postgres (via) Anthony Accomazzo works on Sequin, an open source "message stream" (similar to Kafka) written in Elixir and Go on top of PostgreSQL.
This detailed article describes how you can implement message queue patterns on PostgreSQL from scratch, including this neat example using a CTE, returning
and for update skip locked
to retrieve $1
messages from the messages
table and simultaneously mark them with not_visible_until
set to $2
in order to "lock" them for processing by a client:
with available_messages as (
  select seq
  from messages
  where not_visible_until is null
    or (not_visible_until <= now())
  order by inserted_at
  limit $1
  for update skip locked
)
update messages m
set
  not_visible_until = $2,
  deliver_count = deliver_count + 1,
  last_delivered_at = now(),
  updated_at = now()
from available_messages am
where m.seq = am.seq
returning m.seq, m.data;
The economics of a Postgres free tier (via) Xata offer a hosted PostgreSQL service with a generous free tier (15GB of volume). I'm very suspicious of free tiers that don't include a detailed breakdown of the unit economics... and in this post they've described exactly that, in great detail.
The trick is that they run their free tier on shared clusters - with each $630/month cluster supporting 2,000 free instances for $0.315 per instance per month. Then inactive databases get downgraded to even cheaper auto-scaling clusters that can host 20,000 databases for $180/month (less than 1c each).
They also cover the volume cost of $0.10/GB/month - so up to $1.50/month per free instance, but most instances only use a small portion of that space.
It's reassuring to see this spelled out in so much detail.
Serving a billion web requests with boring code (via) Bill Mill provides a deep retrospective from his work helping build a relaunch of the medicare.gov/plan-compare site.
It's a fascinating case study of the choose boring technology mantra put into action. The "boring" choices here were PostgreSQL, Go and React, all three of which are so widely used and understood at this point that you're very unlikely to stumble into surprises with them.
Key goals for the site were accessibility, in terms of users, devices and performance. Despite best efforts:
The result fell prey after a few years to a common failure mode of react apps, and became quite heavy and loaded somewhat slowly.
I've seen this pattern myself many times over, and I'd love to understand why. React itself isn't a particularly large dependency but somehow it always seems to lead to architectural bloat over time. Maybe that's more of an SPA thing than something that's specific to React.
Loads of other interesting details in here. The ETL details - where brand new read-only RDS databases were spun up every morning after a four hour build process - are particularly notable.
Val Vibes: Semantic search in Val Town. A neat case-study by JP Posma on how Val Town's developers can use Val Town Vals to build prototypes of new features that later make it into Val Town core.
This one explores building out semantic search against Vals using OpenAI embeddings and the PostgreSQL pgvector extension.
How we built JSR (via) Really interesting deep dive by Luca Casonato into the engineering behind the new JSR alternative JavaScript package registry launched recently by Deno.
The backend uses PostgreSQL and a Rust API server hosted on Google Cloud Run.
The frontend uses Fresh, Deno’s own server-side JavaScript framework which leans heavily on the concept of “islands”—a progressive enhancement technique where pages are rendered on the server and small islands of interactivity are added once the page has loaded.
How Figma’s databases team lived to tell the scale (via) The best kind of scaling war story:
"Figma’s database stack has grown almost 100x since 2020. [...] In 2020, we were running a single Postgres database hosted on AWS’s largest physical instance, and by the end of 2022, we had built out a distributed architecture with caching, read replicas, and a dozen vertically partitioned databases."
I like the concept of "colos", their internal name for sharded groups of related tables arranged such that those tables can be queried using joins.
Also smart: separating the migration into "logical sharding" - where queries all still run against a single database, even though they are logically routed as if the database was already sharded - followed by "physical sharding" where the data is actually copied to and served from the new database servers.
Logical sharding was implemented using PostgreSQL views, which can accept both reads and writes:
CREATE VIEW table_shard1 AS SELECT * FROM table
WHERE hash(shard_key) >= min_shard_range AND hash(shard_key) < max_shard_range
The final piece of the puzzle was DBProxy, a custom PostgreSQL query proxy written in Go that can parse the query to an AST and use that to decide which shard the query should be sent to. Impressively it also has a scatter-gather mechanism, so select * from table
can be sent to all shards at once and the results combined back together again.
PGlite (via) PostgreSQL compiled for WebAssembly and turned into a very neat JavaScript library. Previous attempts at running PostgreSQL in WASM have worked by bundling a full Linux virtual machine - PGlite just bundles a compiled PostgreSQL itself, which brings the size down to an impressive 3.7MB gzipped.
I built this interactive demo of PGlite using Observable Framework, source code here.
Announcing DuckDB 0.10.0.
Somewhat buried in this announcement: DuckDB has Fixed-Length Arrays now, along with array_cross_product(a1, a2)
, array_cosine_similarity(a1, a2)
and array_inner_product(a1, a2)
functions.
This means you can now use DuckDB to find related content (and other tricks) using vector embeddings!
Also notable:
DuckDB can now attach MySQL, Postgres, and SQLite databases in addition to databases stored in its own format. This allows data to be read into DuckDB and moved between these systems in a convenient manner, as attached databases are fully functional, appear just as regular tables, and can be updated in a safe, transactional manner.
pgroll (via) “Zero-downtime, reversible, schema migrations for Postgres”
I love this kind of thing. This one has a really interesting design: you define your schema modifications (adding/dropping columns, creating tables etc) using a JSON DSL, then apply them using a Go binary.
When you apply a migration the tool first creates a brand new PostgreSQL schema (effectively a whole new database) which imitates your new schema design using PostgreSQL views. You can then point your applications that have been upgraded to the new schema at it, using the PostgreSQL search_path setting.
Old applications can continue talking to the previous schema design, giving you an opportunity to roll out a zero-downtime deployment of the new code.
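A minimal sketch of how an upgraded application opts into the new virtual schema (the schema name here is illustrative - pgroll generates versioned schema names for each migration):

SET search_path TO public_02_add_description;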
Once your application has upgraded and the physical rows in the database have been transformed to the new schema you can run a --continue command to make the final destructive changes and drop the mechanism that simulates both schema designs at once.
GPT in 500 lines of SQL (via) Utterly brilliant piece of PostgreSQL hackery by Alex Bolenok, who implements a full GPT-2 style language model in SQL on top of pg_vector. The final inference query is 498 lines long!
2023
Database generated columns: GeoDjango & PostGIS. Paolo Melchiorre advocated for the inclusion of generated columns, one of the biggest features in Django 5.0. Here he provides a detailed tutorial showing how they can be used with PostGIS to create database tables that offer columns such as geohash that are automatically calculated from other columns in the table.
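Under the hood this maps onto PostgreSQL's generated columns feature. A rough sketch of the equivalent DDL (table and column names are mine, and I'm assuming PostGIS's ST_GeoHash() for the calculation, which needs a lon/lat SRID 4326 geometry):

ALTER TABLE places
  ADD COLUMN geohash text
  GENERATED ALWAYS AS (ST_GeoHash(location, 12)) STORED;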
Making PostgreSQL tick: New features in pg_cron (via) pg_cron adds cron-style scheduling directly to PostgreSQL. It's a pretty mature extension at this point, and recently gained the ability to schedule repeating tasks at intervals as low as every 1s.
The examples in this post are really informative. I like this example, which cleans up the ever-growing cron.job_run_details table by using pg_cron itself to run the cleanup:
SELECT cron.schedule('delete-job-run-details', '0 12 * * *', $$DELETE FROM cron.job_run_details WHERE end_time < now() - interval '3 days'$$);
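The new sub-minute scheduling takes an interval string in place of a cron expression - something like this (my example, and the flush_queue() function is hypothetical):

SELECT cron.schedule('flush-queue', '5 seconds', 'SELECT flush_queue()');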
pg_cron can be used to schedule functions written in PL/pgSQL, which is a great example of the kind of DSL that I used to avoid but I'm now much happier to work with because I know GPT-4 can write basic examples for me and help me understand exactly what unfamiliar code is doing.
Upsert in SQL (via) Anton Zhiyanov is currently on a one-man quest to write detailed documentation for all of the fundamental SQL operations, comparing and contrasting how they work across multiple engines, generally with interactive examples.
Useful tips in here on why “insert... on conflict” is usually a better option than “insert or replace into” because the latter can perform a delete and then an insert, firing triggers that you may not have wanted to be fired.
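A minimal illustration of the difference, using SQLite syntax and an imaginary settings table with a unique key column:

-- INSERT OR REPLACE deletes the conflicting row and inserts a fresh
-- one, so delete triggers can fire and any columns you didn't supply
-- are reset to their defaults
INSERT OR REPLACE INTO settings (key, value) VALUES ('theme', 'dark');

-- INSERT ... ON CONFLICT updates the existing row in place instead
INSERT INTO settings (key, value) VALUES ('theme', 'dark')
ON CONFLICT (key) DO UPDATE SET value = excluded.value;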
PostgreSQL Lock Conflicts (via) I absolutely love how extremely specific and niche this documentation site is. It details every single lock that PostgreSQL implements, and shows exactly which commands acquire that lock. That’s everything. I can imagine this becoming absurdly useful at extremely infrequent intervals for advanced PostgreSQL work.
Writing a chat application in Django 4.2 using async StreamingHttpResponse, Server-Sent Events and PostgreSQL LISTEN/NOTIFY (via) Excellent tutorial by Víðir Valberg Guðmundsson on implementing chat with server-sent events using the newly async-capable StreamingHttpResponse from Django 4.2x.
He uses PostgreSQL's LISTEN/NOTIFY mechanism which can be used asynchronously in psycopg3—at the cost of a separate connection per user of the chat.
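The underlying PostgreSQL primitives are pleasantly small - the channel name and payload here are just illustrative:

-- each chat client's dedicated connection listens on a channel
LISTEN chat_messages;
-- posting a message notifies every listener on that channel
SELECT pg_notify('chat_messages', '{"user": "alice", "text": "hi"}');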
The article also covers how to use the Last-Event-ID header to implement reconnections in server-sent events, transmitting any events that may have been missed during the time that the connection was dropped.
Big Opportunities in Small Data
I gave an invited keynote at Citus Con 2023, the PostgreSQL conference. Below is the abstract, video, slides and links from the presentation.
[... 385 words]
Quicker serverless Postgres connections. Neon provide “serverless PostgreSQL”—autoscaling, managed PostgreSQL optimized for use with serverless hosting environments. A neat capability they provide is the ability to connect to a PostgreSQL server via WebSockets, which means their database can be used from environments such as Cloudflare workers which don’t have the ability to use a standard TCP database connection. This article describes some clever tricks they used to make establishing new connections via WebSockets more efficient, using the least possible number of network round-trips.
2022
Scaling Mastodon: The Compendium (via) Hazel Weakly’s collection of notes on scaling Mastodon, covering PostgreSQL, Sidekiq, Redis, object storage and more.
Querying Postgres Tables Directly From DuckDB (via) I learned a lot of interesting PostgreSQL tricks from this write-up of the new DuckDB feature that allows it to run queries against PostgreSQL servers directly. It works using COPY (SELECT ...) TO STDOUT (FORMAT binary) which writes rows to the protocol stream in efficient binary format, but splits the table being read into parallel fetches against page ranges and uses SET TRANSACTION SNAPSHOT ... in those parallel queries to ensure they see the same transactional snapshot of the database.
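Roughly speaking, each parallel worker ends up issuing statements along these lines (my sketch - the snapshot ID is made up, and the real queries also add a ctid page-range filter so that workers read disjoint chunks of the table):

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
COPY (SELECT * FROM items) TO STDOUT (FORMAT binary);
COMMIT;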
Crunchy Data: Learn Postgres at the Playground (via) Crunchy Data have a new PostgreSQL tutorial series, with a very cool twist: they have a build of PostgreSQL compiled to WebAssembly which runs in the browser, so each tutorial is accompanied by a working psql terminal that lets you try out the tutorial contents interactively. It even has support for PostGIS, though that particular tutorial has to load 80MB of assets in order to get it to work!
Lesser Known Features of ClickHouse (via) I keep hearing positive noises about ClickHouse. I learned about a whole bunch of capabilities from this article—including that ClickHouse can directly query tables that are stored in SQLite or PostgreSQL.
Glue code to quickly copy data from one Postgres table to another (via) The Python script that Retool used to migrate 4TB of data between two PostgreSQL databases. I find the structure of this script really interesting—it uses Python to spin up a queue full of ID ranges to be transferred and then starts some threads, but then each thread shells out to a command that runs “psql COPY (SELECT ...) TO STDOUT” and pipes the result to “psql COPY xxx FROM STDIN”. Clearly this works really well (“saturate the database’s hardware capacity” according to a comment on HN), and neatly sidesteps any issues with Python’s GIL.
Postgres Auditing in 150 lines of SQL (via) I’ve run up against the problem of tracking changes made to rows within a database table so many times, and I still don’t have a preferred solution. This approach to it looks very neat: it uses PostgreSQL triggers to populate a single audit table (as opposed to one audit table per tracked table) and records the previous and current column values for the row using jsonb.
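This is the general shape of the single-audit-table approach - a rough sketch of my own, not the article's 150 lines (the table, function and trigger names are all mine):

create table audit_log (
  id bigserial primary key,
  table_name text not null,
  operation text not null,
  old_row jsonb,
  new_row jsonb,
  changed_at timestamptz not null default now()
);

create function record_audit() returns trigger language plpgsql as $$
begin
  if TG_OP = 'INSERT' then
    insert into audit_log (table_name, operation, new_row)
    values (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
    return NEW;
  elsif TG_OP = 'UPDATE' then
    insert into audit_log (table_name, operation, old_row, new_row)
    values (TG_TABLE_NAME, TG_OP, to_jsonb(OLD), to_jsonb(NEW));
    return NEW;
  else  -- DELETE
    insert into audit_log (table_name, operation, old_row)
    values (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
    return OLD;
  end if;
end;
$$;

-- attach the same function to any table you want to track
create trigger tags_audit
  after insert or update or delete on tags
  for each row execute function record_audit();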