Simon Willison’s Weblog

4 items tagged “haki-benita”

2024

How to Get or Create in PostgreSQL (via) Get or create - for example to retrieve an existing tag record from a database table if it already exists or insert it if it doesn’t - is a surprisingly difficult operation.

Haki Benita uses it to illustrate a variety of interesting PostgreSQL concepts.

New to me: a pattern that runs INSERT INTO tags (name) VALUES (tag_name) RETURNING *; and then catches the constraint violation and returns a record instead has a disadvantage at scale: “The table contains a dead tuple for every attempt to insert a tag that already existed” - so until vacuum runs you can end up with significant table bloat!

Haki’s conclusion is that the best solution relies on a feature coming in PostgreSQL 17: the ability to combine the MERGE operation with a RETURNING clause:

WITH new_tags AS (
    MERGE INTO tags
    USING (VALUES ('B'), ('C')) AS t(name)
    ON tags.name = t.name
    WHEN NOT MATCHED THEN
        INSERT (name) VALUES (t.name)
    RETURNING *
)
SELECT * FROM tags WHERE name IN ('B', 'C')
    UNION ALL
SELECT * FROM new_tags;

I wonder what the best pattern for this in SQLite is. Could it be as simple as this?

INSERT OR IGNORE INTO tags (name) VALUES ('B'), ('C');

The SQLite INSERT documentation doesn't currently provide extensive details for INSERT OR IGNORE, but there are some hints in this forum thread. This post by Rob Hoelz points out that INSERT OR IGNORE will silently ignore any constraint violation (a NOT NULL or CHECK failure as well as a duplicate key), so INSERT INTO tags (name) VALUES ('C'), ('D') ON CONFLICT(name) DO NOTHING may be a better option, since it only swallows conflicts on the named column.
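Here is a minimal sketch of how the full get-or-create round trip might look from Python's sqlite3 module. The table schema (a UNIQUE constraint on name) and the get_or_create_tags helper are assumptions made for illustration, not something taken from the linked posts, and the ON CONFLICT clause needs SQLite 3.24 or later.

```python
import sqlite3


def get_or_create_tags(conn, names):
    """Insert any missing tag names, then return {name: id} for all of them.

    Hypothetical helper: assumes a `tags` table with a UNIQUE `name` column.
    """
    # Rows whose name already exists are skipped; other constraint
    # violations (e.g. NOT NULL) still raise an error.
    conn.executemany(
        "INSERT INTO tags (name) VALUES (?) ON CONFLICT(name) DO NOTHING",
        [(name,) for name in names],
    )
    # Fetch both the pre-existing and the freshly inserted rows.
    placeholders = ", ".join("?" for _ in names)
    rows = conn.execute(
        f"SELECT name, id FROM tags WHERE name IN ({placeholders})",
        list(names),
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)
conn.executemany("INSERT INTO tags (name) VALUES (?)", [("A",), ("B",)])

# 'B' already exists and is left alone; 'C' is created.
tag_ids = get_or_create_tags(conn, ["B", "C"])
```

Unlike the PostgreSQL MERGE version this takes two statements, so it would need to run inside a single transaction if other writers might delete tags in between.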

# 5th August 2024, 3:15 pm / postgresql, sql, sqlite, haki-benita

Fastest Way to Read Excel in Python (via) Haki Benita produced a meticulously researched and written exploration of the options for reading a large Excel spreadsheet into Python. He explored Pandas, Tablib, Openpyxl, shelling out to LibreOffice, DuckDB and python-calamine (a Python wrapper of a Rust library). Calamine was the winner, taking 3.58s to read 500,000 rows, compared to Pandas in last place at 32.98s.

# 3rd January 2024, 8:04 pm / excel, pandas, python, rust, duckdb, haki-benita

2021

Practical SQL for Data Analysis (via) This is a really great SQL tutorial: it starts with the basics, but quickly moves on to a whole array of advanced PostgreSQL techniques - CTEs, window functions, efficient sampling, rollups, pivot tables and even linear regressions executed directly in the database using regr_slope(), regr_intercept() and regr_r2(). I picked up a whole bunch of tips for things I didn't know you could do with PostgreSQL here.
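As a rough illustration of what those three regression aggregates compute, here is a plain-Python sketch of an ordinary least squares fit over (x, y) pairs. The linear_fit helper and the sample data are made up for this example. In PostgreSQL the equivalent calls take the dependent variable first, as in regr_slope(y, x).

```python
def linear_fit(xs, ys):
    """Return (slope, intercept, r_squared) for a least squares line y = slope*x + intercept.

    Hypothetical helper mirroring regr_slope, regr_intercept and regr_r2.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Square of the correlation coefficient between x and y.
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


# Sample data lying exactly on the line y = 2x + 1.
slope, intercept, r_squared = linear_fit([1, 2, 3, 4], [3, 5, 7, 9])
```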

# 4th May 2021, 3:11 am / postgresql, sql, haki-benita

2018

Optimizing Django Admin Paginator. The Django admin paginator uses a count(*) to calculate the total number of rows, so it knows how many pages to display. This makes it unpleasantly slow over large datasets. Haki Benita has an ingenious solution: drop in a custom paginator which uses the PostgreSQL “SET LOCAL statement_timeout TO 200” statement first (a 200ms limit that applies to the current transaction), then if a timeout error is raised returns 9999999999 as the count instead. This means small tables get accurate page counts and giant tables still display in the admin within a reasonable time period.
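The control flow of that trick (try the count under a time budget, fall back to a huge sentinel on timeout) can be sketched without Django or PostgreSQL. The example below is a stand-in, not Haki's paginator: SQLite has no statement_timeout, so it uses sqlite3's set_progress_handler to abort a slow statement, and the count_or_fallback helper and items table are hypothetical. A real Django version would presumably put the same try/except inside a Paginator subclass's count property.

```python
import sqlite3
import time

FALLBACK_COUNT = 9999999999  # the sentinel count mentioned in the post


def count_or_fallback(conn, count_sql, timeout_seconds):
    """Run a counting query, returning FALLBACK_COUNT if it exceeds the budget."""
    deadline = time.monotonic() + timeout_seconds

    def over_budget():
        # A truthy return value tells SQLite to abort the running statement.
        return time.monotonic() >= deadline

    # Ask SQLite to call over_budget every 100 virtual machine instructions.
    conn.set_progress_handler(over_budget, 100)
    try:
        return conn.execute(count_sql).fetchone()[0]
    except sqlite3.DatabaseError:
        # The aborted statement surfaces as a database error. Note this
        # broad catch would also hide unrelated errors such as bad SQL.
        return FALLBACK_COUNT
    finally:
        conn.set_progress_handler(None, 100)  # remove the handler again


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany("INSERT INTO items (label) VALUES (?)", [("x",)] * 10000)

# The WHERE clause forces a row-by-row scan so the handler gets invoked.
count_sql = "SELECT count(*) FROM items WHERE id >= 0"
accurate = count_or_fallback(conn, count_sql, timeout_seconds=5.0)
timed_out = count_or_fallback(conn, count_sql, timeout_seconds=0.0)
```

With a generous budget the real count comes back. With a zero budget the statement is aborted on the first handler call and the sentinel is returned instead.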

# 6th November 2018, 6:17 pm / django, postgresql, haki-benita