Simon Willison on clickhouse

6 posts tagged “clickhouse”

2025

ClickHouse gets lazier (and faster): Introducing lazy materialization (via) Tom Schreiber describe's the latest optimization in ClickHouse, and in the process explores a whole bunch of interesting characteristics of columnar datastores generally.

As I understand it, the new "lazy materialization" feature means that if you run a query like this:

select id, big_col1, big_col2
from big_table order by rand() limit 5

Those big_col1 and big_col2 columns won't be read from disk for every record, just for the five that are returned. This can dramatically improve the performance of queries against huge tables - for one example query ClickHouse report a drop from "219 seconds to just 139 milliseconds—with 40× less data read and 300× lower memory usage."

I'm linking to this mainly because the article itself is such a detailed discussion of columnar data patterns in general. It caused me to update my intuition for how queries against large tables can work on modern hardware. This query for example:

SELECT helpful_votes
FROM amazon.amazon_reviews
ORDER BY helpful_votes DESC
LIMIT 3;

Can run in 70ms against a 150 million row, 70GB table - because in a columnar database you only need to read that helpful_votes integer column which adds up to just 600MB of data, and sorting 150 million integers on a decent machine takes no time at all.

# 22nd April 2025, 5:05 pm / databases, sql, clickhouse

2024

GitHub Public repo history tool (via) I built this Observable Notebook to run queries against the GH Archive (via ClickHouse) to try to answer questions about repository history—in particular, were they ever made public as opposed to private in the past.

It works by combining together PublicEvent event (moments when a private repo was made public) with the most recent PushEvent event for each of a user’s repositories.

# 20th March 2024, 9:56 pm / github, projects, observable, clickhouse

2023

chDB (via) This is a really interesting development: chDB offers “an embedded SQL OLAP Engine” as a Python package, which you can install using “pip install chdb”. What you’re actually getting is a wrapper around ClickHouse—it’s almost like ClickHouse has been repackaged into an embedded database similar to SQLite.

# 24th October 2023, 11:04 pm / python, clickhouse

2022

Lesser Known Features of ClickHouse (via) I keep hearing positive noises about ClickHouse. I learned about a whole bunch of capabilities from this article—including that ClickHouse can directly query tables that are stored in SQLite or PostgreSQL.

# 31st May 2022, 7:48 pm / postgresql, sqlite, clickhouse

2021

Clickhouse on Cloud Run (via) Alex Reid figured out how to run Clickhouse against read-only baked data on Cloud Run last year, and wrote up some comprehensive notes.

# 29th July 2021, 6:07 am / cloudrun, baked-data, clickhouse

Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask) (via) ClickHouse by Yandex is an open source column-oriented data warehouse, designed to run analytical queries against TBs of data. They've loaded the full GitHub Archive of events since 2011 into a public instance, which is a great way of both exploring GitHub activity and trying out ClickHouse. Here's a query I just ran that shows number of watch events per year, for example:

SELECT toYear(created_at) as yyyy, count()
FROM github_events
WHERE event_type = 'WatchEvent' group by yyyy

# 5th January 2021, 1:02 am / analytics, github, sql, big-data, clickhouse

Simon Willison’s Weblog