Simon Willison on ops

19 posts tagged “ops”

2025

GitHub Issues search now supports nested queries and boolean operators: Here’s how we (re)built it. GitHub Issues got a significant search upgrade back in January. Deborah Digges provides some behind-the-scenes details about how it works and how they rolled it out.

The signature new feature is complex boolean logic: you can now search for things like is:issue state:open author:rileybroughten (type:Bug OR type:Epic), up to five levels of nesting deep.

Queries are parsed into an AST using the Ruby parslet PEG grammar library. The AST is then compiled into a nested Elasticsearch bool JSON query.
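As a rough illustration of that parse-then-compile pipeline, here is a minimal Python sketch. It is not GitHub's implementation (theirs uses the Ruby parslet library); the hand-written recursive descent parser, the tuple-based AST and the Elasticsearch field names are all assumptions made for the example.

```python
import re

# Tokens are parentheses or runs of non-space, non-parenthesis characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def tokenize(query):
    return TOKEN.findall(query)


def parse(tokens):
    """Build a tuple AST: ("and"|"or", [children]) or ("term", key, value)."""
    pos = 0

    def parse_or():
        nonlocal pos
        nodes = [parse_and()]
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            nodes.append(parse_and())
        return nodes[0] if len(nodes) == 1 else ("or", nodes)

    def parse_and():
        nonlocal pos
        nodes = [parse_atom()]
        # Adjacent terms are an implicit AND; an explicit AND is also accepted.
        while pos < len(tokens) and tokens[pos] not in ("OR", ")"):
            if tokens[pos] == "AND":
                pos += 1
            nodes.append(parse_atom())
        return nodes[0] if len(nodes) == 1 else ("and", nodes)

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        key, sep, value = tok.partition(":")
        if not sep:
            raise ValueError(f"expected key:value, got {tok!r}")
        return ("term", key, value)

    ast = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return ast


def to_es(node):
    """Compile the AST into a nested Elasticsearch-style bool query dict."""
    if node[0] == "term":
        _, key, value = node
        return {"term": {key: value}}  # field names here are hypothetical
    children = [to_es(child) for child in node[1]]
    if node[0] == "and":
        return {"bool": {"must": children}}
    return {"bool": {"should": children, "minimum_should_match": 1}}


query = "is:issue state:open author:rileybroughten (type:Bug OR type:Epic)"
es_query = to_es(parse(tokenize(query)))
```

In this sketch AND groups become `must` clauses and OR groups become `should` clauses with `minimum_should_match` set to 1, which is one common way to express boolean logic in an Elasticsearch bool query.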

GitHub Issues search deals with around 2,000 queries a second, so robust testing is extremely important! The team rolled it out invisibly to 1% of live traffic, running the new implementation via a queue and comparing the number of results returned to try and spot any degradations compared to the old production code.
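The shape of that kind of shadow rollout can be sketched in a few lines of Python. This is an assumption-laden illustration, not GitHub's code: `make_shadow_search`, `old_search` and `new_search` are hypothetical names, and the new implementation runs inline here rather than via a queue.

```python
import random


def make_shadow_search(old_search, new_search, sample_rate=0.01, rng=random.random):
    """Wrap two search functions so the new one is exercised on a sample of traffic."""
    mismatches = []

    def search(query):
        # Users always receive the old implementation's results.
        results = old_search(query)
        if rng() < sample_rate:
            try:
                shadow = new_search(query)
                # Only the result counts are compared, never shown to the user.
                if len(shadow) != len(results):
                    mismatches.append((query, len(results), len(shadow)))
            except Exception as exc:
                # A crash in the new code is recorded rather than surfaced.
                mismatches.append((query, len(results), repr(exc)))
        return results

    return search, mismatches
```

The important property is that a bug in the new code path can only ever add an entry to `mismatches`; it cannot change what the user sees.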

# 26th May 2025, 7:23 am / elasticsearch, github, ops, parsing, ruby, scaling, search, github-issues

I designed Dropbox's storage system and modeled its durability. Durability numbers (11 9's etc) are meaningless because competent providers don't lose data because of disk failures, they lose data because of bugs and operator error. [...]

The best thing you can do for your own durability is to choose a competent provider and then ensure you don't accidentally delete or corrupt [your] own data on it:

Ideally never mutate an object in S3, add a new version instead.

Never live-delete any data. Mark it for deletion and then use a lifecycle policy to clean it up after a week.

This way you have time to react to a bug in your own stack.

— James Cowling
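The two rules in that quote (append versions instead of mutating, mark for deletion instead of deleting) can be modelled with a small in-memory Python sketch. This is not S3 or any AWS API; `VersionedStore` and its methods are hypothetical, and the seven-day retention simply mirrors the "after a week" in the quote.

```python
from datetime import datetime, timedelta, timezone


class VersionedStore:
    """Toy object store: writes append versions, deletes are only marked."""

    def __init__(self, retention=timedelta(days=7)):
        self.retention = retention
        self.versions = {}    # key -> list of payloads, oldest first
        self.deleted_at = {}  # key -> time the key was marked for deletion

    def put(self, key, data):
        # Never overwrite: every write becomes a new version.
        self.versions.setdefault(key, []).append(data)
        # Design choice of this sketch: a new write cancels a pending delete.
        self.deleted_at.pop(key, None)

    def get(self, key):
        if key in self.deleted_at or key not in self.versions:
            return None
        return self.versions[key][-1]

    def mark_deleted(self, key, now):
        if key not in self.versions:
            raise KeyError(key)
        self.deleted_at[key] = now

    def restore(self, key):
        # Undo a mistaken delete while the data still exists.
        self.deleted_at.pop(key, None)

    def purge(self, now):
        # Plays the role of the lifecycle policy: remove keys marked long enough ago.
        expired = [k for k, t in self.deleted_at.items() if now - t >= self.retention]
        for key in expired:
            del self.versions[key]
            del self.deleted_at[key]
        return expired


store = VersionedStore()
t0 = datetime(2025, 5, 14, tzinfo=timezone.utc)
store.put("report.csv", b"v1")
store.put("report.csv", b"v2")
store.mark_deleted("report.csv", now=t0)
store.purge(now=t0 + timedelta(days=6))  # too early, nothing is removed
store.restore("report.csv")              # the delete was a bug, so undo it
```

The window between `mark_deleted` and `purge` is the time to react that the quote describes.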

# 14th May 2025, 3:49 am / ops, s3, software-architecture

2024

Making Machines Move. Another deep technical dive into Fly.io infrastructure from Thomas Ptacek, this time describing how they can quickly boot up an instance with a persistent volume on a new host (for things like zero-downtime deploys) using a block-level cloning operation. The new instance gets a volume that is accessible instantly and serves proxied blocks of data until the volume has been completely migrated from the old host.

# 30th July 2024, 9:45 pm / ops, thomas-ptacek, zero-downtime, fly

2023

Upgrading GitHub.com to MySQL 8.0 (via) I love a good zero-downtime upgrade story, and this is a fine example of the genre. GitHub spent a year upgrading MySQL from 5.7 to 8 across 1200+ hosts, covering 300+ TB that was serving 5.5 million queries per second. The key technique was extremely carefully managed replication, plus tricks like leaving enough 5.7 replicas available to handle a rollback should one be needed.

# 10th December 2023, 8:36 pm / github, mysql, ops, replication, zero-downtime

One of my fav early Stripe rules was from incident response comms: do not publicly blame an upstream provider. We chose the provider, so own the results—and use any pain from that as extra motivation to invest in redundant services, go direct to the source, etc.

— Michael Schade

# 5th November 2023, 10:53 pm / ops, stripe

Database Migrations. Vadim Kravcenko provides a useful, in-depth description of the less obvious challenges of applying database migrations successfully. Vadim uses and likes Django’s migrations (as do I) but notes that running them at scale still involves a number of thorny challenges.

The biggest of these, which I’ve encountered myself multiple times, is that if you want truly zero downtime deploys you can’t guarantee that your schema migrations will be deployed at the exact same instant as changes you make to your application code.

This means all migrations need to be forward-compatible: you need to apply a schema change in a way that your existing code will continue to work error-free, then ship the related code change as a separate operation.

Vadim describes what this looks like in detail for a number of common operations: adding a field, removing a field and changing a field that has associated business logic implications. He also discusses the importance of knowing when to deploy a dual-write strategy.
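A minimal sketch of the "adding a field" case, using Python's built-in sqlite3 module rather than Django migrations. The `users` table, the `email` column and the two insert functions are assumptions invented for the example; the point is only that the schema change lands first and the old code keeps working against it.

```python
import sqlite3


def old_code_insert(conn, name):
    # The currently deployed code knows nothing about the new column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_code_insert(conn, name, email):
    # Code shipped later, as a separate deploy, starts using the column.
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
old_code_insert(conn, "alice")

# Step 1: schema change only. The column is nullable, so old INSERTs still succeed.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
old_code_insert(conn, "bob")  # old code is still serving traffic at this point

# Step 2: the application change goes out once the schema is in place.
new_code_insert(conn, "carol", "carol@example.com")

rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
```

Had the new column been declared NOT NULL without a default, the old insert in step 1 would have started failing, which is exactly the kind of window a forward-compatible migration avoids.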

# 1st October 2023, 11:55 pm / databases, django, migrations, ops, zero-downtime

Shamir Secret Sharing (via) Cracking war story from Max Levchin about the early years of PayPal, in which he introduces an implementation of Shamir Secret Sharing to encrypt their master payment credential table... and then finds that the 3-of-8 passwords needed to decrypt it and bring the site back online don’t appear to work.
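For readers unfamiliar with the scheme, here is a small Python sketch of 3-of-8 Shamir Secret Sharing over a prime field. It is purely illustrative and has nothing to do with PayPal's implementation: the prime, the function names and the use of integer shares rather than passwords are all assumptions, and it should not be used to protect real secrets.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime, large enough for a small integer secret


def split_secret(secret, threshold, shares):
    """Return `shares` points on a random polynomial whose constant term is the secret."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= threshold <= shares:
        raise ValueError("threshold must be between 1 and the number of shares")
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        # Horner's rule, highest-degree coefficient first.
        acc = 0
        for coeff in reversed(coeffs):
            acc = (acc * x + coeff) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points):
    """Lagrange-interpolate the polynomial at x = 0 from at least `threshold` points."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


all_shares = split_secret(123456789, threshold=3, shares=8)
recovered = recover_secret([all_shares[1], all_shares[4], all_shares[7]])
```

Any three of the eight shares reconstruct the secret, while fewer than three reveal nothing about it.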

# 11th August 2023, 3:48 pm / encryption, ops, paypal

MRSK. A new open source web application deployment tool from 37signals, developed to help migrate their Hey webmail app out of the cloud and onto their own managed hardware. The key feature is one that I care about deeply: it enables zero-downtime deploys by running all traffic through a Traefik reverse proxy in a way that allows requests to be paused while a new deployment is going out, so end users get a few seconds of delay on their HTTP requests before being served by the replaced application.

# 29th April 2023, 11:54 pm / 37-signals, deployment, ops, zero-downtime, traefik

2022

Over the years, across multiple deployments, DynamoDB has learned that it’s not just the end state and the start state that matter; there could be times when the newly deployed software doesn’t work and needs a rollback. The rolled-back state might be different from the initial state of the software. The rollback procedure is often missed in testing and can lead to customer impact. DynamoDB runs a suite of upgrade and downgrade tests at a component level before every deployment. Then, the software is rolled back on purpose and tested by running functional tests. DynamoDB has found this process valuable for catching issues that otherwise would make it hard to rollback if needed.

— Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service

# 5th September 2022, 6:49 pm / aws, ops

Roblox Return to Service 10/28-10/31 2021 (via) A particularly good example of a public postmortem on an outage. Roblox was down for 72 hours last year, as a result of an extremely complex set of circumstances which took a lot of effort to uncover. It’s interesting to think through what kind of monitoring you would need to have in place to help identify the root cause of this kind of issue.

# 21st January 2022, 4:41 pm / ops, observability, postmortem

2019

Operations engineering does not consist of firefighting your shitty software, it is the science of delivering value to users.

— Charity Majors

# 14th February 2019, 1:12 am / ops, charity-majors

2010

Setting up Munin on Ubuntu. Useful guide to setting up my favourite graphing/monitoring tool for personal projects.

# 1st September 2010, 2:05 pm / ops, sysadmin, ubuntu, recovered, munin

Zero-downtime Redis upgrade discussion. GitHub have a short window of scheduled downtime in order to upgrade their Redis server. I asked in their comments if they’d considered trying to run the upgrade with no downtime at all using Redis replication, and Ryan Tomayko has posted some interesting replies.

# 28th May 2010, 2:50 pm / github, ops, redis, ryan-tomayko, upgrades, recovered, zero-downtime

Linux performance basics. This kind of Linux knowledge is rapidly becoming a key skill for server-side web development.

# 24th January 2010, 1:50 pm / jonathan-ellis, linux, ops, performance, sysadmin

2009

Round-robin Django setup with nginx. An nginx trick I didn’t know: a low proxy_connect_timeout value (e.g. 2 seconds) combined with the proxy_next_upstream setting means that if one of your backends breaks a user won’t even see an error, they’ll just have a short delay before getting a response from a working server.
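The failover behaviour described there can be modelled with a short Python sketch. This simulates the idea only and is not how nginx is implemented: `proxy_request` and the `connect` callable are hypothetical, and real nginx also rotates through backends rather than always starting from the first.

```python
def proxy_request(backends, connect, connect_timeout=2.0):
    """Try each backend in turn, moving on when a connection fails or times out.

    `connect(backend, timeout)` is a caller-supplied function that returns a
    response or raises OSError (TimeoutError and ConnectionRefusedError are
    both subclasses of OSError).
    """
    last_error = None
    for backend in backends:
        try:
            return connect(backend, connect_timeout)
        except OSError as exc:
            # The user never sees this error; they only wait out the short timeout.
            last_error = exc
    raise RuntimeError("all backends failed") from last_error
```

With a two second connect timeout, a single broken backend costs the user at most about two seconds before a working server answers.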

# 21st December 2009, 3:43 pm / django, load-balancing, nginx, ops, sysadmin

Announcing Kong: A server description and deployment testing tool. An ultra simple website monitoring tool written in Django which makes it easy to manage a list of Twill scripts for testing different sites. It was developed at the Lawrence Journal-World. Eric showed me a demo of this a year or so ago and I’ve been hoping they would open source it.

# 18th November 2009, 12:47 pm / django, eric-holscher, kong, monitoring, open-source, ops

Using Graphics Card Memory as Swap (via) Interesting idea: “Graphic cards contain a lot of very fast RAM, typically between 64 and 512 MB. With Linux, it’s possible to use it as swap space, or even as RAM disk.”

# 3rd November 2009, 11:01 am / graphicscards, linux, memory, ops, performance, ram, sysadmin

I loathe [hardware load balancers]. They’re expensive, restrictive, slow, and generally cause you a lot more pain and suffering than they’re worth. At my last job, one of my projects was to convert most of one of our existing clusters from a load-balancing appliance to use keepalived. Why would we do this? Because the $100k worth of appliance wasn’t capable of doing the job that $15k worth of commodity hardware and an installation of keepalived were handling with ease.

— Matt Palmer

# 3rd November 2009, 10:45 am / keepalived, load-balancing, matt-palmer, ops, sysadmin

2008

We’re all ops people now. Edd’s experience reflects my own: the kind of systems I’m building these days involve way more than just development, they often involve significant sysadmin type skills as well. Desperately need to get better at that stuff.

# 20th June 2008, 9:02 pm / edddumbill, ops, sysadmin

Simon Willison’s Weblog