2 items tagged “postmortem”
2022
Roblox Return to Service 10/28-10/31 2021 (via) A particularly good example of a public postmortem on an outage. Roblox was down for 72 hours last year, as a result of an extremely complex set of circumstances which took a lot of effort to uncover. It’s interesting to think through what kind of monitoring you would need to have in place to help identify the root cause of this kind of issue.
2018
October 21 post-incident analysis (via) Legitimately fascinating post-mortem by GitHub. They run database masters in multiple data centers with raft for leader election... but when they had an unexpected network split between east and west coast they ended up with several seconds of write that had not been correctly replicated. Cleaning up the resulting mess took the best part of 24 hours! Distributed systems are hard.