Our Error Logging Service Went Down So We Had No Idea Everything Else Was Down Too
Error logging silently went down. So did everything else. We found out when the CEO called because the website was white.
I Ran a Load Test Against Production Thinking It Was Staging and the Site Handled It Better Than Expected
Load tested production by accident. 47,000 concurrent users. Zero errors. Staging crashes at 47 users. We use staging less now.
Our Application Depends on a Package Maintained by a Single Developer in Belarus Who Has Gone Missing
The Belarusian dev maintains a package that handles auth for 47,000 apps. Last commit: 2018. The internet runs on faith.
The Entire Backend Was a Set of Excel Macros That a Finance Intern Wrote in 2014
Excel macros from a 2014 intern run production. The intern is now a VP at Google. The macros still work. Nobody touches them.
We Had Two Production Databases and Nobody Knew Which One Was Real
Two production databases. Different data. We flipped a coin during deploys. The coin was a d20 with 14 faces labeled staging.
The Cron Job That Was Supposed to Run Every Hour Has Been Running Every Second Since 2019
The cron job ran every second for 5 years. It sent 157 million reminder emails. Zero users clicked anything.
I Fixed a Typo in a Comment and the Entire Build Pipeline Broke Because CI Parses Comments as Config
Fixed a typo in a comment. The build broke. CI parses comments for config. The typo was the actual deployment instruction.
The Production SSH Key Was in a Public GitHub Repo for 4 Years and Nobody Noticed
Production SSH key in a public repo for 4 years. 4,000 clones. Zero malicious logins. Hackers probably felt bad for us.
We Migrated From MySQL to Postgres to SQLite to MongoDB to a Google Sheet in 18 Months
MySQL to Postgres to SQLite to Mongo to Sheets. Four migrations in 18 months. We ended up back where we started.
Our Entire Monitoring Stack Crashed Because It Generated Too Many Alerts About Itself
Monitoring generated 47,000 alerts about its own health. We missed the alert about the actual database being on fire.