SharkintoshBlog

Our Error Logging Service Went Down So We Had No Idea Everything Else Was Down Too

Error logging silently went down. So did everything else. We found out when the CEO called because the website was white.

I Ran a Load Test Against Production Thinking It Was Staging and the Site Handled It Better Than Expected

Load tested production by accident. 47,000 concurrent users. Zero errors. Staging crashes at 47 users. We use staging less now.

Our Application Depends on a Package Maintained by a Single Developer in Belarus Who Has Gone Missing

The Belarusian dev maintains a package that handles auth for 47,000 apps. Last commit: 2018. The internet runs on faith.

The Entire Backend Was a Set of Excel Macros That a Finance Intern Wrote in 2014

Excel macros from a 2014 intern run production. The intern is now a VP at Google. The macros still work. Nobody touches them.

We Had Two Production Databases and Nobody Knew Which One Was Real

Two production databases. Different data. We flipped a coin during deploys. The coin was a d20 with 14 faces labeled staging.

The Cron Job That Was Supposed to Run Every Hour Has Been Running Every Second Since 2019

The cron job ran every second for 5 years. It sent 157 million reminder emails. Zero users clicked anything.

I Fixed a Typo in a Comment and the Entire Build Pipeline Broke Because CI Parses Comments as Config

Fixed a typo in a comment. The build broke. CI parses comments for config. The typo was the actual deployment instruction.

The Production SSH Key Was in a Public GitHub Repo for 4 Years and Nobody Noticed

Production SSH key in a public repo since 2022. 4,000 clones. Zero malicious logins. Hackers probably felt bad for us.

We Migrated From MySQL to Postgres to SQLite to MongoDB to a Google Sheet in 18 Months

MySQL to Postgres to SQLite to Mongo to Sheets. Each migration took 6 months. We ended up back where we started.

Our Entire Monitoring Stack Crashed Because It Generated Too Many Alerts About Itself

Monitoring generated 47,000 alerts about its own health. We missed the alert about the actual database being on fire.