FAILURE MODES
The deploy that deleted production
We didn't mean to. Nobody ever does.
The Context
Database migration. New column. Backfill script. Standard stuff. We'd done it a hundred times.
The Mistake
The backfill script had a WHERE clause. The WHERE clause had a typo. Instead of updating rows where `status = 'pending'`, it updated rows where `status = 'pending' OR 1=1`. Every row. Every table that FK'd to it. Cascade delete was on.
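Why the extra `OR 1=1` matters: it is always true, so the filter stops filtering. A minimal sketch, assuming an in-memory SQLite database as a stand-in for the real one and a hypothetical `orders` table (neither is from the incident):

```python
import sqlite3

# In-memory SQLite stands in for the production database (assumption).
conn = sqlite3.connect(":memory:")
# Hypothetical table; only the `status` column mirrors the story above.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("pending",), ("shipped",), ("cancelled",), ("pending",)],
)

# The predicate we meant to write: only pending rows match.
intended = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'pending'"
).fetchone()[0]

# The predicate that shipped: OR 1=1 is true for every row.
actual = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'pending' OR 1=1"
).fetchone()[0]

total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print(f"intended match: {intended} rows")
print(f"actual match:   {actual} rows (table has {total})")
```

Running the predicate as a `SELECT COUNT(*)` before running it as a write is cheap. If the count equals the table size, stop.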
The Recovery
Point-in-time recovery. 4 hours of data loss. 6 hours of downtime. One very long post-mortem.
The Changes
1. No more cascade deletes. Ever.
2. All migrations run in transactions with explicit rollback plans.
3. Backfill scripts run against a replica first.
4. WHERE clauses get reviewed by two people who didn't write them.
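Changes 2 and 3 can be sketched together: run the backfill inside a transaction, compare the affected row count against a number agreed on in advance (for example, from the replica run), and roll back on any mismatch. This is an illustration under assumptions, not our actual tooling: `guarded_backfill`, the `orders` table, and the `region` column are hypothetical, and SQLite again stands in for the real database.

```python
import sqlite3


def guarded_backfill(conn, update_sql, expected_rows):
    """Run update_sql in a transaction; commit only if it touched
    exactly expected_rows rows, otherwise roll back.

    Hypothetical helper for illustration.
    """
    cur = conn.execute(update_sql)  # sqlite3 opens a transaction before DML
    if cur.rowcount != expected_rows:
        conn.rollback()  # the explicit rollback plan
        return False
    conn.commit()
    return True


# Hypothetical schema: `region` plays the role of the new column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, region TEXT)"
)
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("pending",), ("shipped",), ("cancelled",), ("pending",)],
)
conn.commit()

# The replica run said 2 rows should change. The buggy clause touches 4.
bad = guarded_backfill(
    conn,
    "UPDATE orders SET region = 'unknown' WHERE status = 'pending' OR 1=1",
    expected_rows=2,
)
after_bad = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE region IS NOT NULL"
).fetchone()[0]

# The corrected clause touches exactly the expected rows and commits.
good = guarded_backfill(
    conn,
    "UPDATE orders SET region = 'unknown' WHERE status = 'pending'",
    expected_rows=2,
)
after_good = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE region IS NOT NULL"
).fetchone()[0]

print(bad, after_bad, good, after_good)
```

The expected count is passed in rather than computed from the same WHERE clause, so a typo in the clause cannot quietly agree with itself.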
The Learning
The scariest bugs aren't the ones that crash. They're the ones that succeed.