Back to notes
FAILURE MODES

The deploy that deleted production

We didn't mean to. Nobody ever does.

The Context

Database migration. New column. Backfill script. Standard stuff. We'd done it a hundred times.

The Mistake

The backfill script had a WHERE clause. The WHERE clause had a typo. Instead of updating rows where `status = 'pending'`, it updated rows where `status = 'pending' OR 1=1`. Every row. Every table that FK'd to it. Cascade delete was on.

The Recovery

Point-in-time recovery. 4 hours of data loss. 6 hours of downtime. One very long post-mortem.

The Changes

1. No more cascade deletes. Ever.

2. All migrations run in transactions with explicit rollback plans.

3. Backfill scripts run against a replica first.

4. WHERE clauses get reviewed by two people who didn't write them.

The Learning

The scariest bugs aren't the ones that crash. They're the ones that succeed.