The migration that took six months
We thought it would take two weeks. We were off by a factor of about twelve.
The Plan
Move from PostgreSQL to... PostgreSQL. Different host. Better hardware. Should be simple: dump, restore, flip DNS.
The First Problem
The dump was 2 TB. Restore took 18 hours. We couldn't afford 18 hours of downtime.
The Second Problem
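Logical replication. Set it up. Worked great. Except for the tables with no primary key. Without a primary key (or some other replica identity), PostgreSQL can't replicate UPDATEs and DELETEs for a table. 340 tables with no primary key.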
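Finding those tables up front is a catalog query. Here is a minimal sketch, assuming a DB-API cursor (for example from psycopg) connected to the source database. The function name `tables_without_primary_key` and the list of excluded schemas are illustrative choices, not what we actually ran:

```python
# Catalog query: ordinary tables (relkind 'r') with no PRIMARY KEY
# constraint (contype 'p'). Partitioned parents (relkind 'p') are not
# covered by this sketch.
NO_PK_SQL = """
SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
      SELECT 1
      FROM pg_constraint con
      WHERE con.conrelid = c.oid
        AND con.contype = 'p'
  )
ORDER BY n.nspname, c.relname
"""


def tables_without_primary_key(cursor):
    """Return schema-qualified names of tables that lack a primary key.

    `cursor` is any DB-API 2.0 cursor; the connection setup is assumed
    to happen elsewhere.
    """
    cursor.execute(NO_PK_SQL)
    return [f"{schema}.{table}" for schema, table in cursor.fetchall()]
```

Running this before setting up the publication would have told us about the 340 on day one instead of week three.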
The Third Problem
Adding primary keys to 340 tables in production. Each one a migration. Each migration a risk. Each risk a conversation.
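One common way to lower the risk per table is to build the unique index concurrently and then promote it to a primary key, so the long-running part does not block writes. The sketch below only generates the two statements. The helper names and the `<table>_pkey` naming are assumptions for illustration, and it assumes the chosen columns are already unique:

```python
from __future__ import annotations


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def add_primary_key_statements(schema: str, table: str, columns: list[str]) -> list[str]:
    """Build the two-step SQL for adding a primary key with minimal locking."""
    index = quote_ident(f"{table}_pkey")  # hypothetical naming convention
    qualified = f"{quote_ident(schema)}.{quote_ident(table)}"
    cols = ", ".join(quote_ident(c) for c in columns)
    return [
        # Step 1: build the unique index without blocking writes.
        # CONCURRENTLY cannot run inside a transaction block.
        f"CREATE UNIQUE INDEX CONCURRENTLY {index} ON {qualified} ({cols});",
        # Step 2: promote the index to a primary key. This takes a brief
        # exclusive lock on the table.
        f"ALTER TABLE {qualified} ADD CONSTRAINT {index} PRIMARY KEY USING INDEX {index};",
    ]


for statement in add_primary_key_statements("public", "events", ["id"]):
    print(statement)
```

One caveat: if the key columns are not already marked NOT NULL, the second statement has to scan the whole table to verify there are no nulls, and it does that while holding its lock. Multiply that by 340 and you get the conversations.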
The Fourth Problem
Replication lag. During peak hours, we'd fall behind. During off-peak, we'd catch up. We needed to cut over during off-peak. Our off-peak was someone else's peak.
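Lag is measurable, which at least makes the waiting less superstitious. On the publisher, `pg_current_wal_lsn()` and the `confirmed_flush_lsn` column of `pg_replication_slots` give two WAL positions, and the gap between them is the lag in bytes. Here is a minimal sketch of that arithmetic plus a cutover check. The threshold, the sample count, and the function names are all assumptions for illustration:

```python
from __future__ import annotations


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers: the high and low
    32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(publisher_lsn: str, confirmed_lsn: str) -> int:
    """Bytes of WAL the subscriber has not yet confirmed."""
    return parse_lsn(publisher_lsn) - parse_lsn(confirmed_lsn)


def ready_to_cut_over(lag_samples: list[int], max_lag_bytes: int, required_consecutive: int) -> bool:
    """True if the most recent samples have all stayed under the threshold."""
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    if len(lag_samples) < required_consecutive:
        return False
    recent = lag_samples[-required_consecutive:]
    return all(sample <= max_lag_bytes for sample in recent)
```

PostgreSQL's own `pg_wal_lsn_diff()` does the same subtraction in SQL. The point of requiring several consecutive quiet samples is that one good reading during peak hours means nothing.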
The Solution
Six months of: adding primary keys, optimizing queries, reducing write load, testing failover, practicing recovery, documenting everything, and waiting for the right moment.
The Lesson
Every migration is a negotiation with your past self. And your past self made some questionable decisions.