The cache that lied
Every cache is a bet against time. This one lost.
The Setup
We had a distributed cache sitting in front of our user preferences service. Standard pattern: check cache first, fall back to database, populate cache on miss. TTL of 15 minutes. Simple.
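For anyone who hasn't written this pattern lately, here is a minimal sketch of it. It is not our production code: the `CacheAside` class, the `load_from_db` callback, and the in-memory dict standing in for the distributed cache are all hypothetical. The clock is injectable, which matters later.

```python
import time
from typing import Callable, Dict, Tuple

TTL_SECONDS = 15 * 60  # the 15-minute TTL described above


class CacheAside:
    """Check cache first, fall back to the database, populate cache on miss."""

    def __init__(
        self,
        load_from_db: Callable[[str], dict],   # hypothetical database loader
        now: Callable[[], float] = time.time,  # the clock used for TTL checks
    ) -> None:
        self._load_from_db = load_from_db
        self._now = now
        # In-memory stand-in for the distributed cache: key -> (value, expires_at)
        self._entries: Dict[str, Tuple[dict, float]] = {}

    def get(self, user_id: str) -> dict:
        entry = self._entries.get(user_id)
        if entry is not None:
            value, expires_at = entry
            if self._now() < expires_at:
                return value  # cache hit, still fresh
        # Miss or expired: read from the database and repopulate the cache.
        value = self._load_from_db(user_id)
        self._entries[user_id] = (value, self._now() + TTL_SECONDS)
        return value
```

Note that whoever supplies `now` decides when an entry is stale.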
The Lie
Except the cache wasn't expiring. Or rather, it was expiring exactly when it said it would, but the clocks it was being checked against had drifted apart. One node was 47 seconds behind. Another was 2 minutes ahead.
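To put numbers on it, here is a back-of-the-envelope sketch. It assumes an entry carries an absolute expiry timestamp stamped by one node's clock and checked by another's. That is an assumption about the mechanism, and `effective_ttl` is a hypothetical helper, not something from our codebase.

```python
TTL_SECONDS = 15 * 60  # nominal 15-minute TTL


def effective_ttl(writer_offset: float, reader_offset: float,
                  ttl: float = TTL_SECONDS) -> float:
    """Real seconds an entry survives when two skewed clocks are involved.

    Offsets are seconds relative to true time (negative = behind).
    The writer stamps expires_at = true_write_time + writer_offset + ttl.
    The reader treats the entry as expired once
    true_time + reader_offset >= expires_at, which rearranges to
    true_time - true_write_time >= ttl + writer_offset - reader_offset.
    """
    return ttl + writer_offset - reader_offset


# Written by the node 47 s behind, read by the node 2 min ahead:
print(effective_ttl(-47, 120))   # 733 seconds, expires 167 s early
# Written by the node 2 min ahead, read by the node 47 s behind:
print(effective_ttl(120, -47))   # 1067 seconds, stale for an extra 167 s
```

Under that assumption, the same 15-minute TTL could mean anything from about 12 to about 18 minutes depending on which pair of nodes touched the entry.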
The Symptom
Users would update their preferences. Sometimes the change would stick. Sometimes it wouldn't. No pattern we could find in the logs. QA couldn't reproduce it. Production kept lying.
The Fix
NTP wasn't the problem. NTP was fine. The problem was that each node was checking TTLs against its own local system time instead of the cache server's time. Three lines of code. Two weeks of debugging.
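The shape of the change, roughly. This is a sketch, not the actual three lines: `FakeCacheServer`, `server_time`, and both `is_expired_*` helpers are hypothetical names, and a real cache client would expose its own way of reading the server clock or of letting the server enforce the TTL itself.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class FakeCacheServer:
    """Hypothetical stand-in for the cache server. It owns the one clock
    that every TTL decision is supposed to use."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[dict, float]] = {}

    def server_time(self) -> float:
        return self._clock()

    def set(self, key: str, value: dict, ttl_seconds: float) -> None:
        # Expiry is stamped with the server's clock, not the caller's.
        self._store[key] = (value, self._clock() + ttl_seconds)

    def get_raw(self, key: str) -> Optional[Tuple[dict, float]]:
        return self._store.get(key)


def is_expired_local(expires_at: float,
                     local_clock: Callable[[], float] = time.time) -> bool:
    # The buggy shape: each node compares against its own system time.
    return local_clock() >= expires_at


def is_expired_server(expires_at: float, server: FakeCacheServer) -> bool:
    # The fixed shape: every node compares against the cache server's time.
    return server.server_time() >= expires_at
```

With the second version, a node that is 47 seconds behind and a node that is 2 minutes ahead give the same answer, because neither of them is asked what time it is.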
The Lesson
Distributed systems have exactly one clock: the one you agree on. Everything else is mostly fiction.