Back to notes
FAILURE MODES

The cache that lied

Every cache is a bet against time. This one lost.

The Setup

We had a distributed cache sitting in front of our user preferences service. Standard pattern: check cache first, fall back to database, populate cache on miss. TTL of 15 minutes. Simple.

The Lie

Except the cache wasn't expiring. Or rather, it was expiring exactly when it said it would—but the clock it was checking against had drifted. One node was 47 seconds behind. Another was 2 minutes ahead.

The Symptom

Users would update their preferences. Sometimes the change would stick. Sometimes it wouldn't. No pattern we could find in the logs. QA couldn't reproduce it. Production kept lying.

The Fix

NTP wasn't the problem. NTP was fine. The problem was that we were using local system time for TTL checks instead of the cache server's time. Three lines of code. Two weeks of debugging.

The Lesson

Distributed systems have exactly one clock: the one you agree on. Everything else is mostly fiction.