
Design a Distributed Cache (Memcached / Redis Cluster)


Consistent hashing, eviction, replication, and what really happens when a single hot key takes down the cluster.

The problem

Design a distributed in-memory cache: a cluster of nodes that stores key-value pairs with sub-millisecond access latency, sharded across the fleet, with eviction, replication, and graceful failure handling. Clients call get(key) / set(key, value, ttl) / del(key); the system handles the routing, sharding, eviction, and failures invisibly.
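
A minimal cache-aside sketch of that API, assuming hypothetical cache and db clients (the method names are illustrative, not a specific library's API):

```
# Cache-aside read path: try the cache, fall back to the source of truth,
# then repopulate with a TTL. `cache`, `db`, and their methods are assumptions.
def get_user(cache, db, user_id: str, ttl_seconds: int = 300):
    key = f"user:{user_id}"
    value = cache.get(key)
    if value is not None:                 # hit: sub-millisecond path
        return value
    value = db.load_user(user_id)         # miss: fetch from the source of truth
    cache.set(key, value, ttl_seconds)    # repopulate for the next reader
    return value

def update_user(cache, db, user_id: str, new_value) -> None:
    db.save_user(user_id, new_value)
    cache.delete(f"user:{user_id}")       # invalidate rather than update in place
```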

This question is dense with specific algorithmic choices (consistent hashing, eviction, quorums) and is a favorite for distributed-systems-heavy roles. Strong candidates name the algorithms, explain the trade-offs, and discuss failure modes explicitly.

Clarifying questions

Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.

  • What's the dominant access pattern - read-heavy (cache-aside) or write-heavy (write-through)?
  • Are values small (< 1KB, like session tokens) or large (MB, like rendered pages)? Mixed?
  • What's the scale - total memory footprint, ops/sec, number of nodes?
  • Multi-region or single-region? If multi-region, active-active or active-passive?
  • Persistence required, or is the cache purely volatile?
  • Consistency requirements - linearizable, eventual, or 'best effort'?
  • What's the failure tolerance - how many node failures must the system survive without data loss / availability loss?
  • Are clients trusted (internal) or untrusted (public)? Authentication, encryption needs?

Requirements

Functional requirements

  • GET key → value (or miss)
  • SET key, value, ttl
  • DEL key
  • Atomic increment / compare-and-swap (optional)
  • Cluster transparently shards data across nodes
  • TTL-based expiration
  • LRU/LFU/TinyLFU eviction when memory is full

Non-functional requirements

Scale
10 TB total cached data, 100 nodes (100 GB each), 10M ops/sec aggregate, 100K ops/sec/node steady-state, peak 500K ops/sec/node.
Latency
p99 < 1ms intra-region, p999 < 5ms. Network round-trip dominates - the cache itself responds in microseconds.
Availability
99.99% available. A single node failure must not impact availability; data on the failed node is either replicated or rebuilt from the source.
Consistency
Eventual consistency by default. Optional strong consistency on a per-key basis for use cases that need it (rare; almost always cache-aside is fine).

Capacity estimation

Memory

  • 10 TB across 100 nodes = 100 GB/node. Reserve 20% headroom for fragmentation, memory churn during eviction → 80 GB usable per node.

Throughput

  • 10M ops/sec aggregate. Distributed across 100 nodes = 100K ops/sec/node. Modern hardware easily handles 500K ops/sec on a single node for small key/value pairs (Redis at ~1M ops/sec, Memcached higher).

Network

  • Average op = 1KB request + 1KB response = 2KB on the wire. 10M ops/sec × 2KB = 20 GB/s aggregate. Per node: 200 MB/s = 1.6 Gbps. Need 10 Gbps NICs minimum.

Connections

  • Each app server holds a persistent connection to every cache node. 1000 app servers × 100 cache nodes = 100K connections cluster-wide, i.e. about 1,000 connections per cache node. Manageable - cache servers do this all day.

Replication factor

  • RF = 3 (one primary + two replicas). Effective capacity drops from 10 TB to 3.3 TB usable. Worth it for failure tolerance: any 2 nodes can fail without data loss for a partition.
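
The same back-of-envelope arithmetic as a quick script, using the assumed ~1 KB request and ~1 KB response per operation:

```
# Back-of-envelope capacity math for the numbers above.
total_data_gb = 10_000            # 10 TB
nodes         = 100
ops_per_sec   = 10_000_000
bytes_per_op  = 2 * 1024          # ~1 KB request + ~1 KB response
replication   = 3

per_node_gb   = total_data_gb / nodes                     # 100 GB raw per node
usable_gb     = per_node_gb * 0.8                         # 80 GB after 20% headroom
per_node_ops  = ops_per_sec / nodes                       # 100K ops/sec/node
per_node_gbps = per_node_ops * bytes_per_op * 8 / 1e9     # ~1.6 Gbps on the wire
usable_tb     = total_data_gb / 1000 / replication        # ~3.3 TB effective at RF=3

print(per_node_gb, usable_gb, per_node_ops, round(per_node_gbps, 2), round(usable_tb, 1))
```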

High-level architecture

The system has three layers:

  1. Client library (sometimes called a "smart client"): hashes the key, picks the right node, retries on failure. Most of the routing intelligence lives here.
  2. Cache nodes: in-memory hash table per shard, with eviction, TTL, replication coordinator.
  3. Cluster coordinator (sometimes consensus-based, sometimes a lightweight membership service): tracks which nodes are alive, what shards they own, and propagates topology changes.

Critical design decision: where does sharding logic live? Two patterns:

  • Client-side sharding (Memcached, Redis Cluster's smart-client mode): clients know the topology, route directly. Lower latency (no proxy hop). Higher complexity in clients.
  • Proxy-based sharding (Twemproxy, Envoy): clients talk to a proxy that knows the topology. One extra hop. Simpler clients. Easier to upgrade topology.

For ultra-low-latency systems, client-side wins. For polyglot environments, proxy wins.

Smart client library

Hashes the key (consistent hash), picks the owner node, opens a connection, retries on miss / timeout, handles topology updates.

Cache node

Holds shards. Implements get/set/del, TTL, eviction. Coordinates with replicas. Reports liveness to the coordinator.

Replication coordinator (per node)

Replicates writes to N-1 followers. Handles read fallback when primary is down.

Cluster coordinator / membership service

Source of truth for cluster topology. Detects node failures, triggers shard rebalancing. Usually backed by a consensus or quorum protocol (etcd with Raft, ZooKeeper with ZAB, or Redis Sentinel's own quorum-based election).

Configuration distribution

Pushes topology changes to clients. Gossip protocol or pull-from-coordinator.

Deep dives

The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.

1. Consistent hashing and virtual nodes

Naive sharding by hash(key) mod N is wrong. Adding or removing a single node remaps nearly the entire keyspace (roughly (N-1)/N of all keys), causing a cluster-wide cache stampede. Consistent hashing fixes this.

Algorithm
Map each node to a position on a 2^32 ring (e.g., hash(node_id)). For a key, hash it onto the ring; the owner is the next node clockwise. Adding a node remaps only the keys between the new node and its predecessor - O(K/N) keys, not O(K).

Virtual nodes (vnodes)
Each physical node owns multiple positions on the ring (e.g., 256 vnodes per node). This solves two problems:

  1. Smooths load distribution. With 100 physical nodes and no vnodes, hashing variance creates 2-3x load imbalance. With 256 vnodes per node, imbalance drops to ~10%.
  2. Smooths rebalancing. Removing a physical node shifts each of its vnodes to a different physical neighbor, spreading the load.
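
A minimal sketch of a ring with virtual nodes. md5 keeps the example dependency-free; a production client would use a fast non-cryptographic hash such as xxHash or MurmurHash:

```
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes_per_node=256):
        self.vnodes_per_node = vnodes_per_node
        self._ring = []                      # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        # Each physical node owns many positions, smoothing load and rebalancing.
        for i in range(self.vnodes_per_node):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def owner(self, key: str) -> str:
        # The owner is the first vnode clockwise from hash(key), wrapping around.
        idx = bisect.bisect_left(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing([f"cache-{i:02d}" for i in range(100)])
print(ring.owner("user:12345:session"))
```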

Shard-to-node mapping
Some systems (Redis Cluster) use a fixed shard count (16,384 slots) instead of a continuous ring. Each shard maps to a node. Easier to reason about; same end behavior.

Topology updates
When the membership service detects a node failure, it publishes a new topology version. Clients hold both the old and new topology briefly (overlap window) to handle in-flight requests, then drop the old.

Hash function choice
MurmurHash, xxHash, or CityHash - all fast, all uniform. Don't use cryptographic hashes (SHA-256) on the hot path; they're 100x slower for no benefit here. Avoid FNV-1a too; its avalanche behavior is weak, which shows up as uneven distribution across the ring.

2. Eviction: LRU vs LFU vs TinyLFU

When a node hits its memory ceiling, it must evict. The eviction policy determines which keys leave.

LRU (Least Recently Used)
Maintain a doubly-linked list ordered by access. On hit, move to front. On evict, remove tail. Simple, well-understood, works for recency-skewed workloads.

Problem: LRU is fooled by single-pass scans. A query that scans a million cold keys evicts all the hot ones.
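
A textbook LRU in a few lines, with Python's OrderedDict standing in for the hash table + doubly-linked list pairing:

```
from collections import OrderedDict

class LRUCache:
    """Hash table plus recency-ordered list, here via OrderedDict (sketch)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                        # miss
        self._data.move_to_end(key)            # hit: mark most recently used
        return self._data[key]

    def set(self, key, value) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)     # evict the least recently used
```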

LFU (Least Frequently Used)
Track per-key access count; evict lowest count. Works when frequency, not recency, is the predictor. Problem: a "frequent in 2019, never since" key sits forever. Mitigation: aging (decay counts over time).

ARC (Adaptive Replacement Cache)
Tracks both recency and frequency, dynamically balances between them. Patented by IBM, which long discouraged adoption. Excellent in practice; complex to implement.

TinyLFU + Window-LRU (the modern winner)
A small window LRU (~1% of the cache) absorbs new keys; when an entry graduates from the window into the main cache, a TinyLFU frequency sketch compares it against the main cache's eviction victim. If the newcomer is estimated to be more frequent, it replaces the victim; otherwise it is not admitted. This solves the scan problem (one-time scans don't displace hot keys).

Used by Caffeine (Java), which implements the W-TinyLFU policy from the research literature. Redis (4.0+) ships a related approximated LFU with counter decay via allkeys-lfu, though it is not a TinyLFU sketch.
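
A sketch of the admission decision, using an exact counter where a real implementation would use a Count-Min Sketch with periodic halving ("aging"):

```
from collections import Counter

class TinyLFUAdmission:
    """W-TinyLFU-style admission filter (simplified sketch)."""

    def __init__(self):
        self._freq = Counter()   # real systems: Count-Min Sketch, periodically halved

    def record_access(self, key) -> None:
        self._freq[key] += 1

    def admit(self, candidate, victim) -> bool:
        # The window-evicted candidate enters the main cache only if it is
        # estimated to be more popular than the main cache's eviction victim;
        # a one-pass scan key loses this comparison and never pollutes the cache.
        return self._freq[candidate] > self._freq[victim]
```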

Practical recommendation
Start with allkeys-lru. Move to TinyLFU if you observe scan pollution (a sudden access pattern shift evicts the hot working set).

Per-key vs per-shard
Eviction is per-shard (per memory pool). Cross-shard eviction would require a global ordering, which costs more than the eviction is worth.

3. Replication and quorum

Replication exists for two reasons: durability under node failure, and read scaling. The trade-off: how many replicas, and what's the consistency guarantee?

Asynchronous primary-replica (default for most caches)
Writes go to the primary; primary replicates to followers asynchronously. Followers serve reads.
Pros: low write latency.
Cons: a primary failure can lose un-replicated writes. Reads from followers can be slightly stale.

Synchronous primary-replica
Primary waits for at least N replicas to ack before returning success.
Pros: no data loss on single-node failure.
Cons: write latency = max(replica latencies). One slow replica drags the whole write.

Quorum (Dynamo-style)
N replicas total, write to W replicas, read from R replicas. If W + R > N, reads see the latest write. Common: N=3, W=2, R=2.
Pros: tunable consistency. Survives N - W replica failures on writes, N - R on reads.
Cons: read amplification (each read hits R replicas), reconcile divergent values.
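
A sketch of the W-of-N write path, assuming replica objects exposing a set method; a real client fans out in parallel rather than sequentially:

```
def quorum_write(replicas, key, value, w: int) -> bool:
    """Succeed once w of the N replicas acknowledge the write (sketch)."""
    acks = 0
    for replica in replicas:        # illustrative; real clients fan out in parallel
        try:
            replica.set(key, value)
            acks += 1
            if acks >= w:
                return True         # quorum reached; remaining replicas catch up async
        except (TimeoutError, ConnectionError):
            continue                # tolerate up to N - w failures
    return False                    # quorum not reached; caller sees the failure
```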

For caches, async primary-replica is usually correct. Caches are recoverable - on data loss, refill from source. The latency cost of synchronous replication is rarely worth the durability gain.

Failover
On primary failure: coordinator detects (heartbeat timeout, ~3 seconds). Promotes the most up-to-date replica. Updates topology. Clients switch via topology pull.

Window of inconsistency: ~5 seconds during failover. Clients may see misses for affected keys. App tier should treat misses as cache misses (fetch from source) - don't crash on miss.

Split-brain prevention
A network partition can cause two primaries to be elected for the same shard. Solutions:

  • Quorum-based promotion: a replica only promotes if it can reach a majority of the cluster.
  • Fencing tokens: each new primary gets a monotonic generation number. Stale primaries (lower generation) are rejected.
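
A minimal fencing-token check, assuming the coordinator hands each newly promoted primary a monotonically increasing generation number:

```
class ShardStore:
    """Replica-side write path that rejects stale primaries (sketch)."""

    def __init__(self):
        self._generation = 0     # highest primary generation seen so far
        self._data = {}

    def write(self, generation: int, key, value) -> None:
        if generation < self._generation:
            # A primary cut off by a partition still thinks it owns the shard;
            # its lower generation number gets the write rejected here.
            raise PermissionError("stale primary: write fenced off")
        self._generation = generation
        self._data[key] = value
```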

Caches typically tolerate brief split-brain (you might have stale data) since the source of truth is elsewhere.

4. Hot key mitigation

Even with perfect consistent hashing, a single hot key (one viral product, one celebrity profile) can saturate the node that owns it. Detection is straightforward; mitigation has options.

Detection
Each node maintains a top-K access frequency sketch (Count-Min Sketch or a simple top-100 LRU). When a single key exceeds a threshold (e.g., 10K requests/sec), flag it as hot.
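
A naive sliding-window detector to make the idea concrete; it uses an exact Counter where production nodes would use a Count-Min Sketch or space-saving top-K to bound memory:

```
import time
from collections import Counter

class HotKeyDetector:
    """Per-node hot-key detection over a fixed window (illustrative sketch)."""

    def __init__(self, threshold_rps: int = 10_000, window_s: float = 1.0):
        self._threshold = threshold_rps * window_s
        self._window_s = window_s
        self._window_start = time.monotonic()
        self._counts = Counter()

    def record(self, key: str) -> bool:
        now = time.monotonic()
        if now - self._window_start >= self._window_s:
            self._counts.clear()                       # roll the window
            self._window_start = now
        self._counts[key] += 1
        return self._counts[key] >= self._threshold    # True => flag key as hot
```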

Mitigation 1: Local replication
Push hot keys to all nodes. Reads stay local; only writes hit the original owner. Trades memory for read scale. Effective; bounded blast radius.

Mitigation 2: Request coalescing
At the app tier (or in the smart client), suppress duplicate in-flight requests for the same key. 1000 concurrent gets become 1 backend get + 999 followers waiting on the same future. Reduces hot-key load by 10-100x in burst scenarios.
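
A sketch of in-process coalescing with futures (the same idea as Go's singleflight; the class and method names here are illustrative):

```
import threading
from concurrent.futures import Future

class Coalescer:
    """Collapse concurrent fetches for one key into a single backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}                       # key -> Future

    def get(self, key, fetch_fn):
        with self._lock:
            fut = self._inflight.get(key)
            if fut is not None:
                leader = False                    # someone else is already fetching
            else:
                fut = Future()
                self._inflight[key] = fut
                leader = True
        if leader:
            try:
                fut.set_result(fetch_fn(key))     # the one real backend call
            except Exception as exc:
                fut.set_exception(exc)
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
        return fut.result()                       # followers block on the leader's future
```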

Mitigation 3: Read-your-writes via primary; reads via replicas
Hot reads are served by replicas (linearly scalable - add more replicas). Writes still bottleneck on the primary, but writes are usually << reads.

Mitigation 4: Sharded hot keys
For a hot key with a list value (top trending products), split into shards: trending_0, trending_1, ..., trending_N. Each shard lives on a different node. Reads pick a random shard. Trades exact correctness for load distribution.
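
A sketch of the write/read fan-out, under an assumed shard count and a hypothetical cache client:

```
import random

HOT_KEY_SHARDS = 8   # assumed fan-out; size it to the observed hot-key load

def write_trending(cache, items) -> None:
    # Split the list across N keys; consistent hashing places each key on a
    # different owner node, so no single node absorbs all the reads.
    for i in range(HOT_KEY_SHARDS):
        cache.set(f"trending_{i}", items[i::HOT_KEY_SHARDS], ttl=60)

def read_trending(cache):
    # Read one randomly chosen shard: load spreads evenly, at the cost of
    # returning only a slice of the full list (the "exact correctness" trade).
    return cache.get(f"trending_{random.randrange(HOT_KEY_SHARDS)}")
```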

Anti-pattern: cache stampede
A hot key expires; all clients miss simultaneously; all hit the source DB; DB melts. Mitigation: probabilistic early refresh (refresh as TTL approaches) and request coalescing on the source side.

Cell-based architecture (Twitter's solution)
Partition users into cells. Each cell has its own cache cluster, smaller and locally hot. A single hot user is a hot key on their cell's cluster - bounded blast radius.

5. Multi-region and active-active

Single-region caches are easy. Multi-region adds new failure modes and trade-offs.

Active-passive
Primary region serves all traffic; secondary holds a warm standby. On primary region failure, DNS/anycast routes to secondary. Secondary cache is warmed by replication or by replay from source.
Pros: simple consistency.
Cons: secondary capacity sits idle in steady state.

Active-active
Both regions serve traffic for their local users. Writes replicate cross-region asynchronously.
Pros: full capacity utilization, lower latency for users (regional reads).
Cons: cross-region replication conflicts.

Conflict resolution
Two regions write the same key concurrently. Approaches:

  • Last-writer-wins (LWW) by timestamp. Simple. Loses one of the writes.
  • CRDTs (Conflict-Free Replicated Data Types). For sets, counters, lists: mathematically guaranteed convergence. Redis Enterprise's active-active databases use CRDTs; otherwise you build your own.
  • Vector clocks + application-level resolution. Most flexible; most complex.

For caches, LWW is almost always sufficient. The source of truth is downstream; cache divergence is recoverable.
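
LWW reduces to a one-line merge over stamped entries. A sketch assuming each write carries (value, timestamp, origin region), with the region id breaking timestamp ties so both sides converge:

```
def lww_merge(local, remote):
    """Last-writer-wins merge of two replicated entries (sketch).

    Each entry is (value, write_timestamp_ms, origin_region); the region id
    breaks timestamp ties deterministically so both replicas pick the same winner.
    """
    return max(local, remote, key=lambda entry: (entry[1], entry[2]))
```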

Geographic affinity
Route users to their local region's cache. A user moving regions (rare) hits a cold cache briefly; warmed by their next few requests.

Replication lag and SLAs
Cross-region replication lag is typically 50-500ms. If your app cares about read-your-writes across regions, you must handle this explicitly (route writes back to the user's home region).

Cost
Multi-region replication doubles or triples your cache footprint and adds significant cross-region bandwidth costs ($0.02/GB on AWS). For a 10 TB cache with 10M ops/sec, cross-region replication can cost more than the cache itself. Justify it by user latency or DR requirements - not because it sounds good.

6. Cache stampede, thundering herd, and warming

Three failure modes that interviewers love.

Cache stampede
A hot key expires; all clients simultaneously miss; all simultaneously refresh against the source. The source crashes.

Mitigations:

  • Probabilistic early refresh: refresh as TTL approaches expiration. Probability of refresh increases as TTL → 0. Spread the load over the last 10% of the TTL window (see the sketch after this list).
  • Request coalescing on source: source-side mutex (or async future) deduplicates concurrent fetches.
  • Stale-while-revalidate: on miss, serve the expired value while a background fetch refreshes. Bounded staleness; no stampede.
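
A sketch of the probabilistic early refresh from the first bullet, following the widely used "XFetch" formulation (refresh probability rises sharply as expiry approaches):

```
import math
import random
import time

def should_refresh_early(expiry_ts: float, recompute_cost_s: float, beta: float = 1.0) -> bool:
    """Return True if this request should refresh the key before its TTL expires.

    -log(U) for uniform U is exponentially distributed, so refreshes smear out
    over roughly the last `beta * recompute_cost_s` seconds of the TTL window
    instead of all firing at the expiry instant.
    """
    return time.time() - recompute_cost_s * beta * math.log(random.random()) >= expiry_ts

# Usage on the read path (cache client and helper names are illustrative):
# value, expiry_ts = cache.get_with_expiry(key)
# if value is None or should_refresh_early(expiry_ts, recompute_cost_s=0.05):
#     value = recompute_and_set(key)
```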

Thundering herd on cold start
Cluster restart or new region brings up empty caches. App tier hits source for everything. Source overwhelmed.

Mitigations:

  • Pre-warm: replay recent traffic to warm the cache before serving production.
  • Slow start: ramp app tier traffic to the new cluster gradually (5%, 10%, 50%, 100%).
  • Source-side rate limit: protect the source DB regardless of cache state. The cache is an optimization, not a load-bearing dependency.

Hot-key cache warming after node failure
Failed node's shards are reassigned to neighbors. Neighbors have cold caches for those keys. App tier hits the source DB for them. Source briefly load-spikes.

Mitigations:

  • Pre-replicated hot keys: the failed node's hot keys were already on neighbors (per the hot-key mitigation strategy).
  • Stagger reassignment: don't move all shards at once. Reassign one at a time, allowing each new owner to warm before the next.
  • Fall back gracefully on source: source DB has its own caches and circuit breakers - it must withstand the briefly elevated load.

The interview signal: candidates who name these failure modes and discuss mitigations score significantly higher than those who only sketch the happy path.

Trade-offs

Client-side sharding vs proxy
Client-side: lower latency, more complex client. Proxy: simpler client, one extra hop. For ultra-low-latency systems (high-frequency trading, ad serving), client-side. For polyglot environments and easier ops, proxy.

Sync vs async replication
Sync: no data loss; higher write latency. Async: faster writes; possible data loss on primary failure. For caches, async is almost always right - data is recoverable from the source.

LRU vs TinyLFU vs ARC
LRU is the lowest-effort default. TinyLFU is the modern best-in-class. ARC performs well but patent concerns limited its adoption. Move to TinyLFU if you observe scan pollution; otherwise LRU is fine.

Single-region simplicity vs multi-region availability
Multi-region is expensive (capacity + bandwidth) and adds consistency complexity. Don't do it unless you have a real DR or latency requirement. Many apps run a single-region cache with a cold standby and accept brief outages on regional failure.

Smart clients vs dumb clients
Smart clients can do interesting things: client-side load balancing, request hedging, adaptive timeouts. They also need synchronized rollouts when behavior changes. Dumb clients + smart proxies trade latency for ops simplicity.

Cache as truth vs cache as optimization
Treating the cache as the source of truth (write-through, no source DB) is a tempting simplification but couples your durability story to the cache. Most production designs treat the cache as a derived view of the source - the cache can be wiped without data loss.

Common follow-up questions

Be ready for at least three of these. The first one is almost always asked.

  • How would you migrate from 100 nodes to 200 nodes with zero downtime?
  • What changes if values are large (10 MB rendered pages)?
  • How do you handle a customer who needs strict consistency on a per-key basis?
  • What's your monitoring story - what do you alert on?
  • How do you protect the source DB if the entire cache cluster fails?
  • How would you implement transactions (multi-key atomic operations) on a sharded cache?
  • What's the security model - authentication, encryption, multi-tenancy?
  • How does the design change if you need persistence (Redis with AOF/RDB)?

Practice in interview format

Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.

Start an AI mock interview →