We use cookies for site analytics. Accept to help us understand how the site is used. See our Privacy Policy for details.
Consistent hashing, eviction, replication, and what really happens when a single hot key takes down the cluster.
Design a distributed in-memory cache: a cluster of nodes that stores key-value pairs with sub-millisecond access latency, sharded across the fleet, with eviction, replication, and graceful failure handling. Clients call get(key) / set(key, value, ttl) / del(key); the system handles the routing, sharding, eviction, and failures invisibly.
This question is dense with specific algorithmic choices (consistent hashing, eviction, quorums) and is a favorite for distributed-systems-heavy roles. Strong candidates name the algorithms, explain the trade-offs, and discuss failure modes explicitly.
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
Memory
Throughput
Network
Connections
Replication factor
The system has three layers:
Critical design decision: where does sharding logic live? Two patterns:
For ultra-low-latency systems, client-side wins. For polyglot environments, proxy wins.
Hashes the key (consistent hash), picks the owner node, opens a connection, retries on miss / timeout, handles topology updates.
Holds shards. Implements get/set/del, TTL, eviction. Coordinates with replicas. Reports liveness to the coordinator.
Replicates writes to N-1 followers. Handles read fallback when primary is down.
Source of truth for cluster topology. Detects node failures, triggers shard rebalancing. Often Raft-backed (Redis Sentinel, ZooKeeper, etcd).
Pushes topology changes to clients. Gossip protocol or pull-from-coordinator.
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
Naive sharding by hash(key) mod N is wrong. Adding or removing a single node remaps O(N) of the keyspace, causing a cluster-wide cache stampede. Consistent hashing fixes this.
Algorithm
Map each node to a position on a 2^32 ring (e.g., hash(node_id)). For a key, hash it onto the ring; the owner is the next node clockwise. Adding a node remaps only the keys between the new node and its predecessor - O(K/N) keys, not O(K).
Virtual nodes (vnodes)
Each physical node owns multiple positions on the ring (e.g., 256 vnodes per node). This solves two problems:
Shard-to-node mapping
Some systems (Redis Cluster) use a fixed shard count (16,384 slots) instead of a continuous ring. Each shard maps to a node. Easier to reason about; same end behavior.
Topology updates
When the membership service detects a node failure, it publishes a new topology version. Clients hold both the old and new topology briefly (overlap window) to handle in-flight requests, then drop the old.
Hash function choice
MurmurHash, xxHash, or CityHash - all fast, all uniform. Don't use cryptographic hashes (SHA-256) on the hot path; they're 100x slower for no benefit here. Don't use FNV-1a; its distribution is mediocre at high collision rates.
When a node hits its memory ceiling, it must evict. The eviction policy determines which keys leave.
LRU (Least Recently Used)
Maintain a doubly-linked list ordered by access. On hit, move to front. On evict, remove tail. Simple, well-understood, works for recency-skewed workloads.
Problem: LRU is fooled by single-pass scans. A query that scans a million cold keys evicts all the hot ones.
LFU (Least Frequently Used)
Track per-key access count; evict lowest count. Works when frequency, not recency, is the predictor. Problem: a "frequent in 2019, never since" key sits forever. Mitigation: aging (decay counts over time).
ARC (Adaptive Replacement Cache)
Tracks both recency and frequency, dynamically balances between them. Patented by IBM (still encumbered for some uses). Excellent in practice; complex to implement.
TinyLFU + Window-LRU (the modern winner)
A small Window-LRU (1% of cache) admits new keys; a TinyLFU sketch decides whether a candidate for eviction from the main cache is more frequent than the candidate for admission. If yes, swap; if no, don't admit. This solves the scan problem (one-time scans don't displace hot keys).
Used by Caffeine (Java), W-TinyLFU (research), Redis 6+ (with allkeys-lfu).
Practical recommendation
Start with allkeys-lru. Move to TinyLFU if you observe scan pollution (a sudden access pattern shift evicts the hot working set).
Per-key vs per-shard
Eviction is per-shard (per memory pool). Cross-shard eviction would require a global ordering, which costs more than the eviction is worth.
Replication exists for two reasons: durability under node failure, and read scaling. The trade-off: how many replicas, and what's the consistency guarantee?
Asynchronous primary-replica (default for most caches)
Writes go to the primary; primary replicates to followers asynchronously. Followers serve reads.
Pros: low write latency.
Cons: a primary failure can lose un-replicated writes. Reads from followers can be slightly stale.
Synchronous primary-replica
Primary waits for at least N replicas to ack before returning success.
Pros: no data loss on single-node failure.
Cons: write latency = max(replica latencies). One slow replica drags the whole write.
Quorum (Dynamo-style)
N replicas total, write to W replicas, read from R replicas. If W + R > N, reads see the latest write. Common: N=3, W=2, R=2.
Pros: tunable consistency. Survives N - W replica failures on writes, N - R on reads.
Cons: read amplification (each read hits R replicas), reconcile divergent values.
For caches, async primary-replica is usually correct. Caches are recoverable - on data loss, refill from source. The latency cost of synchronous replication is rarely worth the durability gain.
Failover
On primary failure: coordinator detects (heartbeat timeout, ~3 seconds). Promotes the most up-to-date replica. Updates topology. Clients switch via topology pull.
Window of inconsistency: ~5 seconds during failover. Clients may see misses for affected keys. App tier should treat misses as cache misses (fetch from source) - don't crash on miss.
Split-brain prevention
A network partition can cause two primaries to be elected for the same shard. Solutions:
Caches typically tolerate brief split-brain (you might have stale data) since the source of truth is elsewhere.
Even with perfect consistent hashing, a single hot key (one viral product, one celebrity profile) can saturate the node that owns it. Detection is straightforward; mitigation has options.
Detection
Each node maintains a top-K access frequency sketch (Count-Min Sketch or simple top-100 LRU). When a single key exceeds N rps (e.g., 10K rps), flag it as hot.
Mitigation 1: Local replication
Push hot keys to all nodes. Reads stay local; only writes hit the original owner. Trades memory for read scale. Effective; bounded blast radius.
Mitigation 2: Request coalescing
At the app tier (or in the smart client), suppress duplicate in-flight requests for the same key. 1000 concurrent gets become 1 backend get + 999 followers waiting on the same future. Reduces hot-key load by 10-100x in burst scenarios.
Mitigation 3: Read-your-writes via primary; reads via replicas
Hot reads are served by replicas (linearly scalable - add more replicas). Writes still bottleneck on the primary, but writes are usually << reads.
Mitigation 4: Sharded hot keys
For a hot key with a list value (top trending products), split into shards: trending_0, trending_1, ..., trending_N. Each shard lives on a different node. Reads pick a random shard. Trades exact correctness for load distribution.
Anti-pattern: cache stampede
A hot key expires; all clients miss simultaneously; all hit the source DB; DB melts. Mitigation: probabilistic early refresh (refresh as TTL approaches) and request coalescing on the source side.
Cell-based architecture (Twitter's solution)
Partition users into cells. Each cell has its own cache cluster, smaller and locally hot. A single hot user is a hot key on their cell's cluster - bounded blast radius.
Single-region caches are easy. Multi-region adds new failure modes and trade-offs.
Active-passive
Primary region serves all traffic; secondary holds a warm standby. On primary region failure, DNS/anycast routes to secondary. Secondary cache is warmed by replication or by replay from source.
Pros: simple consistency.
Cons: secondary capacity sits idle in steady state.
Active-active
Both regions serve traffic for their local users. Writes replicate cross-region asynchronously.
Pros: full capacity utilization, lower latency for users (regional reads).
Cons: cross-region replication conflicts.
Conflict resolution
Two regions write the same key concurrently. Approaches:
For caches, LWW is almost always sufficient. The source of truth is downstream; cache divergence is recoverable.
Geographic affinity
Route users to their local region's cache. A user moving regions (rare) hits a cold cache briefly; warmed by their next few requests.
Replication lag and SLAs
Cross-region replication lag is typically 50-500ms. If your app cares about read-your-writes across regions, you must handle this explicitly (route writes back to the user's home region).
Cost
Multi-region replication doubles or triples your cache footprint and adds significant cross-region bandwidth costs ($0.02/GB on AWS). For a 10 TB cache with 10M ops/sec, cross-region replication can cost more than the cache itself. Justify it by user latency or DR requirements - not because it sounds good.
Three failure modes that interviewers love.
Cache stampede
A hot key expires; all clients simultaneously miss; all simultaneously refresh against the source. The source crashes.
Mitigations:
Thundering herd on cold start
Cluster restart or new region brings up empty caches. App tier hits source for everything. Source overwhelmed.
Mitigations:
Hot-key cache warming after node failure
Failed node's shards are reassigned to neighbors. Neighbors have cold caches for those keys. App tier hits the source DB for them. Source briefly load-spikes.
Mitigations:
The interview signal: candidates who name these failure modes and discuss mitigations score significantly higher than those who only sketch the happy path.
Client-side sharding vs proxy
Client-side: lower latency, more complex client. Proxy: simpler client, one extra hop. For ultra-low-latency systems (high-frequency trading, ad serving), client-side. For polyglot environments and easier ops, proxy.
Sync vs async replication
Sync: no data loss; higher write latency. Async: faster writes; possible data loss on primary failure. For caches, async is almost always right - data is recoverable from the source.
LRU vs TinyLFU vs ARC
LRU is the lowest-effort default. TinyLFU is the modern best-in-class. ARC was the gold standard but is patent-encumbered. Move to TinyLFU if you observe scan pollution; otherwise LRU is fine.
Single-region simplicity vs multi-region availability
Multi-region is expensive (capacity + bandwidth) and adds consistency complexity. Don't do it unless you have a real DR or latency requirement. Many apps run a single-region cache with a cold standby and accept brief outages on regional failure.
Smart clients vs dumb clients
Smart clients can do interesting things: client-side load balancing, request hedging, adaptive timeouts. They also need synchronized rollouts when behavior changes. Dumb clients + smart proxies trade latency for ops simplicity.
Cache as truth vs cache as optimization
Treating the cache as the source of truth (write-through, no source DB) is a tempting simplification but couples your durability story to the cache. Most production designs treat the cache as a derived view of the source - the cache can be wiped without data loss.
Be ready for at least three of these. The first one is almost always asked.
The canonical bounded system design problem. Read-heavy, hot-key prone, and a great vehicle for hashing, caching, and capacity estimation.
The classic write-vs-read amplification trade-off. Push, pull, or hybrid fanout - and how to handle the celebrity user with 100M followers.
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
Start an AI mock interview →