Design a Consensus System (Raft / Paxos / etcd)
Raft leader election, log replication, snapshots - and the CAP theorem in operational practice. The substrate every other distributed system stands on.
The problem
Design a strongly-consistent distributed coordination service - the kind that backs etcd, ZooKeeper, Consul. Multiple replicas accept writes. They agree on a totally-ordered log of operations. The system survives minority failures without data loss and without serving stale committed reads.
This is "implement Raft" framed as a system design question. Strong candidates explain leader election, log replication, the safety invariants, and the snapshot mechanism. Excellent candidates draw the operational reality - latency budgets, quorum sizing, cross-AZ vs cross-region trade-offs, and when consensus is the wrong tool.
Clarifying questions
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
- →What's the workload - small metadata (etcd-style) or full database (Spanner-style)?
- →Throughput target - hundreds, thousands, or tens of thousands of writes/sec?
- →Read consistency - linearizable, serializable, or eventual?
- →Cluster size - 3, 5, 7 nodes? Cross-AZ or cross-region?
- →Maximum allowed write latency - 10ms (single AZ), 100ms (cross-region)?
- →Snapshot frequency requirements - how much disk are we willing to spend?
- →Multi-tenancy - one cluster per use case, or shared?
- →Disaster recovery - active-passive cross-region, or eventual restore from backup?
Requirements
Functional requirements
- ·Propose: any node accepts a value; the cluster agrees on a total order
- ·Linearizable reads (read sees all writes that completed before)
- ·Watch / subscribe: clients notified when a key changes
- ·Atomic compare-and-swap (CAS) primitives
- ·Membership change: add or remove cluster nodes safely
- ·Snapshot and restore for compaction and recovery
- ·Survive any minority failure without data loss
Non-functional requirements
- Scale
- 10K-50K writes/sec for etcd-class. 1M+ keys. Up to 1GB total state. 5-7 node clusters. Read-heavy: 100K-1M reads/sec via read replicas / leader leases.
- Latency
- Write p99 < 20ms (single AZ), < 100ms (cross-region). Linearizable read p99 < 10ms (with leader lease) or < 50ms (round-trip quorum read).
- Availability
- Cluster survives floor((N-1)/2) failures - 5-node cluster survives 2 failures. During leader election (~150-1500ms), brief write unavailability. Reads can continue from replicas during election if relaxed consistency is acceptable.
- Consistency
- Linearizable. Every successful write is durable on a majority before ack. Every read returns the latest committed value (or fails). The system is CP (consistent + partition-tolerant); during partition, the minority side becomes unavailable.
Capacity estimation
State
- 1M keys × 1KB avg = 1 GB total state. Fits in memory on every node.
- Log: append-only. Trimmed by snapshots. Hot retention ~10K entries / 100MB.
Storage per node
- WAL (write-ahead log) on disk. ~10 GB rolling.
- Snapshots: ~1 GB each, kept ~3 generations. ~3 GB.
- Total: ~15 GB SSD per node. Trivial.
Throughput
- 50K writes/sec ceiling for a 5-node Raft cluster on commodity hardware.
- Each write: leader appends, replicates to 2+ followers, fsyncs WAL. ~5ms steady-state latency on local SSD + same-AZ network.
- Read throughput much higher - 500K+/sec from leader's memory state with linearizable read protocol.
Network
- Replication: leader sends append-entries to N-1 followers per write. ~3KB per RPC × 50K writes/sec × 4 followers = 600 MB/s outbound from leader.
- Cross-AZ inter-node traffic: 600 MB/s steady. Manageable.
Latency budget
- Local SSD fsync: 0.1-1ms.
- Same-AZ network round-trip: 0.5-1ms.
- Cross-AZ round-trip: 1-2ms.
- Cross-region round-trip: 30-200ms.
A write requires 1 fsync at leader + 1 majority of follower fsync acks. Same-AZ: ~3-5ms. Cross-AZ: ~5-10ms. Cross-region: 50-200ms.
Cluster size
- 3 nodes: tolerates 1 failure. Minimum for production.
- 5 nodes: tolerates 2 failures. Standard.
- 7+ nodes: tolerates 3 failures. Slower writes (more replicas to ack); rare beyond 5.
High-level architecture
The cluster is a small set of replicas, each running the same state machine. They elect one node as leader; the leader serializes all writes into a totally-ordered log; the log is replicated to a majority of replicas before being committed. Once committed, every replica applies the entry to its state machine.
The Raft protocol decomposes into three sub-problems:
- Leader election: pick exactly one leader at a time.
- Log replication: leader replicates entries to followers; majority is committed.
- Safety: prevent inconsistencies during leader changes.
The defining invariant: a committed entry is durable on a majority and is never overwritten. This is what makes the system linearizable.
Replica state machine
The application logic. A key-value map (etcd), file-tree (ZooKeeper), or arbitrary state. All replicas must apply log entries in the same order to converge.
Raft module
Implements the Raft protocol. Manages leader election, log replication, snapshot install, membership changes. Pluggable below the state machine.
Write-ahead log (WAL)
Durable append-only log on local disk. Each Raft entry is fsynced before ack. The log is the source of truth for recovery.
Snapshot store
Periodic compacted state machine snapshot. Lets a starting node skip replaying the entire log. Stored on local disk; optionally backed up to object storage.
Replication RPC layer
Leader → follower RPCs: AppendEntries (replicate), RequestVote (election), InstallSnapshot (catch up far-behind followers). Typically gRPC.
Client-facing API
External interface (REST / gRPC / Thrift) that maps client operations to Raft proposals. Routes to leader; redirects clients on leader change.
Lease manager
Implements leader leases for fast linearizable reads. Leader holds a lease; while valid, it can serve reads without quorum round-trip.
Watch service
Tracks client subscriptions to key prefixes. On state machine apply, fires notifications. Important for coordination patterns (config change, leader election, lock).
Membership manager
Handles cluster reconfiguration (add / remove node) safely via joint consensus or single-node changes. Membership changes are themselves Raft-replicated entries.
Deep dives
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
1. Leader election: terms, votes, and split-vote
Raft's first job: elect exactly one leader at a time, even with arbitrary failures.
Roles
At any moment, each node is in one of three roles: follower (passive, applies entries from leader), candidate (running for election), or leader (replicating entries).
Term
A monotonically increasing integer. Each new election bumps the term. Stale leaders (with older terms) detect this when they hear from a higher-term node and step down.
Election triggering
A follower has an election timeout (randomized between 150-300ms typically). If it doesn't hear from the leader within the timeout, it becomes candidate, increments its term, votes for itself, and requests votes from peers.
Voting rules
A node votes for at most one candidate per term. Candidate wins if it gets a majority. Vote granted only if:
- Candidate's term >= voter's term.
- Voter hasn't already voted this term.
- Candidate's log is at least as up-to-date as voter's (specific rule: highest term in last entry, then highest index).
The "up-to-date" rule prevents a candidate with stale log from winning - a critical safety property.
Split-vote
Multiple candidates start elections simultaneously, none gets a majority. All time out, increment term, retry. Without randomization, this could loop forever.
The fix: randomized election timeouts (150-300ms range). One node will time out first, become candidate, and most likely win before others start. If split happens, it usually resolves in 1-2 rounds.
Election latency
- Best case: leader fails, one follower times out at ~150ms, wins immediately. Total ~150-200ms.
- Worst case (split-vote, 2-3 rounds): ~500ms-1.5s.
This is the unavailability window for writes during leader failure. Tune timeouts based on requirements:
- Lower timeout: faster failover, more false positives (network blip → election).
- Higher timeout: stable, but slow recovery.
Pre-vote optimization
A node about to start an election first asks "would you vote for me?" without incrementing term. If it can't get majority, it stays follower. Avoids unnecessary term bumps when the network is partitioned.
Leader transfer
For planned operations (rolling restart, drain a node), the current leader can voluntarily transfer leadership: nominate a follower, ensure it's caught up, send TimeoutNow. The follower starts an election immediately and wins. Faster and cleaner than waiting for timeout.
2. Log replication and the commit rule
Once leader is elected, replicating writes is mostly mechanical - but the safety invariants are subtle.
AppendEntries
Leader receives client write. Appends to its log. Sends AppendEntries RPC to followers with the new entry plus a "previous index/term" hint.
Each follower checks: does my log have an entry at the previous index with that term? If yes, append. If no, reject (log inconsistency).
On rejection, leader decrements the next-index for that follower and retries. The protocol marches the leader's view of the follower's log back until it finds a matching point, then catches up forward.
Commit rule
An entry is committed when:
- It's been replicated to a majority of replicas (including leader).
- The leader's term equals the entry's term (i.e., entry was created in current term).
The second condition is critical. It prevents a subtle bug: a leader from term T1 replicates entry at index I but doesn't commit before crash. New leader from term T2 may overwrite it. Even if T2 sees the entry replicated to a majority, it can't commit it just on majority alone - it must commit something from term T2 first, which transitively commits the older entry.
Why the second condition matters (Figure 8 in the Raft paper)
Without it, a leader could crash, a new leader from a later term could overwrite a committed entry. Real systems hit this. The condition is non-negotiable.
Heartbeats
The leader sends empty AppendEntries periodically (every 50-100ms typically) to maintain its leadership and prevent followers from starting elections.
Pipelining
Naive: leader sends entry I, waits for ack, sends I+1. High-latency.
Optimized: leader sends I, I+1, I+2 in flight. Acks come back asynchronously. Throughput limited by network bandwidth, not RTT.
Modern Raft implementations (etcd-raft, hashicorp/raft, tikv/raft) all pipeline.
Batching
Group N small writes into one AppendEntries RPC. Reduces fsync overhead (one fsync per batch, not per entry). Increases throughput dramatically (10x+).
Trade-off: latency rises with batch size. Tunable.
Flow control
Leader tracks each follower's match-index (highest replicated). If a follower lags, leader sends only a window of entries; doesn't drown it.
If a follower is far behind (out of log retention), leader sends InstallSnapshot instead of replaying the whole log.
3. Snapshots and log compaction
The log can't grow forever. Snapshots compact old entries.
The snapshot mechanism
Each replica periodically (every N entries or T minutes) writes a snapshot of its state machine to disk. The snapshot includes the last applied index and term.
Once snapshotted, log entries below the snapshot index can be discarded. Disk usage stays bounded.
Snapshot installation
A new node joining (or one too far behind to catch up via log) receives the leader's snapshot via InstallSnapshot RPC. It loads the snapshot, then applies any subsequent log entries.
InstallSnapshot is chunked (snapshots can be GBs). Each chunk acked. The receiver writes to a temp file; on completion, atomically swaps in.
Snapshot frequency vs replay cost
- Frequent snapshots (every 1K entries): smaller log retention, more snapshot overhead.
- Infrequent (every 1M entries): larger log, longer cold-start replay.
Default in etcd: 100K entries. Tune based on entry size and crash-recovery time SLA.
Streaming snapshots vs blocking
Naive: pause the state machine, write snapshot, resume. Blocks writes during snapshot.
Modern: copy-on-write or fork() to take a consistent point-in-time view; serialize asynchronously while writes continue. Standard in production implementations.
Snapshot to object storage
For backup / DR, periodically push snapshots to S3. Restore: bootstrap a new cluster from snapshot. RPO = snapshot interval.
Compaction lag
If a follower is taking longer than expected to apply, leader's log can't be compacted past the slowest follower's match-index. Slow followers prevent compaction; if persistent, kick them out and let them rejoin via snapshot install.
4. Linearizable reads: leader leases and the read index
Linearizable reads are the read-side equivalent of consensus on writes. Done naively, every read needs a quorum round-trip. Optimizations cut this dramatically.
The naive approach: quorum read
For a linearizable read, the leader exchanges a heartbeat with a majority of followers, then serves the read from its state. The heartbeat confirms the leader is still legitimate (no newer leader has taken over).
Cost: 1 round-trip per read. Acceptable for low-throughput coordination services; expensive at high read rates.
Read index optimization (Raft paper §6.4)
- Leader records its current commit-index as read-index.
- Leader confirms its leadership by exchanging heartbeats with majority.
- Once leader's apply-index >= read-index, serve the read from local state.
Why this is linearizable: any read served sees all writes committed before the read started. The heartbeat round prevents reading from a deposed leader.
Cost: 1 round-trip but batchable. Many reads can share one heartbeat round.
Leader leases
The leader holds a lease that's valid for a time window (~10s typically). While the lease is valid, the leader is guaranteed to be the unique leader (no other node could have won an election). Reads can be served without any round-trip.
Lease invariant: leases are granted by followers as part of voting. Followers won't vote for a new leader until their grant of the lease expires. As long as the leader's clock isn't ahead of follower clocks by more than the lease, the lease is safe.
Clock skew danger
Leases assume bounded clock skew. If clocks skew by more than the lease window, two leaders could believe they hold the lease simultaneously. NTP is critical; cloud VMs typically skew <1ms in practice.
Some implementations use a hybrid (lease + occasional verification heartbeat) for safety margin.
Stale read trade-offs
Some clients accept stale reads (read-your-writes is enough). They can read directly from any replica without leader involvement. Most consensus systems offer both APIs - linearizable (slow) and stale (fast).
Read scaling
With leader leases, reads scale to the leader's CPU/memory ceiling - a single Go process can serve 100K+ linearizable reads/sec. Beyond that, follower reads (lease-based on followers, more complex) extend the ceiling.
5. Membership changes and the joint consensus problem
Adding or removing nodes safely is harder than it looks.
The naive approach (broken)
Decree the new membership, switch over. Problem: during the switchover, a partition could create two majorities - one from old config, one from new config - electing two leaders simultaneously.
Joint consensus (Raft paper §6)
Phase 1: leader proposes a "joint" configuration that includes both old and new members. While in joint, decisions require majorities from BOTH old AND new sets - so no single config can act unilaterally.
Phase 2: once joint config is committed, leader proposes the new (final) config. Once committed, joint is retired. Old-only members can be removed.
Two configuration changes; safe under any partition pattern.
Single-node changes (alternative)
Change membership one node at a time. With odd cluster sizes, a single add/remove can't create two disjoint majorities (the majority overlaps).
Simpler than joint; supported by etcd's raft library. Most production systems use this for routine ops.
Pitfall: the new node hasn't caught up
If you add node N6 to a 5-node cluster, and N6 has empty log, you've added a non-functional voter. If two old nodes fail before N6 catches up, the cluster loses majority.
Solution: add as a non-voter (learner) first. N6 catches up via InstallSnapshot + log replication. Once caught up, promote to voter via a second config change.
etcd, TiKV, CockroachDB all use the learner pattern.
Removing the leader
The leader can step down by transferring leadership to another node first, then removing itself. Cleaner than letting it crash mid-removal.
Cluster too small
2-node clusters offer no fault tolerance (1 failure = no majority). 1-node "clusters" are not consensus, just a single point of failure dressed up.
3 is the minimum for production. 5 is the standard for important services. Beyond 7, the throughput cost (more replicas to ack) usually exceeds the marginal availability gain.
6. When consensus is the wrong tool
Consensus is expensive. Use it where you need it; avoid it where you don't.
Where you need consensus
- Service discovery / config (etcd, Consul, ZooKeeper).
- Distributed locks / leader election.
- Metadata for distributed databases (Spanner, TiDB control plane).
- Small, frequently-read, infrequently-written state.
- The control plane of distributed systems generally.
Where consensus is the wrong tool
- High-throughput data stores. Consensus per write doesn't scale to 1M writes/sec. Use eventual consistency + CRDTs (Cassandra), per-shard consensus (Spanner), or async replication (Postgres + replicas).
- Large state. Consensus replicates the full log to every node. 10TB state via Raft is painful; use sharded consensus (each shard is its own small Raft group).
- Cross-region writes. Inter-region consensus latency is 50-200ms per write. Most apps can't tolerate this. Use eventual consistency cross-region; consensus within region.
The CAP framing in practice
CAP says: during partition, choose consistency or availability. Consensus systems choose consistency - the minority side stops accepting writes. The majority side continues.
For etcd / control planes: this is correct. A split-brain configuration system corrupts everything downstream.
For user-facing systems: availability often wins. Cassandra serves both sides of a partition, reconciles later. The trade-off is application-specific.
Sharded consensus: the production answer for scale
Spanner and CockroachDB shard data by key range. Each shard has its own 3-5 node Raft group. 1000 shards = 1000 independent consensus groups. Aggregate throughput scales with shards.
Per-shard ops are linearizable (within shard). Cross-shard ops use 2PC over Raft groups for ACID transactions. Complex but achievable.
The "we just need a leader" pattern
Many use cases need consensus only to elect a single coordinator (job scheduler, primary among replicas). Once elected, the coordinator does the work without per-op consensus.
etcd / ZooKeeper provide this primitive. Almost every distributed system uses one of them for leader election rather than implementing Raft directly.
Don't roll your own
Implementing Raft correctly is famously hard. Use etcd-raft, hashicorp/raft, or atomix/copycat as a library, or run etcd / ZooKeeper / Consul as a service. The bugs in custom implementations are usually safety bugs - silent data loss under specific failure timings.
Trade-offs
Raft vs Paxos
Raft is designed for understandability; Paxos is the historical original. Raft is the de-facto standard for new systems. Multi-Paxos is roughly equivalent in capability; the choice is rarely Paxos for new code.
Cluster size 3 vs 5 vs 7
3: tolerates 1 failure, fastest writes. 5: tolerates 2, standard for production. 7: tolerates 3, slower writes; rare. Beyond 7, throughput cost usually exceeds marginal availability.
Same-AZ vs cross-AZ vs cross-region
Same-AZ: lowest latency, no AZ fault tolerance. Cross-AZ: standard. Cross-region: high latency (50-200ms per write); use only when DR justifies it.
Linearizable reads via lease vs quorum
Lease: zero round-trip, depends on bounded clock skew. Quorum: one round-trip, no clock assumption. Lease is the standard; quorum is the safety fallback.
Strong consistency everywhere vs sharded consensus
Single-cluster strong consistency caps at the leader's CPU. Sharded consensus scales to thousands of independent groups but requires shard management and cross-shard coordination for atomic ops.
Synchronous vs asynchronous replication
Synchronous (consensus): durability before ack, higher latency. Asynchronous (Postgres followers): low write latency, possible data loss on primary failure. Pick by SLA.
Large state per replica vs sharded state
Large (full state on every replica) is simple but caps total state at single-node memory/disk. Sharded distributes state but adds management complexity.
Self-host etcd vs managed
Managed (AWS / Azure provide etcd-class services in some products) is simpler but more expensive and less flexible. Self-host etcd is the standard; ops burden is moderate (mostly forgetting it exists if sized correctly).
Watches vs polling
Watches: low latency, server resources scale with subscriber count. Polling: no server-side state, high latency. Watches are the standard for coordination primitives.
Common follow-up questions
Be ready for at least three of these. The first one is almost always asked.
- ?How would you migrate from a 3-node Raft cluster to a 5-node cluster with no downtime?
- ?What changes if you need cross-region active-active with bounded staleness?
- ?How do you debug a cluster stuck in repeated leader elections?
- ?What's your strategy when one replica's disk is consistently slower than others?
- ?How would you implement a distributed lock on top of etcd?
- ?What's the trade-off between Raft and a CRDT for your config service?
- ?How do you handle a network partition that creates two minority groups (no majority)?
- ?How would you bootstrap a fresh cluster from an old snapshot in a DR scenario?
Related system design topics
Companies that test this topic
Practice in interview format
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
Start an AI mock interview →