Design a Distributed Message Queue (Kafka deep-dive)

Partitions, consumer groups, replication, retention, and the exactly-once myth - the implementation details Kafka users gloss over until they don't.

The problem

Design a distributed message queue that ingests millions of messages per second, durably stores them across machine and data-center failures, lets independent consumer groups read at their own pace, supports replay, and provides ordering guarantees within a partition.

This is "implement Kafka" framed as a system design question. Strong candidates separate the on-disk log structure, the replication protocol, the consumer offset model, and the operational concerns (rebalances, partition rebalancing, broker failures). Weak candidates draw a queue and stop. The signal is depth - everyone knows what Kafka does; few know how.

Clarifying questions

Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.

  • What's the use case mix - streaming (high throughput, replay) or task queueing (low latency, ack-based)?
  • Throughput target - millions of messages/sec aggregate?
  • Durability requirements - acceptable loss on a single broker failure? On a full data-center failure?
  • Retention - hours, days, weeks? Time-based or size-based?
  • Exactly-once needed, or is at-least-once + idempotent consumers acceptable?
  • Multi-tenant - one cluster serving many teams, or per-team clusters?
  • Latency target - p99 producer ack < 10ms, < 100ms, or doesn't matter?
  • Cross-region replication - active-active, active-passive, or single-region?

Requirements

Functional requirements

  • Producers append messages to topics (named streams)
  • Topics are partitioned for parallelism
  • Consumer groups read from topics; each partition consumed by one consumer in a group
  • Replay from arbitrary offset (or timestamp)
  • Replication for durability across N brokers
  • Per-partition ordering guarantee
  • Configurable retention (time-based or size-based)
  • Log compaction (keep latest value per key) for state-snapshot use cases

Non-functional requirements

Scale
5M messages/sec ingest, 50M messages/sec consume (10x fanout). 1KB avg message. 10K topics. 1M partitions cluster-wide. 200 brokers per cluster. PB/day write volume.
Latency
Producer p99 < 10ms with acks=1, < 50ms with acks=all. End-to-end producer → consumer p99 < 100ms in the same region.
Availability
99.99% for producer accept (durable persistence). Brief consumer pauses during rebalances are acceptable. No data loss on single broker failure with proper replication.
Consistency
Per-partition total order. Across partitions, no global ordering guarantee. At-least-once delivery default. Exactly-once via transactions for use cases that need it (Kafka Streams).

Capacity estimation

Storage

  • Ingest: 5M msg/sec × 1KB = 5 GB/sec = 432 TB/day raw.
  • Replication factor 3: 1.3 PB/day on disk across the fleet.
  • Retention 7 days: ~9 PB hot storage. Tiered storage (S3) for older.
  • Per broker: 200 brokers × ~50 TB SSD = 10 PB total - fits with tiering.

Throughput

  • 5M msg/sec ingest. At ~1KB avg, that's 5 GB/s write bandwidth.
  • 200 brokers → 25K msg/sec/broker average. With replication, each message is written 3x → 75K writes/sec/broker. Sequential append, easy on commodity NVMe.
  • 50M msg/sec consume = 50 GB/s read. Page cache absorbs most reads. Disk reads dominated by replay / lagging consumers.

Network

  • Inter-broker replication: with RF=3, each message is fetched by two followers - 2x ingest = 10 GB/s steady within the cluster.
  • Producer ingress: 5 GB/s.
  • Consumer egress: 50 GB/s.
  • Cross-AZ traffic dominates the bill. Plan for $1-3M/year on a cluster this size.

Memory

  • Page cache is the most important resource. Brokers run with most of their RAM unallocated to JVM, leaving the OS to cache the recent log tail.
  • Consumer groups state: ~1M partitions × 128 bytes = 128 MB. Trivial.

Connections

  • Producers: ~10K active. Each holds a long-lived TCP connection to multiple brokers.
  • Consumers: ~100K. Long-lived connections.
  • Per broker: ~1K-5K connections. Manageable.

High-level architecture

The cluster has three roles: brokers (store and serve the log), controllers (cluster-state coordinator), and clients (producers and consumers).

A topic is partitioned for parallelism. Each partition is a log file (actually a sequence of segment files) on a leader broker, replicated to N-1 follower brokers. Producers send to the leader; followers fetch and append. Consumers read from the leader by default (or from followers in newer versions for locality).

The consumer group is the parallelism unit on the read side: a group of N consumers shares the work of reading a topic, with each partition assigned to exactly one consumer in the group. Offsets are tracked per (group, partition) on the cluster itself.

The defining performance trick: append-only log on disk + page cache + zero-copy from disk to socket = throughput limited by network, not CPU.
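
The zero-copy half of that trick is sendfile(2), surfaced in Java as FileChannel.transferTo. A minimal sketch of the serve path - not Kafka's actual network layer; the class, path, and offsets are illustrative:

  import java.io.IOException;
  import java.io.RandomAccessFile;
  import java.nio.channels.FileChannel;
  import java.nio.channels.SocketChannel;

  // Serve a slice of a log segment to a consumer socket without copying
  // bytes through user space: transferTo maps to sendfile(2) on Linux, so
  // data flows disk -> page cache -> NIC with no JVM-heap copies.
  class SegmentServer {
      static void serve(String segmentPath, SocketChannel socket,
                        long startByte, long length) throws IOException {
          try (FileChannel segment =
                   new RandomAccessFile(segmentPath, "r").getChannel()) {
              long sent = 0;
              while (sent < length) {
                  // transferTo may send fewer bytes than asked; loop until done
                  sent += segment.transferTo(startByte + sent, length - sent, socket);
              }
          }
      }
  }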

Broker

Stores partition replicas. Handles producer appends and consumer fetches. Participates in replication protocol with other brokers.

Controller

One broker elected as the cluster coordinator. Manages partition leadership assignments, ISR membership, topic creation/deletion. In KRaft mode, controllers run on a separate quorum.

Metadata store (Raft / ZooKeeper)

Holds cluster state: brokers, topics, partitions, leaders, ISR. KRaft (Kafka Raft) replaces ZooKeeper. Strong consistency required.

Partition (log)

Append-only log on disk, split into segment files (typically 1GB each). Index files map offsets to byte positions. Old segments deleted or compacted per retention policy.

Producer

Client library. Batches records. Computes partition (key hash or round-robin). Sends to partition leader. Handles retries and idempotency keys.

Consumer

Client library. Joins a consumer group. Pulls records from assigned partitions. Commits offsets. Handles rebalances.

Group coordinator

A broker designated per consumer group. Tracks membership, assigns partitions, stores committed offsets in an internal __consumer_offsets topic.

Replication protocol

Followers fetch from leader using a long-poll protocol. Leader tracks high-watermark = highest offset replicated to all in-sync replicas. Consumers only see records below the high-watermark.
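
A toy sketch of high-watermark advancement, assuming the leader tracks each ISR member's log-end offset (LEO) - names are illustrative, not Kafka internals:

  import java.util.Collection;

  // The high-watermark (HW) is the minimum LEO across the ISR. Only records
  // below the HW are exposed, so anything a consumer sees is already on
  // every in-sync replica.
  class HighWatermarkTracker {
      private long highWatermark = 0;

      // Called when a follower fetch confirms its new LEO or the leader appends.
      void onLeoUpdate(Collection<Long> isrLogEndOffsets) {
          long minLeo = isrLogEndOffsets.stream()
                  .mapToLong(Long::longValue).min().orElse(highWatermark);
          highWatermark = Math.max(highWatermark, minLeo); // HW never moves backwards
      }

      long get() { return highWatermark; }
  }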

Tiered storage

Old segments offloaded to S3 / object storage. Brokers serve them on demand from cold tier. Cuts hot storage cost 5-10x for long-retention topics.

Deep dives

The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.

1. Partitions: the unit of parallelism, ordering, and pain

Partitions are the central abstraction. Get them right and the system scales; get them wrong and you can't fix it cheaply.

What partitions buy you

  • Parallelism: N partitions can be consumed by up to N consumers simultaneously.
  • Ordering: total order is preserved within a partition.
  • Distribution: partitions are spread across brokers, balancing load across the cluster.

Partition count is not free to change
Adding partitions is allowed but breaks key-based partitioning - existing keys may now hash to different partitions, breaking per-key ordering. Removing partitions is not supported online.

Rule of thumb: provision 2-5x more partitions than current peak consumer parallelism. Most clusters live with 20-50 partitions per topic for moderate-volume topics, hundreds for high-volume.

Partition key choice

  • By key (e.g., user_id): per-key ordering preserved; a hot key creates a hot partition (see the key-hash sketch after this list).
  • Round-robin: perfect load balance, no per-key ordering.
  • Custom partitioner: domain-specific (e.g., partition by region for locality).
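
The key-hash path, sketched against Kafka's Java Partitioner interface (Utils.murmur2 is what the default partitioner uses; the null-key branch is simplified - real clients batch to a "sticky" partition there):

  import java.util.Map;
  import java.util.concurrent.ThreadLocalRandom;
  import org.apache.kafka.clients.producer.Partitioner;
  import org.apache.kafka.common.Cluster;
  import org.apache.kafka.common.utils.Utils;

  // Same key -> same partition, so per-key ordering holds as long as the
  // partition count never changes.
  public class KeyHashPartitioner implements Partitioner {
      @Override
      public int partition(String topic, Object key, byte[] keyBytes,
                           Object value, byte[] valueBytes, Cluster cluster) {
          int numPartitions = cluster.partitionsForTopic(topic).size();
          if (keyBytes == null) {
              return ThreadLocalRandom.current().nextInt(numPartitions); // simplified
          }
          return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
      }

      @Override public void configure(Map<String, ?> configs) {}
      @Override public void close() {}
  }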

Hot partition mitigation
A celebrity user dominates a partition. Mitigations:

  • Sub-key the partition (user_id plus a small salt - see the sketch after this list); accept that per-key ordering only holds within each sub-key.
  • Salt hot keys to spread load; dedupe or re-aggregate downstream where ordering is needed.
  • Promote known-hot keys to a dedicated topic with more partitions.
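
A minimal sketch of the sub-key idea - the fan-out of 8 is an arbitrary assumption, and per-user ordering now only holds within each (user, salt) pair:

  import java.util.concurrent.ThreadLocalRandom;

  // Spread a hot key across K salted sub-keys. Each salted key hashes to
  // its own partition, so one celebrity user no longer pins one partition.
  class SaltedKey {
      static final int SALTS = 8; // assumed fan-out for known-hot keys

      static String of(String userId) {
          return userId + "#" + ThreadLocalRandom.current().nextInt(SALTS);
      }
  }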

Per-broker partition limit
A broker holds 1000-4000 partitions in practice. Beyond that, metadata pressure on ZooKeeper / KRaft and leader-election storms on broker restart become problems. A 200-broker cluster can hold ~500K-800K partitions safely.

Partition rebalancing
Adding/removing brokers triggers partition movement. The controller plans new assignments; data is replicated to the new leader; old replica is deleted. Reassignment is bandwidth-intensive (TBs flowing internally) and can be throttled to avoid impacting clients.

The partition count anti-pattern
"More partitions = better parallelism" - true to a point. Each partition has overhead: file handles, replication threads, controller metadata. 100K-partition topics are operationally painful. Pick the count for actual peak parallelism + a safety margin, not "as many as possible".

2. Replication and the in-sync replica set (ISR)

Durability comes from replicating each partition across N brokers. The ISR mechanic decides when a write is "committed".

Leader and followers
Each partition has one leader and N-1 followers. Producers write to the leader. Followers fetch from the leader continuously. Followers that have fetched up to the leader's tail recently are "in sync".

The ISR set
The set of replicas (including leader) currently caught up. Configurable lag threshold (replica.lag.time.max.ms, default 30s). A follower that falls behind is removed from ISR.

The high-watermark
The highest offset present in all ISR members. Consumers only see records up to the high-watermark. This is the durability guarantee: a consumer never sees a record that isn't replicated to all ISR members.

Producer ack levels

  • acks=0: fire and forget. No ack from leader. Lose data on any failure. Almost never use.
  • acks=1: leader writes to its log, acks. Lose data if leader fails before replicating. Lower latency.
  • acks=all: leader waits until all ISR members have written. No data loss as long as ISR has >1 member. Higher latency.

The min.insync.replicas knob
Combined with acks=all, this guarantees the data is on at least M brokers before ack. Standard config: replication factor 3, min.insync.replicas=2. Tolerates one broker loss without data loss.

If ISR drops below min.insync.replicas, the leader rejects writes (fails-closed - prefer unavailability over data loss).
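
The standard recipe, as producer-side configuration - a sketch with a placeholder broker address; the property keys are real Kafka configs, and the topic-side settings are noted in a comment:

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerConfig;
  import org.apache.kafka.common.serialization.StringSerializer;

  class DurableProducer {
      static KafkaProducer<String, String> build() {
          Properties props = new Properties();
          props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
          props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
          props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
          // Durability: ack only after every ISR member has the record,
          // and dedupe retries broker-side.
          props.put(ProducerConfig.ACKS_CONFIG, "all");
          props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
          // Topic side, set at creation: replication.factor=3, min.insync.replicas=2.
          return new KafkaProducer<>(props);
      }
  }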

Leader election
Leader fails. Controller picks a new leader from the ISR. Strict election: only ISR members eligible. If ISR is empty, choices:

  • unclean.leader.election.enable=false (default modern): wait for an ISR member to come back. Brief unavailability, no data loss.
  • unclean.leader.election.enable=true: pick any replica, even out-of-sync. Available immediately, but accepted writes that weren't replicated to this replica are silently lost. Very dangerous - only enable if availability >> durability.

Cross-AZ / cross-region replication

  • Cross-AZ: replicas placed across AZs by rack-aware assignment. Survives AZ outage. Inter-AZ traffic is the cost.
  • Cross-region: separate cluster + MirrorMaker / Kafka Connect / Confluent Cluster Linking replication. Async. Recovery point objective (RPO) is replication lag.

Replication failure modes

  • Slow follower: removed from ISR, no impact on writes (still meet acks=all).
  • Network partition: leader on one side, some followers on the other. Followers removed from ISR. ISR may shrink to just the leader; if min.insync.replicas not met, writes block.
  • Leader pause (GC): brief unavailability. The controller may not notice in time to elect a new leader before the pause ends; producers retry and succeed.

3. Consumer groups, offsets, and rebalancing

The consumer group model is elegant on paper and operationally painful in practice. Understanding rebalances is the difference between Kafka and Kafka-as-incident.

Group semantics
A consumer group is a logical reader. The group name identifies it. Each partition is assigned to exactly one consumer in the group. If the group has N consumers and the topic has P partitions:

  • N <= P: each consumer reads roughly P/N partitions. All consumers active.
  • N > P: extra consumers idle. Wasted capacity.

Multiple groups can read the same topic independently - each gets its own offset.

Offset storage
Committed offsets stored in __consumer_offsets, an internal compacted topic. Each commit writes (group, topic, partition) → offset.

Compaction keeps only the latest offset per (group, topic, partition). The offset topic stays bounded.

Commit strategies

  • Auto-commit (every N ms): simple, but a crash after processing and before the next commit means duplicate work on restart, and an auto-commit that fires before processing finishes means lost records.
  • Manual sync commit: process, then commit synchronously. Slower but accurate.
  • Manual async commit: process, fire-and-forget commit. Risk of out-of-order commits if not careful.

The right pattern: process, then sync-commit periodically (every N records or T seconds). Combine with idempotent processing.
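
A sketch of that pattern with the Java consumer - broker address, group, topic, and the process() handler are placeholders, and the handler must be idempotent because replays happen:

  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.serialization.StringDeserializer;

  class ProcessThenCommit {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
          props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-workers");       // placeholder
          props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");       // manual commits
          props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
          props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
              consumer.subscribe(List.of("payments")); // placeholder topic
              while (true) {
                  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                  for (ConsumerRecord<String, String> record : records) {
                      process(record); // placeholder handler; must be idempotent
                  }
                  if (!records.isEmpty()) {
                      consumer.commitSync(); // commit only after the whole batch is processed
                  }
              }
          }
      }

      static void process(ConsumerRecord<String, String> record) { /* placeholder */ }
  }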

Rebalance triggers

  • Consumer joins / leaves the group.
  • Consumer heartbeat times out.
  • Partition count changes for a subscribed topic.
  • Subscription changes.

During a rebalance, all consumers in the group pause - the "stop the world" problem.

Rebalance protocols

  • Eager (legacy): all consumers revoke all partitions, then rejoin and get new assignments. Pause = full group pause.
  • Cooperative (modern, KIP-429): only affected partitions are revoked. Other consumers keep working. Pause = single-partition impact.

Modern Kafka clients use cooperative rebalancing. Always upgrade.

Static group membership (KIP-345)
Each consumer has a stable group.instance.id. A restart within session.timeout.ms doesn't trigger a rebalance - the consumer rejoins with the same identity. Reduces rebalance churn from rolling deploys dramatically.

The session timeout knob
session.timeout.ms (default 45s). Heartbeat must arrive within this window or the consumer is presumed dead. Too short: false positives during GC pauses. Too long: dead consumers hold partitions.

Consumer lag
The KPI. lag = log_end_offset - committed_offset. For real-time use cases, alarm when lag exceeds roughly one minute of production. Growing lag means production rate exceeds consumption rate; scale up consumers (up to the partition count).
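
Lag can be watched from outside the group with the admin client - a sketch with placeholder group and broker names:

  import java.util.Map;
  import java.util.Properties;
  import java.util.stream.Collectors;
  import org.apache.kafka.clients.admin.Admin;
  import org.apache.kafka.clients.admin.AdminClientConfig;
  import org.apache.kafka.clients.admin.ListOffsetsResult;
  import org.apache.kafka.clients.admin.OffsetSpec;
  import org.apache.kafka.clients.consumer.OffsetAndMetadata;
  import org.apache.kafka.common.TopicPartition;

  class LagMonitor {
      // lag = log-end offset - committed offset, per partition
      static void printLag(String groupId) throws Exception {
          Properties props = new Properties();
          props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
          try (Admin admin = Admin.create(props)) {
              Map<TopicPartition, OffsetAndMetadata> committed = admin
                      .listConsumerGroupOffsets(groupId)
                      .partitionsToOffsetAndMetadata().get();
              Map<TopicPartition, OffsetSpec> latest = committed.keySet().stream()
                      .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
              Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                      admin.listOffsets(latest).all().get();
              committed.forEach((tp, om) -> System.out.printf(
                      "%s lag=%d%n", tp, ends.get(tp).offset() - om.offset()));
          }
      }
  }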

The "consumers stuck" pattern
A poison-pill record (one the consumer crashes on) blocks the partition. Consumer crashes, restarts, hits same record, crashes again. Detection: lag growing, consumer logs show repeated crashes. Fix: skip the record (advance offset manually) and route to a DLQ for offline investigation.
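
A sketch of the skip-and-DLQ move, slotting into a poll loop like the one above; the DLQ topic name and handler are placeholders, and dlqProducer is an ordinary producer built separately:

  // On a record the handler cannot process, park it on a dead-letter topic
  // and advance past it instead of crash-looping. Iterating per partition
  // lets the seek() rewind only the poisoned partition.
  for (TopicPartition tp : records.partitions()) {
      for (ConsumerRecord<String, String> record : records.records(tp)) {
          try {
              process(record); // placeholder handler
          } catch (Exception poisonPill) {
              dlqProducer.send(new ProducerRecord<>("payments.dlq",
                      record.key(), record.value()));
              consumer.seek(tp, record.offset() + 1); // skip the bad record
              consumer.commitSync(Map.of(tp, new OffsetAndMetadata(record.offset() + 1)));
              break; // rest of this partition's batch is re-fetched after the seek
          }
      }
  }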

4. Exactly-once semantics: where it works and where it doesn't

Kafka's exactly-once is real, but narrower than the marketing implies.

Where it works (read-process-write within Kafka)
Producer emits to topic A. Consumer reads from A, transforms, writes to topic B. Consumer commits its A-offset.

The two writes (to B and to __consumer_offsets) happen in a Kafka transaction. Either both succeed or neither does. No partial state visible to other consumers.

This works because the consumer's "side effect" is itself a Kafka write. Both can be transacted together.

The transactional producer

  • Producer initializes with a transactional.id (stable across restarts).
  • producer.beginTransaction()
  • producer.send(records to B) - the records are appended but remain pending, invisible to consumers reading with isolation.level=read_committed.
  • producer.sendOffsetsToTransaction(offsets, group) - includes the commit-offset write in the same transaction.
  • producer.commitTransaction() - atomically marks both writes visible.
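
A sketch of the whole loop - transform() is a placeholder, the producer is assumed to be configured with a stable transactional.id, and the consumer with enable.auto.commit=false:

  import java.time.Duration;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.clients.consumer.OffsetAndMetadata;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import org.apache.kafka.common.TopicPartition;

  class ReadProcessWrite {
      static void pipe(KafkaConsumer<String, String> consumer,
                       KafkaProducer<String, String> producer) {
          producer.initTransactions(); // fences zombies with the same transactional.id
          while (true) {
              ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
              if (records.isEmpty()) continue;
              producer.beginTransaction();
              try {
                  Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                  for (ConsumerRecord<String, String> r : records) {
                      producer.send(new ProducerRecord<>("topic-B", r.key(), transform(r.value())));
                      offsets.put(new TopicPartition(r.topic(), r.partition()),
                                  new OffsetAndMetadata(r.offset() + 1));
                  }
                  // The offset commit rides in the same transaction as the writes to B.
                  producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                  producer.commitTransaction(); // both become visible atomically...
              } catch (Exception e) {
                  producer.abortTransaction(); // ...or neither lands; in production,
                  // also rewind the consumer to its last committed offsets before retrying.
              }
          }
      }

      static String transform(String value) { return value; } // placeholder
  }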

Idempotent producer (KIP-98)
Even without transactions, producers can be idempotent. Each record has a sequence number per (producer_id, partition). Broker rejects duplicates with the same sequence. Eliminates network-retry-induced duplicates.

Always enable enable.idempotence=true (the default in modern clients). No downside.

Where exactly-once fails: external side effects
Consumer reads from Kafka, calls an external HTTP API. Now you have two write operations in different systems. No global transaction.

Approaches:

  • Outbox pattern: write the intent to Kafka first; a downstream process reads it and calls the API with an idempotency key. Pushes the dedup to the API.
  • Idempotency key on the API: consumer calls API with a deterministic key. API ignores duplicates. Works when the API supports it.
  • Two-phase commit (rare): heavy, fragile, mostly avoided.

The honest contract
Kafka offers: at-least-once delivery, idempotent producers, transactional writes within Kafka. With careful patterns, you build effectively-once for end-to-end pipelines. "Exactly once" as a protocol guarantee across systems doesn't exist.

Consumer isolation level

  • read_uncommitted (default): see all records, including those from in-flight transactions that may later abort. Lower latency.
  • read_committed: only see committed transactional writes. Slightly higher latency (waits for transaction commit). Required for exactly-once consuming.

Performance cost
Transactions add ~5-10ms producer latency (extra coordinator round-trips). Idempotence adds ~1-2ms. Both have negligible throughput impact when batched.

5. Retention, log compaction, and tiered storage

The on-disk log isn't infinite. Three policies decide what stays.

Time-based retention
log.retention.hours (or .ms). Default 168 (7 days). Segments older than this are deleted.

Use case: event logs, where old data has no value. Most topics.

Size-based retention
log.retention.bytes per partition. When exceeded, oldest segments deleted.

Use case: bounded-storage topics where time-based retention would blow disk on bursty topics.

Combined retention: deletion happens when either limit is exceeded.
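
A topic-creation sketch with both limits via the admin client - name, counts, and bounds are placeholder choices; the config keys are real:

  import java.util.List;
  import java.util.Map;
  import java.util.Properties;
  import org.apache.kafka.clients.admin.Admin;
  import org.apache.kafka.clients.admin.AdminClientConfig;
  import org.apache.kafka.clients.admin.NewTopic;

  class RetentionTopic {
      static void create() throws Exception {
          Properties props = new Properties();
          props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
          try (Admin admin = Admin.create(props)) {
              NewTopic clickstream = new NewTopic("clickstream", 50, (short) 3) // placeholders
                      .configs(Map.of(
                              "cleanup.policy", "delete",
                              "retention.ms", String.valueOf(7L * 24 * 3600 * 1000), // 7 days
                              "retention.bytes", String.valueOf(100L << 30)));       // ~100 GB/partition
              admin.createTopics(List.of(clickstream)).all().get();
          }
      }
  }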

Log compaction
For state-snapshot topics. Compaction keeps the latest record per key; older records with the same key are deleted.

Use case: __consumer_offsets, change data capture (CDC) streams, configuration topics. Anywhere "the value at this key" matters more than "the history".

How: a background thread scans segments. For each key, finds the latest offset; older records are eligible for deletion. Tombstones (records with null value) signal deletion - kept for delete.retention.ms (default 24h) so consumers can see the delete, then GC'd.
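
Deletes in a compacted topic are expressed as tombstones - a record with a null value; topic and key here are placeholders, reusing a producer like the one built earlier:

  // Publish a tombstone: compaction drops earlier records for this key, and
  // the tombstone itself is GC'd after delete.retention.ms.
  producer.send(new ProducerRecord<>("user-settings", "user-42", null));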

Compaction temporarily uses extra disk while running (up to ~2x the topic's size). Not free.

Mixed retention (compact + delete)
cleanup.policy=compact,delete. Compact within retention window; delete records older than the window. Use case: change-data-capture where snapshot is needed but ancient history isn't.

Tiered storage (KIP-405)
Hot tier: recent segments on broker SSD.
Cold tier: older segments offloaded to S3 / GCS.

When a consumer reads an offset in cold tier, the broker fetches the segment on demand. Slower (object storage latency) but cheap (~$25/TB-month vs SSD).

Cuts hot storage cost 5-10x for long-retention topics. Critical for use cases like compliance retention (years) or replay windows.

The retention-vs-cost trade-off

  • 1 day retention: enough for most operational use. Cheap.
  • 7 days: enough for most replay scenarios.
  • 30 days: enough for backfills of medium-complexity transforms.
  • 90+ days: only with tiered storage; otherwise costs explode.

Compaction lag and the "stale read" problem
Compaction runs on a schedule, not synchronously, and the active segment is never compacted. A recently compacted topic can therefore still hold multiple records for a key. Consumers that need "current value only" must read to the end and dedupe - the topic guarantees eventual convergence to latest-value-per-key, not freshness.

6. Operational concerns: rolling restarts, broker decommission, hot brokers

Running Kafka at scale is mostly an operations problem. The protocol is well-defined; the failure modes are not.

Rolling restart
Replace JVM, kernel, or config on each broker without downtime.

Sequence per broker:

  1. Stop accepting new partition leadership (move leaders off).
  2. Wait for partitions to rejoin ISR after leader move.
  3. Stop the broker. JVM dies cleanly.
  4. Restart with new config.
  5. Wait for full ISR recovery (replicas catch up).
  6. Move some leaders back for balance.
  7. Move to next broker.

Time per broker: 5-30 minutes depending on data size and ISR catchup speed. A 200-broker cluster takes most of a day.

Broker decommission
Replacing a dead broker or downsizing.

  1. Mark broker for decommission (no new leadership).
  2. Trigger partition reassignment to move replicas off.
  3. Reassignment can saturate inter-broker bandwidth - throttle it (the reassignment tooling sets the leader/follower replication throttle rates, e.g. leader.replication.throttled.rate).
  4. Once broker has 0 replicas, decommission cleanly.

A full broker's worth of data movement: 10-50 TB at 100 MB/s = 1-5 days. Plan ahead.

Hot broker problem
Some partitions take more traffic than others. The broker leading the hot partitions becomes the bottleneck.

Detection: per-broker bytes-in metric. The hot one is 5x the median.
Mitigation:

  • Move hot partitions' leaders to other brokers (preferred-leader election, manual reassignment).
  • If a single partition is the hotspot: consider increasing partition count (carefully - breaks key-based ordering).
  • Tune the producer to spread load (avoid sticky-partitioning hotspots).

Disk full
Kafka brokers don't tolerate disk full gracefully. Writes fail; broker may crash.

Prevention:

  • Monitor disk usage. Alarm at 80%.
  • Automate retention adjustments (shorten retention if usage spikes).
  • Use multiple log directories (log.dirs) - if one fills, broker continues on others.

ZooKeeper / KRaft considerations
ZooKeeper era: 3-5 ZK nodes for HA. ZK failures take down the whole Kafka control plane. Operationally separate cluster.

KRaft era: controllers run as a Raft quorum on dedicated nodes (or co-located on small clusters). Eliminates ZooKeeper. Migration from ZK to KRaft is supported in modern Kafka but non-trivial; plan as a multi-month project.

Capacity planning
Provision for:

  • Peak ingest rate × 1.5 (headroom).
  • Replication factor 3 (at least).
  • 30% disk headroom for compaction / reassignment.
  • 2x network bandwidth headroom (inter-broker traffic during rebalance).

Reserve capacity is real money. A right-sized cluster has spare nodes, not maxed-out ones.

Trade-offs

Partition count
More partitions = more parallelism + more overhead. Set based on peak consumer parallelism × 2-3, not "as many as possible".

Replication factor
RF=3 with min.insync.replicas=2 is the standard. Tolerates one broker loss without unavailability or data loss. RF=2 risks data loss; RF=4+ wastes storage.

acks=1 vs acks=all
acks=1 is faster (lower latency, higher throughput); risks data loss on leader failure before replication. acks=all is safer; mandatory for any non-trivial use case.

Pull vs push consumer model
Kafka chose pull. Consumers control rate; back-pressure is natural. Push (RabbitMQ, traditional MQs) gives lower latency but consumers can be overwhelmed.

Compacted vs delete-only retention
Compacted topics hold "latest value per key" - good for state. Delete-only topics hold "history of events" - good for streams. Pick by access pattern.

Kafka vs SQS / SNS
Kafka: high throughput, replay, ordered, durable retention. Operational burden if self-managed. SQS/SNS: managed, simpler, lower throughput ceiling, no replay. Use SQS/SNS for task queueing; Kafka for streams/event sourcing.

KRaft vs ZooKeeper
KRaft is the future. New clusters start on KRaft. Existing ZK clusters migrate when due for major upgrade.

Tiered storage vs longer hot retention
Tiered cuts cost 5-10x for old data with slight read latency for cold. Always enable for retention >7 days at scale.

Self-managed vs MSK / Confluent Cloud
Managed costs 2-3x more but eliminates the ops burden. Crossover point: ~50 brokers or a dedicated 2-engineer Kafka team. Below: managed. Above: self-host with hardened automation.

Single cluster vs per-team clusters
Single: shared infra, lower cost, harder isolation. Per-team: blast radius isolation, higher overhead. Most companies have a few clusters segmented by domain (transactional, analytics, edge), not strictly per-team.

Common follow-up questions

Be ready for at least three of these. The first one is almost always asked.

  • How would you increase partition count on a topic with strict per-key ordering?
  • What changes if you need cross-region active-active replication?
  • How do you handle a broker that's GC-pausing for 10 seconds repeatedly?
  • What's your strategy when consumer lag is growing despite max consumers?
  • How would you migrate a 50-broker cluster from ZooKeeper to KRaft?
  • How do you implement exactly-once when the consumer's side effect is a SQL write?
  • What's your story for replaying a year-old event without melting the cluster?
  • How do you isolate a noisy tenant from impacting other tenants on a shared cluster?

Practice in interview format

Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
