We use cookies for site analytics. Accept to help us understand how the site is used. See our Privacy Policy for details.
Partitions, consumer groups, replication, retention, and the exactly-once myth - the implementation details Kafka users gloss over until they don't.
Design a distributed message queue that ingests millions of messages per second, durably stores them across machine and data-center failures, lets independent consumer groups read at their own pace, supports replay, and provides ordering guarantees within a partition.
This is "implement Kafka" framed as a system design question. Strong candidates separate the on-disk log structure, the replication protocol, the consumer offset model, and the operational concerns (rebalances, partition rebalancing, broker failures). Weak candidates draw a queue and stop. The signal is depth - everyone knows what Kafka does; few know how.
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
Storage
Throughput
Network
Memory
Connections
The cluster has three roles: brokers (store and serve the log), controllers (cluster-state coordinator), and clients (producers and consumers).
A topic is partitioned for parallelism. Each partition is a log file (actually a sequence of segment files) on a leader broker, replicated to N-1 follower brokers. Producers send to the leader; followers fetch and append. Consumers read from the leader by default (or from followers in newer versions for locality).
The consumer group is the parallelism unit on the read side: a group of N consumers shares the work of reading a topic, with each partition assigned to exactly one consumer in the group. Offsets are tracked per (group, partition) on the cluster itself.
The defining performance trick: append-only log on disk + page cache + zero-copy from disk to socket = throughput limited by network, not CPU.
Stores partition replicas. Handles producer appends and consumer fetches. Participates in replication protocol with other brokers.
One broker elected as the cluster coordinator. Manages partition leadership assignments, ISR membership, topic creation/deletion. In KRaft mode, controllers run on a separate quorum.
Holds cluster state: brokers, topics, partitions, leaders, ISR. KRaft (Kafka Raft) replaces ZooKeeper. Strong consistency required.
Append-only log on disk, split into segment files (typically 1GB each). Index files map offsets to byte positions. Old segments deleted or compacted per retention policy.
Client library. Batches records. Computes partition (key hash or round-robin). Sends to partition leader. Handles retries and idempotency keys.
Client library. Joins a consumer group. Pulls records from assigned partitions. Commits offsets. Handles rebalances.
A broker designated per consumer group. Tracks membership, assigns partitions, stores committed offsets in an internal __consumer_offsets topic.
Followers fetch from leader using a long-poll protocol. Leader tracks high-watermark = highest offset replicated to all in-sync replicas. Consumers only see records below the high-watermark.
Old segments offloaded to S3 / object storage. Brokers serve them on demand from cold tier. Cuts hot storage cost 5-10x for long-retention topics.
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
Partitions are the central abstraction. Get them right and the system scales; get them wrong and you can't fix it cheaply.
What partitions buy you
Partition count is not free to change
Adding partitions is allowed but breaks key-based partitioning - existing keys may now hash to different partitions, breaking per-key ordering. Removing partitions is not supported online.
Rule of thumb: provision 2-5x more partitions than current peak consumer parallelism. Most clusters live with 20-50 partitions per topic for moderate-volume topics, hundreds for high-volume.
Partition key choice
Hot partition mitigation
A celebrity user dominates a partition. Mitigations:
Per-broker partition limit
A broker holds 1000-4000 partitions in practice. Beyond that, ZooKeeper / KRaft pressure, leader election storms on broker restart. A 200-broker cluster can hold ~500K-800K partitions safely.
Partition rebalancing
Adding/removing brokers triggers partition movement. The controller plans new assignments; data is replicated to the new leader; old replica is deleted. Reassignment is bandwidth-intensive (TBs flowing internally) and can be throttled to avoid impacting clients.
The partition count anti-pattern
"More partitions = better parallelism" - true to a point. Each partition has overhead: file handles, replication threads, controller metadata. 100K-partition topics are operationally painful. Pick the count for actual peak parallelism + a safety margin, not "as many as possible".
Durability comes from replicating each partition across N brokers. The ISR mechanic decides when a write is "committed".
Leader and followers
Each partition has one leader and N-1 followers. Producers write to the leader. Followers fetch from the leader continuously. Followers that have fetched up to the leader's tail recently are "in sync".
The ISR set
The set of replicas (including leader) currently caught up. Configurable lag threshold (replica.lag.time.max.ms, default 30s). A follower that falls behind is removed from ISR.
The high-watermark
The highest offset present in all ISR members. Consumers only see records up to the high-watermark. This is the durability guarantee: a consumer never sees a record that isn't replicated to all ISR members.
Producer ack levels
The min.insync.replicas knob
Combined with acks=all, this guarantees the data is on at least M brokers before ack. Standard config: replication factor 3, min.insync.replicas=2. Tolerates one broker loss without data loss.
If ISR drops below min.insync.replicas, the leader rejects writes (fails-closed - prefer unavailability over data loss).
Leader election
Leader fails. Controller picks a new leader from the ISR. Strict election: only ISR members eligible. If ISR is empty, choices:
Cross-AZ / cross-region replication
Replication failure modes
The consumer group model is elegant on paper and operationally painful in practice. Understanding rebalances is the difference between Kafka and Kafka-as-incident.
Group semantics
A consumer group is a logical reader. The group name identifies it. Each partition is assigned to exactly one consumer in the group. If the group has N consumers and the topic has P partitions:
Multiple groups can read the same topic independently - each gets its own offset.
Offset storage
Committed offsets stored in __consumer_offsets, an internal compacted topic. Each commit writes (group, topic, partition) → offset.
Compaction keeps only the latest offset per (group, topic, partition). The offset topic stays bounded.
Commit strategies
The right pattern: process, then sync-commit periodically (every N records or T seconds). Combine with idempotent processing.
Rebalance triggers
During rebalance, all consumers in the group pause - the "stop the world" problem.
Rebalance protocols
Modern Kafka clients use cooperative rebalancing. Always upgrade.
Static group membership (KIP-345)
Each consumer has a stable group.instance.id. A restart within session.timeout.ms doesn't trigger a rebalance - the consumer rejoins with the same identity. Reduces rebalance churn from rolling deploys dramatically.
The session timeout knob
session.timeout.ms (default 45s). Heartbeat must arrive within this window or the consumer is presumed dead. Too short: false positives during GC pauses. Too long: dead consumers hold partitions.
Consumer lag
The KPI. lag = log_end_offset - committed_offset. Alarm > 1 minute lag for real-time use cases. Lag growing means production rate exceeds consumption rate; scale up consumers (up to partition count).
The "consumers stuck" pattern
A poison-pill record (one the consumer crashes on) blocks the partition. Consumer crashes, restarts, hits same record, crashes again. Detection: lag growing, consumer logs show repeated crashes. Fix: skip the record (advance offset manually) and route to a DLQ for offline investigation.
Kafka's exactly-once is real, but narrower than the marketing implies.
Where it works (read-process-write within Kafka)
Producer emits to topic A. Consumer reads from A, transforms, writes to topic B. Consumer commits its A-offset.
The two writes (to B and to __consumer_offsets) happen in a Kafka transaction. Either both succeed or neither does. No partial state visible to other consumers.
This works because the consumer's "side effect" is itself a Kafka write. Both can be transacted together.
The transactional producer
Idempotent producer (KIP-98)
Even without transactions, producers can be idempotent. Each record has a sequence number per (producer_id, partition). Broker rejects duplicates with the same sequence. Eliminates network-retry-induced duplicates.
Always enable enable.idempotence=true. No downside.
Where exactly-once fails: external side effects
Consumer reads from Kafka, calls an external HTTP API. Now you have two write operations in different systems. No global transaction.
Approaches:
The honest contract
Kafka offers: at-least-once delivery, idempotent producers, transactional writes within Kafka. With careful patterns, you build effectively-once for end-to-end pipelines. "Exactly once" as a protocol guarantee across systems doesn't exist.
Consumer isolation level
Performance cost
Transactions add ~5-10ms producer latency (extra coordinator round-trips). Idempotence adds ~1-2ms. Both have negligible throughput impact when batched.
The on-disk log isn't infinite. Three policies decide what stays.
Time-based retention
log.retention.hours (or .ms). Default 168 (7 days). Segments older than this are deleted.
Use case: event logs, where old data has no value. Most topics.
Size-based retention
log.retention.bytes per partition. When exceeded, oldest segments deleted.
Use case: bounded-storage topics where time-based retention would blow disk on bursty topics.
Combined retention: deletion happens when either policy is violated.
Log compaction
For state-snapshot topics. Compaction keeps the latest record per key; older records with the same key are deleted.
Use case: __consumer_offsets, change data capture (CDC) streams, configuration topics. Anywhere "the value at this key" matters more than "the history".
How: a background thread scans segments. For each key, finds the latest offset; older records are eligible for deletion. Tombstones (records with null value) signal deletion - kept for delete.retention.ms (default 24h) so consumers can see the delete, then GC'd.
Compaction uses extra disk briefly while running (hits up to 2x the topic's size). Not free.
Mixed retention (compact + delete)
cleanup.policy=compact,delete. Compact within retention window; delete records older than the window. Use case: change-data-capture where snapshot is needed but ancient history isn't.
Tiered storage (KIP-405)
Hot tier: recent segments on broker SSD.
Cold tier: older segments offloaded to S3 / GCS.
When a consumer reads an offset in cold tier, the broker fetches the segment on demand. Slower (object storage latency) but cheap (~$25/TB-month vs SSD).
Cuts hot storage cost 5-10x for long-retention topics. Critical for use cases like compliance retention (years) or replay windows.
The retention-vs-cost trade-off
Compaction lag and the "stale read" problem
Compaction runs on a schedule, not synchronously. A topic that's been compacted recently might still have old records of a key visible. Consumers that need "current value only" must read all records and dedupe - the topic itself doesn't guarantee freshness, only eventual consistency.
Running Kafka at scale is mostly an operations problem. The protocol is well-defined; the failure modes are not.
Rolling restart
Replace JVM, kernel, or config on each broker without downtime.
Sequence per broker:
Time per broker: 5-30 minutes depending on data size and ISR catchup speed. A 200-broker cluster takes most of a day.
Broker decommission
Replacing a dead broker or downsizing.
A full broker's worth of data movement: 10-50 TB at 100 MB/s = 1-5 days. Plan ahead.
Hot broker problem
Some partitions take more traffic than others. The broker leading the hot partitions becomes the bottleneck.
Detection: per-broker bytes-in metric. The hot one is 5x the median.
Mitigation:
Disk full
Kafka brokers don't tolerate disk full gracefully. Writes fail; broker may crash.
Prevention:
ZooKeeper / KRaft considerations
ZooKeeper era: 3-5 ZK nodes for HA. ZK failures take down the whole Kafka control plane. Operationally separate cluster.
KRaft era: controllers run as a Raft quorum on dedicated nodes (or co-located on small clusters). Eliminates ZooKeeper. Migration from ZK to KRaft is supported in modern Kafka but non-trivial; plan as a multi-month project.
Capacity planning
Provision for:
Reserve capacity is real money. A right-sized cluster has spare nodes, not maxed-out ones.
Partition count
More partitions = more parallelism + more overhead. Set based on peak consumer parallelism × 2-3, not "as many as possible".
Replication factor
RF=3 with min.insync.replicas=2 is the standard. Tolerates one broker loss without unavailability or data loss. RF=2 risks data loss; RF=4+ wastes storage.
acks=1 vs acks=all
acks=1 is faster (lower latency, higher throughput); risks data loss on leader failure before replication. acks=all is safer; mandatory for any non-trivial use case.
Pull vs push consumer model
Kafka chose pull. Consumers control rate; back-pressure is natural. Push (RabbitMQ, traditional MQs) gives lower latency but consumers can be overwhelmed.
Compacted vs delete-only retention
Compacted topics hold "latest value per key" - good for state. Delete-only topics hold "history of events" - good for streams. Pick by access pattern.
Kafka vs SQS / SNS
Kafka: high throughput, replay, ordered, durable retention. Operational burden if self-managed. SQS/SNS: managed, simpler, lower throughput ceiling, no replay. Use SQS/SNS for task queueing; Kafka for streams/event sourcing.
KRaft vs ZooKeeper
KRaft is the future. New clusters start on KRaft. Existing ZK clusters migrate when due for major upgrade.
Tiered storage vs longer hot retention
Tiered cuts cost 5-10x for old data with slight read latency for cold. Always enable for retention >7 days at scale.
Self-managed vs MSK / Confluent Cloud
Managed costs 2-3x more but eliminates the ops burden. Crossover point: ~50 brokers or a dedicated 2-engineer Kafka team. Below: managed. Above: self-host with hardened automation.
Single cluster vs per-team clusters
Single: shared infra, lower cost, harder isolation. Per-team: blast radius isolation, higher overhead. Most companies have a few clusters segmented by domain (transactional, analytics, edge), not strictly per-team.
Be ready for at least three of these. The first one is almost always asked.
Fan-out at write vs read, at-least-once vs exactly-once, dead-letter queues, and the multi-channel delivery problem - one message, ten failure modes.
Batch vs streaming, lambda vs kappa, the warehouse-vs-lakehouse decision, and dimension modeling that survives schema drift.
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
Start an AI mock interview →