Top 50 Kafka Interview Questions in 2026 (With Answers and Code)
Kafka is the default event backbone for any company past a certain scale. Even shops that "don't need streaming" usually have Kafka somewhere - CDC, audit, microservice events, observability pipelines. Every backend, platform, and data engineering interview in 2026 includes Kafka questions.
These 50 questions are the ones that actually came up. Compact answers, focused on what interviewers want to hear.
Core Concepts (1-10)
1. What is Apache Kafka?
A distributed, partitioned, replicated log. Producers append messages to topics; consumers read at their own pace. Built for high throughput, durability, and horizontal scale.
2. Topic, partition, offset?
A topic is a named stream of messages. A topic is split into partitions - the unit of parallelism and ordering. Each message in a partition has a monotonically increasing offset. Order is guaranteed within a partition, not across partitions.
3. What is a broker?
A Kafka server. A cluster is a group of brokers. Each partition lives on one broker (the leader) and is replicated to others (followers).
4. Producer vs consumer?
Producers append messages to topics. Consumers read messages from topics. They're decoupled - producers don't know who reads, consumers don't know who writes.
5. What is a consumer group?
A group of consumers that cooperatively consume a topic. Kafka assigns partitions to group members so each partition is read by exactly one consumer in the group. Adding consumers up to the partition count scales reads; beyond that they sit idle.
6. What is the rebalance protocol?
When a consumer joins or leaves the group, partitions are reassigned. The cooperative sticky assignor (modern default) minimizes movement and lets unaffected consumers keep working during the rebalance.
7. What's the difference between a queue and Kafka?
Traditional queues delete messages after consumption. Kafka retains messages by time or size; multiple consumer groups can replay the same data independently. It's a log, not a queue.
8. How does Kafka guarantee ordering?
Within a partition only. Messages with the same key always go to the same partition (default partitioner) so per-key order is preserved.
9. What is the partitioner?
The function that decides which partition a message goes to. Default: hash(key) % numPartitions if a key exists, otherwise sticky/round-robin. Custom partitioners are rare but useful for skew control.
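The key-to-partition mapping can be sketched in a few lines. Real Kafka clients use murmur2; the md5-based hash below is only a deterministic stand-in, so the actual partition numbers won't match a real client - it just illustrates the "same key, same partition" property.

```python
# Simplified partitioner sketch. Real clients use murmur2;
# hashlib.md5 here is only a deterministic stand-in.
import hashlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    # Deterministic hash of the key, modulo the partition count.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same key always lands on the same partition, so per-key order holds.
p1 = pick_partition(b"user-42", 6)
p2 = pick_partition(b"user-42", 6)
assert p1 == p2
```

Note that changing num_partitions changes the mapping - which is the caveat behind question 38 about adding partitions to a keyed topic.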
10. ZooKeeper vs KRaft?
Pre-3.x Kafka used ZooKeeper for cluster metadata. KRaft (Kafka Raft) replaces it - Kafka manages its own metadata with a Raft consensus quorum among controller nodes. Production-ready since Kafka 3.3; Kafka 4.0 dropped ZooKeeper support entirely. New deployments in 2026 should be KRaft-only.
Producer Behavior (11-20)
11. What are acks settings?
How many replicas must acknowledge a write before the producer considers it sent.
- acks=0: fire and forget. Possible message loss.
- acks=1: leader acknowledges. Loss possible if leader dies before replicating.
- acks=all (or -1): all in-sync replicas acknowledge. Strongest durability.
Production durability requires acks=all plus a min in-sync replica count.
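A durability-focused producer configuration, written here as a plain properties dict using the Java/confluent-kafka property names (the exact values are illustrative, not one-size-fits-all):

```python
# Durability-focused producer settings (property names as used by the
# Java client and confluent-kafka; the values are illustrative).
producer_config = {
    "acks": "all",                 # wait for all in-sync replicas
    "enable.idempotence": True,    # broker dedupes retried batches
    "retries": 2147483647,         # retry transient failures indefinitely
    "delivery.timeout.ms": 120000, # overall bound on send + retries
}
```

Pair this with min.insync.replicas=2 on the topic or broker side, or acks=all degrades to "all replicas currently in sync", which could be just the leader.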
12. What is min.insync.replicas?
How many replicas must be in sync to accept writes when acks=all. A common choice is replication_factor - 1: with RF=3, set min.insync.replicas=2 - tolerate one broker loss, refuse writes if two are down.
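The broker-side check is easy to model: with acks=all, a write is accepted only while the ISR is at least min.insync.replicas. A toy simulation of the rule, not broker code:

```python
def accepts_write(isr_size: int, min_insync_replicas: int) -> bool:
    # With acks=all, the leader rejects the write (NotEnoughReplicas)
    # if the in-sync replica set has shrunk below min.insync.replicas.
    return isr_size >= min_insync_replicas

# RF=3, min.insync.replicas=2:
assert accepts_write(3, 2)      # healthy cluster
assert accepts_write(2, 2)      # one broker down - still writable
assert not accepts_write(1, 2)  # two down - refuse writes, keep durability
```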
13. What is the idempotent producer?
enable.idempotence=true. The broker dedupes producer retries using a producer ID and sequence number. Eliminates duplicates from retries within a single producer session. Default-on in modern clients.
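The broker-side dedupe boils down to "per producer ID, only accept a batch with a newer sequence number". A toy model of that idea (the real broker requires the exact next sequence and rejects gaps; this sketch only shows the dedupe effect):

```python
class DedupingLog:
    """Toy model of idempotent-producer dedupe: track the last sequence
    number accepted per producer ID and drop replays of older batches."""
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> last accepted sequence

    def append(self, producer_id: int, seq: int, msg: str) -> bool:
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate retry - already stored
        self.log.append(msg)
        self.last_seq[producer_id] = seq
        return True

log = DedupingLog()
assert log.append(1, 0, "a")
assert not log.append(1, 0, "a")  # network retry of the same batch: dropped
assert log.append(1, 1, "b")
assert log.log == ["a", "b"]      # no duplicate despite the retry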
14. What are transactions?
Producer transactions span multiple topic-partitions. With transactional.id set, the producer can beginTransaction, send to many partitions, and commitTransaction atomically. Consumers read committed messages only when isolation.level=read_committed.
15. What is "exactly once" really?
End-to-end exactly-once requires:
- Idempotent producer + transactions on the producer side.
read_committedconsumers.- Atomic offset commits inside the producing transaction (Streams handles this; Connect can in EOS mode).
It's exactly-once for the Kafka-to-Kafka path. Pushing data to a third system (DB, S3) requires the sink to be idempotent or transactional too.
16. What is linger.ms?
How long the producer waits to fill a batch before sending. Higher linger.ms = bigger batches = better throughput but worse latency. Production tuning: 5-50ms is common.
17. batch.size vs linger.ms?
The producer sends a batch when EITHER it reaches batch.size bytes OR linger.ms time elapses. Tune both together for your throughput/latency target.
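The "whichever fires first" rule can be modeled directly. This is a hypothetical helper illustrating the trigger logic, not client code:

```python
def should_send(batch_bytes: int, waited_ms: float,
                batch_size: int = 16384, linger_ms: int = 10) -> bool:
    # The producer ships a batch when it is full OR it has lingered
    # long enough - whichever happens first.
    return batch_bytes >= batch_size or waited_ms >= linger_ms

assert should_send(16384, 1)      # full batch: send immediately
assert should_send(100, 10)       # small batch, but linger.ms elapsed
assert not should_send(100, 1)    # keep accumulating
```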
18. What is compression?
The producer can compress batches before sending (compression.type, default none). zstd typically gives the best ratio; lz4 is lighter on CPU-bound producers. Compression is end-to-end - with the broker default compression.type=producer, brokers store and forward the compressed batches as-is.
19. What happens when a partition leader dies mid-write?
If acks=all and the message reached the in-sync replicas, no loss - one replica becomes the new leader. If it only reached the old leader, it's lost. The producer retries with the new leader.
20. What is max.in.flight.requests.per.connection?
Number of unacknowledged batches the producer will pipeline. Higher = better throughput, but without idempotence, values >1 plus retries can reorder messages. With idempotence enabled it can stay up to 5 without reordering.
Consumer Behavior (21-30)
21. Auto-commit vs manual commit?
enable.auto.commit=true commits offsets in the background every 5s by default - simple but can cause data loss or reprocessing on crashes. Production code uses manual commits (commitSync after processing) for control.
22. commitSync vs commitAsync?
commitSync blocks until the broker confirms - safer at shutdown. commitAsync is non-blocking - faster in steady state. Common pattern: commitAsync while running, commitSync once at shutdown.
23. What is offset reset policy?
auto.offset.reset controls behavior when there's no committed offset. earliest reads from the start of the topic, latest reads from new messages only. Pick deliberately - getting it wrong replays everything or skips data.
24. What is consumer lag?
Latest offset minus the consumer's committed offset. If lag grows continuously, the consumer can't keep up. Monitor it - this is the most important consumer metric.
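Lag is just arithmetic over offsets, summed per partition. A minimal sketch - in practice you'd read these numbers from kafka-consumer-groups or the admin API rather than hardcoding them:

```python
def total_lag(end_offsets: dict, committed: dict) -> int:
    # Lag per partition = log end offset - committed offset;
    # a partition with no committed offset counts from 0.
    return sum(end - committed.get(tp, 0) for tp, end in end_offsets.items())

end_offsets = {("orders", 0): 1200, ("orders", 1): 900}
committed   = {("orders", 0): 1150, ("orders", 1): 900}
assert total_lag(end_offsets, committed) == 50  # partition 0 is 50 behind
```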
25. How do you scale a slow consumer?
- Add more consumers (up to partition count).
- Increase partition count if you've maxed out and still lag.
- Process in parallel within a consumer (carefully - breaks per-key ordering).
- Use a worker pool: consumer thread reads, worker threads process by key.
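The worker-pool option keeps per-key order by routing each key to a fixed worker. A single-threaded sketch of just the routing logic (assume real workers are threads draining these queues):

```python
from collections import defaultdict

def route(records, num_workers: int):
    """Assign each record to a worker by key hash, so all records for a
    key hit the same worker - per-key order survives the parallelism."""
    queues = defaultdict(list)
    for key, value in records:
        worker = hash(key) % num_workers  # stable within one process
        queues[worker].append((key, value))
    return queues

records = [("k1", 1), ("k2", 2), ("k1", 3)]
queues = route(records, 4)
# k1's records land on one worker, in their original order:
w = hash("k1") % 4
assert [v for k, v in queues[w] if k == "k1"] == [1, 3]
```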
26. What is max.poll.records?
How many records poll() returns at once. Higher = more throughput, longer per-poll processing. If processing exceeds max.poll.interval.ms, the consumer is kicked from the group.
27. What's the consumer heartbeat?
A background thread that pings the group coordinator. If session.timeout.ms passes without heartbeats, the consumer is considered dead and rebalance kicks off.
28. What is read_committed vs read_uncommitted?
read_uncommitted (default) returns all messages, including aborted transactional writes. read_committed filters out aborted transactions and waits for in-progress ones to commit. Required for exactly-once consumers.
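The visibility rule can be modeled as a filter: hide records from transactions that aborted or are still open. This is a toy model only - the real broker also holds readers at the last stable offset rather than filtering record-by-record:

```python
def visible(records, txn_status, isolation="read_committed"):
    """records: list of (txn_id, value); txn_id None = non-transactional.
    txn_status: txn_id -> 'committed' | 'aborted' | 'open'."""
    if isolation == "read_uncommitted":
        return [v for _, v in records]
    return [v for txn, v in records
            if txn is None or txn_status.get(txn) == "committed"]

records = [(None, "plain"), ("t1", "committed-msg"), ("t2", "aborted-msg")]
status = {"t1": "committed", "t2": "aborted"}
assert visible(records, status) == ["plain", "committed-msg"]
assert visible(records, status, "read_uncommitted") == \
       ["plain", "committed-msg", "aborted-msg"]
```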
29. What is partition assignment strategy?
Range, round-robin, sticky, cooperative-sticky. Modern default: cooperative-sticky - minimizes partition movement during rebalance and avoids the "stop the world" pause.
30. How do you avoid double-processing on consumer crashes?
Either:
- Make processing idempotent so duplicates are harmless.
- Use Streams or transactional outbox patterns to commit offsets atomically with the side effect.
- Use a dedupe store (Redis/DB) keyed by partition+offset.
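The dedupe-store option is essentially a set keyed by (topic, partition, offset) - in production that would be Redis or a DB table with a TTL; here an in-memory sketch:

```python
seen = set()  # stand-in for a Redis/DB dedupe store with a TTL

def process_once(topic, partition, offset, value, handler):
    key = (topic, partition, offset)
    if key in seen:
        return False        # already handled before the crash/rebalance
    handler(value)          # the side effect you must not repeat
    seen.add(key)           # record only after the side effect succeeds
    return True

out = []
assert process_once("orders", 0, 41, "charge $10", out.append)
assert not process_once("orders", 0, 41, "charge $10", out.append)  # redelivery
assert out == ["charge $10"]  # side effect ran exactly once
```

Note the remaining window: a crash between the side effect and the seen.add still reprocesses once, which is why making the handler itself idempotent is the stronger option.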
Storage, Replication, Operations (31-40)
31. How does replication work?
Each partition has a leader and follower replicas. Followers fetch from the leader. Replicas that are caught up are "in sync" (ISR). On leader failure, a new leader is elected from the ISR.
32. What is the ISR (in-sync replica) set?
Replicas within replica.lag.time.max.ms of the leader. Only ISR members can be elected leader (with default unclean.leader.election.enable=false).
33. What is unclean leader election?
Allowing an out-of-sync replica to become leader when the ISR is empty. Trades durability for availability - data loss is possible. Default off; turn on only if availability matters more than data integrity.
34. Log retention - time vs size?
retention.ms (default 7 days) and retention.bytes per partition. Whichever triggers first. Some topics use compaction instead.
35. What is log compaction?
Per-key retention: keep only the latest message for each key. Used for changelogs, KTables, configuration topics. Set cleanup.policy=compact.
36. What's a tombstone?
A message with a key and null value. In a compacted topic, it signals "delete this key" - the key is removed after the delete retention period.
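Compaction plus tombstones can be simulated in a few lines: keep the latest value per key, and a null value deletes the key. This models the end state compaction converges toward, not the actual log cleaner:

```python
def compact(log):
    """Reduce a log of (key, value) records to compacted state:
    latest value wins per key; None (a tombstone) deletes the key."""
    state = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)  # tombstone: remove the key
        else:
            state[key] = value
    return state

log = [("user1", "v1"), ("user2", "v1"), ("user1", "v2"), ("user2", None)]
assert compact(log) == {"user1": "v2"}  # user2 tombstoned away
```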
37. Tiered storage - what is it?
Modern Kafka (3.6+) can offload old log segments to S3/GCS while keeping recent data on local disk. Cuts storage cost dramatically for long-retention or replay-heavy workloads.
38. How do you add partitions safely?
kafka-topics.sh --alter --topic &lt;topic&gt; --partitions N. Caveat: existing keys may now hash to a different partition, so per-key ordering across the change point is broken. For ordering-sensitive topics, plan partition counts up front and avoid resizing.
39. How do you reassign partitions?
kafka-reassign-partitions.sh generates and executes a plan. Used to rebalance after adding brokers or to move partitions off a hot/dying broker. Throttle reassignment to avoid saturating the network.
40. What metrics matter for a Kafka cluster?
- Under-replicated partitions (should be 0).
- ISR shrinks/expands per minute (low and stable).
- Request latency (p99) on producer/consumer.
- Consumer lag per group.
- Disk usage and network I/O per broker.
Streams, Connect, and Modern Patterns (41-50)
41. What is Kafka Streams?
A JVM library for stateful stream processing. Reads from Kafka, processes (map, filter, join, aggregate), writes back to Kafka. Each instance owns partitions; state lives in local RocksDB and is backed up to a changelog topic.
42. KStream vs KTable?
A KStream is an unbounded stream of facts (every event matters). A KTable is a changelog view of state per key (only the latest value matters). Joins differ: KStream-KStream are time-windowed, KTable joins are stateful "current value" lookups.
43. What is ksqlDB?
SQL on top of Kafka Streams. Define streams and tables with CREATE STREAM / CREATE TABLE, run continuous SQL queries that produce derived topics. Good for analyst/SQL-fluent users; less flexible than Streams in code.
44. What is Kafka Connect?
A framework for moving data in and out of Kafka without writing code. Sources read from external systems (DBs via Debezium, S3, JDBC) into topics; sinks write topics to external systems. Distributed worker mode for fault tolerance.
45. What is CDC (change data capture)?
Replicating database changes as a stream. Debezium reads MySQL/Postgres replication logs and produces row-level events to Kafka. Standard pattern in 2026 for keeping caches, search indexes, and warehouses in sync with operational DBs.
46. What is the transactional outbox pattern?
Write a row to an outbox table in the same DB transaction as your business write, then have CDC publish the outbox to Kafka. Solves "write to DB, publish to Kafka" atomicity without 2PC.
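The atomicity trick is that both writes share one DB transaction. A minimal sqlite3 sketch - the table names are hypothetical, and the publisher role (tailing the outbox into Kafka) would be played by Debezium in practice:

```python
import json, sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT);
""")

# Business write and outbox write commit (or roll back) together.
with db:
    db.execute("INSERT INTO orders (id, total) VALUES (1, 99.5)")
    db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
               ("orders.created", json.dumps({"order_id": 1, "total": 99.5})))

# A CDC connector (e.g. Debezium) tails the outbox and publishes to Kafka;
# here we only show what it would read.
rows = db.execute("SELECT topic, payload FROM outbox").fetchall()
assert rows[0][0] == "orders.created"
```

Because the two INSERTs share a transaction, there is no state where the order exists but the event was never recorded, or vice versa.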
47. What are tiered storage and remote log management used for?
Long retention without local disk pressure. You can keep months of data in S3 and only days locally. Reads transparently fetch from remote storage. Game-changer for replay-heavy event-sourcing workloads.
48. Schema Registry - why?
Stores Avro/Protobuf/JSON Schemas centrally. Producers serialize with schema ID; consumers fetch the schema by ID and deserialize. Enforces compatibility rules across producers and consumers. Confluent and Apicurio are the common implementations.
49. How would you pick partition count?
A partition is read by at most one consumer in a group, so per-partition throughput is your sizing unit. Estimate target throughput / per-partition throughput, then add headroom. Don't go wildly higher - more partitions mean more open file handles, longer rebalances, and longer leader elections. 50-200 per topic is common.
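The sizing arithmetic is simple division plus headroom. A sketch - the throughput numbers are made up for illustration:

```python
import math

def partitions_needed(target_mb_s: float, per_partition_mb_s: float,
                      headroom: float = 2.0) -> int:
    # Partitions = target throughput / measured per-partition throughput,
    # scaled by a headroom factor for spikes and slow consumers.
    return math.ceil(target_mb_s / per_partition_mb_s * headroom)

# e.g. 300 MB/s target, consumers handle ~10 MB/s per partition:
assert partitions_needed(300, 10) == 60
```

Measure per_partition_mb_s with your real consumer, not a benchmark tool - deserialization and downstream writes usually dominate, not Kafka itself.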
50. When should you NOT use Kafka?
- Low-volume RPC ("we need messages") - use HTTP or gRPC.
- Strict request/response patterns - Kafka is async by design.
- Tiny teams without ops capacity for a stateful distributed system - SQS/PubSub may be enough.
- Sub-millisecond latency requirements - you can get there with Kafka but it's tuning territory.
If you want decoupling, replay, and high-throughput durable streaming, it's hard to beat.
How to Use These
Get hands-on. Run a single-node Kafka in Docker, write a producer/consumer, kill brokers and watch failover, replay topics. The questions interviewers really ask aren't trivia - they're "what would you do when X breaks." Build the intuition by breaking it yourself.
Learn the failure modes more than the happy path. Replication, ISR, rebalances, lag - these are where Kafka knowledge separates someone who's used it from someone who's read about it.