We use cookies for site analytics. Accept to help us understand how the site is used. See our Privacy Policy for details.
Fan-out at write vs read, at-least-once vs exactly-once, dead-letter queues, and the multi-channel delivery problem - one message, ten failure modes.
Design a notifications platform that ingests events from many producers and delivers them to many consumers across multiple channels (in-app, push, email, SMS, webhook). Producers don't know who the consumers are; consumers don't know who the producers are. The platform handles delivery guarantees, retries, user preferences, throttling, and back-pressure invisibly.
This problem has two layers: the pub-sub messaging substrate (Kafka / SNS / SQS) and the notifications product on top (per-user fan-out, channel selection, preferences, deliverability). Strong candidates separate these concerns explicitly and discuss the trade-offs at each layer.
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
Storage
Throughput
Network
Connections (in-app delivery)
Provider rate limits
The system is a layered pipeline: event ingestion → durable log → fan-out engine → per-channel delivery workers → external providers.
The defining design decision is the fan-out strategy: pre-compute deliveries at write time (fanout-on-write, like a feed), or compute at read time (fanout-on-read, lazy). Notifications usually want fanout-on-write for latency (push notifications can't wait for the user to ask).
The defining operational concern is back-pressure - when a downstream provider degrades (Twilio is rate-limiting us, SES is bouncing), the event log keeps filling up. The architecture must absorb this without losing events or spilling failure into upstream producers.
POST /events. Validates schema, assigns event_id (UUID), enriches with source metadata, publishes to the durable log. Returns 200 immediately - delivery is asynchronous.
The source of truth. Topic per event-type or per-tenant. Partitioned for parallelism (typically by user_id for ordering within a user). Replicated 3x. Retention: 7 days hot.
Maps (topic, filter) → list of subscribed user_ids. Updated when users change preferences. Cached aggressively. Also stores per-user channel preferences and quiet hours.
Consumes the event log. For each event, looks up subscribers, applies preference filters, produces per-(user, channel) delivery tasks onto channel-specific queues.
One queue per channel (in-app, push, email, SMS, webhook). Decoupling here enables per-channel back-pressure, retry policies, and worker scaling.
Per-channel worker pools. Push workers talk to APNs/FCM. Email workers talk to SES. Each worker handles channel-specific batching, formatting, retry semantics.
Holds persistent WebSocket connections to user devices. Receives in-app delivery tasks, pushes to the right device. Falls back to push notification if disconnected.
Permanently failed deliveries. Reviewed by ops; replayed manually after upstream fixes; analyzed for systemic failure patterns.
Per-user state: subscription list, channel preferences, quiet hours, frequency caps ("no more than 3 marketing emails/day"). Consulted by fan-out before producing deliveries.
Async. Tracks delivery state per (event, user, channel). Powers user-facing 'notification history' and ops-facing dashboards (delivery rates, latency, failure types).
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
The core architectural decision. It mirrors the news-feed problem but with sharper trade-offs because notifications usually require push (not pull).
Fanout-on-write (push model)
When an event arrives, immediately compute the recipient list and produce a delivery task per recipient. Writes scale with subscriber count.
Pros: low delivery latency (no waiting for the user to poll). Push notifications work (user doesn't need to be online).
Cons: write amplification = avg subscribers per event. A broadcast event (1M subscribers) = 1M deliveries to compute and queue.
Fanout-on-read (pull model)
Store events centrally; each consumer queries on connect for "events I missed since last poll". Writes are constant; reads are expensive.
Pros: minimal write amplification. Cleanly handles late subscribers.
Cons: requires polling. Latency = polling interval. Doesn't support push.
Hybrid (the production answer)
The broadcast optimization
Naive: 1 broadcast event × 100M users = 100M delivery tasks. Wasted work.
Smart: write the event to a "broadcasts" queue. All gateway nodes subscribe. Each gateway iterates its connected users (~10K each), filters per-user preferences, pushes. Total work: 1 read per gateway × 5K gateways = 5K reads instead of 100M.
Late subscribers
A user comes online after the broadcast was sent. They missed it. Solutions:
The interview signal: candidates who name both models and explain when each applies score higher than candidates who default to one.
Distributed messaging promises three flavors of delivery. Only one is practical at scale.
At-most-once
The producer sends; if anything goes wrong, the message is lost. Used by UDP-style protocols where loss is acceptable (logging, metrics). Not suitable for notifications - users miss real events.
At-least-once (the standard)
Every message is delivered one or more times. Duplicates happen on retries (network failures, broker restarts, consumer crashes after processing but before commit). Consumers must be idempotent on event_id.
This is what Kafka, SQS, RabbitMQ, NATS, and every other production messaging system actually delivers. Anyone who claims "exactly-once" without significant qualification is selling something.
Exactly-once (almost a marketing claim)
The message is delivered exactly once. Mathematically requires distributed transactions across producer, broker, and consumer.
Kafka has "exactly-once semantics" (EOS) via transactional producer + idempotent consumer + transactional commit of offsets. It works only when:
The moment your consumer's side effect is "send an email" or "POST to a webhook", exactly-once is back to at-least-once + idempotency on the consumer.
Idempotency in practice
Each event has a globally unique event_id (UUID). Consumers maintain a recent-events bloom filter or KV store of "events I've already processed". Duplicates are detected and skipped.
Idempotency window: how far back to deduplicate. 24 hours is typical (matches the worst-case retry tail). Storage: ~100M events/day × 16 bytes (UUID hash) = 1.6 GB. Trivial.
Where duplicates come from
The real-world contract
Kafka guarantees: "at-least-once delivery within retention; ordered within a partition; durable across N replicas". That's the truth. Build idempotent consumers; live happily.
Each channel has its own protocol, its own provider, its own rate limits, and its own failure modes. The architecture must isolate them.
In-app (WebSocket gateway)
Persistent connection to the user's device. On delivery: lookup connection, write to socket. If user has multiple devices, deliver to all. If no active connection, mark for push fallback.
Latency: sub-second when connected.
Failure mode: disconnected user. Fallback: enqueue push.
Push (APNs / FCM)
HTTP/2 connection to APNs (iOS) or FCM (Android). Per-device tokens stored in the user's device registry.
Latency: typically <5 seconds globally; APNs is fast.
Failure modes:
Email (SES, SendGrid, Mailgun)
SMTP submission to the provider. Per-recipient.
Latency: 5-60 seconds typically; can be longer if the recipient's mailbox is slow.
Failure modes:
SMS (Twilio, AWS SNS)
HTTP API to provider. Per-recipient. Most expensive channel.
Latency: 5-30 seconds typically.
Failure modes:
Webhook (HTTP POST to merchant endpoint)
For B2B notifications (Stripe-style). HMAC-signed payload.
Latency: depends on merchant endpoint.
Failure modes:
Per-channel queues
Each channel has its own queue and worker pool. A Twilio outage doesn't back-pressure email delivery. A spike in webhook deliveries doesn't crowd out push.
Channel selection logic
Per-user preferences determine which channels deliver which event types. "Mention" → in-app + push. "Daily digest" → email only. "Critical alert" → all channels. Selection happens in the fan-out stage, before queueing.
When a downstream channel degrades, the messaging substrate must absorb the failure without losing data or cascading to producers.
The failure scenario
Twilio's API starts rate-limiting us (or returns 5xx). Our SMS workers keep retrying; the queue grows; eventually it overflows.
Back-pressure at the queue layer
Per-channel queues have a configurable max depth (e.g., 1M messages). When approaching the limit:
The event log is the buffer that absorbs the back-pressure. As long as Kafka has retention, no events are lost - SMS deliveries are simply delayed.
Dead-letter queue (DLQ)
A delivery that fails after N retries (e.g., 7 attempts over 24 hours) goes to the DLQ. The message is preserved with the failure reason; it's not lost.
DLQ workflow:
The DLQ is the safety net. Without it, transient failures permanently lose messages.
Selective shedding
At extreme load, the platform may need to drop low-priority deliveries to preserve high-priority ones. Examples:
Shedding policies live in the fan-out service. They consult per-event priority and current queue depths.
Circuit breakers
Each channel worker pool has a circuit breaker. If the provider's error rate exceeds threshold (e.g., 50% errors in 1 min), open the breaker - stop sending for a cool-off period. Re-test with a small probe; close the breaker when healthy.
This prevents pile-on (sending more requests when the provider is already struggling) and allows the provider to recover.
Cascading failure prevention
The architecture explicitly does NOT couple producer success to delivery success. Producers see 200 the moment the event is durably logged in Kafka. If every downstream channel is down, producers are unaffected.
This decoupling is non-negotiable. Coupling producer to delivery means a Twilio outage takes down our event ingestion.
Users want control over what they receive. The platform must enforce preferences before delivery, not after - sending and then asking forgiveness damages the brand and risks regulatory exposure.
Preference model
Per (user, topic, channel): allow / deny.
user_42:
topic="security_alert", channel="push": allow
topic="security_alert", channel="email": allow
topic="marketing", channel="email": allow
topic="marketing", channel="push": deny
topic="social_mention", channel="*": allow
Stored in DynamoDB / Redis. Looked up in the fan-out path.
Default preferences
New users get sensible defaults: transactional notifications all-channels, marketing email-only, social in-app-only. Users opt out of categories; rarely opt in to ones they don't get by default.
Quiet hours
"Don't send push notifications between 10pm-7am in the user's timezone". Implemented by checking the user's local time at delivery; defer to the next open window.
Edge case: a "critical" event (security alert) overrides quiet hours. The product team decides which categories are override-eligible.
Frequency caps
"No more than 3 marketing emails per week". Implemented by maintaining a per-user, per-category counter with a sliding window. At fan-out: check counter; if over cap, drop or queue for next window.
This is a mini rate-limiter built into the notifications path.
Bundling / digest mode
Some users prefer "send me a summary at 5pm" instead of real-time. The fan-out service writes events to a per-user pending-bundle table. A scheduled job (cron at 5pm per timezone) compiles the bundle and sends one email.
Compliance hooks
The preference service owns this state and is authoritative. Other services consult it; never bypass.
Subscription management UI
Users access /settings/notifications. Listed by topic; each topic has a per-channel toggle. Saved settings flow to the preference service via API; immediately effective on next event.
Enterprise / admin overrides
For B2B products, administrators may force certain notifications regardless of user preferences (security alerts, billing notices). The preference service has a "mandatory" flag per (org, topic) that overrides user toggles for org-internal events.
Kafka's central trick: partitions provide parallelism for free, but ordering is preserved only within a partition. The choice of partition key shapes the entire system.
Partition by what?
For notifications, partition by user_id is the standard. Per-user ordering matters (notifications about a single conversation should arrive in order); cross-user ordering doesn't.
Hot partition mitigation
A celebrity user (millions of incoming notifications) creates a hot partition. Mitigation:
Consumer parallelism
A topic with 100 partitions can be consumed by up to 100 parallel consumers. Adding more consumers doesn't help (idle consumers). Setting partition count requires forecasting peak parallelism needed; it's painful to change after the fact.
Rule of thumb: 2-3x more partitions than expected peak consumer count. A topic expected to peak at 50 consumers gets 100-150 partitions.
Consumer groups and offset management
Multiple workers form a consumer group; Kafka assigns partitions to workers within the group. Each worker tracks its committed offset.
Failure recovery: a worker dies; Kafka reassigns its partitions to surviving workers; they resume from the last committed offset. Some duplicates possible (events processed but not committed) - hence at-least-once + idempotency.
Rebalance pauses
When the consumer group changes (worker added/removed), Kafka rebalances - all consumers pause briefly while reassignment happens. Modern Kafka (cooperative rebalancing) reduces pause to single-partition impact instead of group-wide.
Lag monitoring
The KPI for any pub-sub system is "consumer lag" - how far behind real-time the consumer is. Alert at >1 minute lag for in-app channels, >10 minutes for email.
Lag is the integral over time of (production rate - consumption rate). When production exceeds consumption, lag grows linearly. Alarm immediately - by the time you notice manually, you're hours behind.
Consumer scaling
Lag growing → add consumers (up to partition count). If at partition count, the only options are:
Fanout-on-write vs fanout-on-read
Fanout-on-write is the default for low-latency push. Fanout-on-read is appropriate for digest emails and broadcast scenarios. Real systems use both; the boundary is set by fanout cardinality.
At-least-once vs exactly-once
Exactly-once is mostly marketing. Build at-least-once + idempotent consumers. The cost is per-consumer dedup (cheap); the benefit is operational simplicity.
Per-channel queues vs unified queue
Per-channel queues isolate failures and enable per-channel scaling. Unified queue is simpler but creates correlated failures. Production answer: per-channel.
Kafka vs SNS+SQS vs custom
Kafka: high throughput, low latency, durable retention - the dominant choice at scale. SNS+SQS: managed, simpler, lower max throughput, no replay. Custom: rarely worth it. AWS shops often start with SNS+SQS and graduate to Kafka (MSK or self-managed) as scale grows.
Push vs pull subscription model
Push (broker pushes to consumers) has lower latency but consumers can be overwhelmed. Pull (consumers fetch when ready) has natural back-pressure. Kafka and most production systems are pull-based for this reason.
Synchronous fan-out vs async
Synchronous (compute fanout in the producer's API call) couples producer latency to subscriber count. Async (queue the event, fanout in the background) decouples but adds latency. Always async for non-trivial fanout; synchronous only for 1:1 or 1:few cases where the latency cost is acceptable.
Strict ordering vs throughput
Strict per-(user, topic) ordering requires per-user partitioning, which limits parallelism. If ordering doesn't matter (independent events), random partitioning gives more throughput. Be explicit at interview time about which events need ordering.
Provider lock-in vs portability
Building on Kafka + SES + Twilio + APNs/FCM ties you to those providers. Abstracting (Twilio behind a 'SMS provider' interface) costs engineering effort but enables provider failover. For business-critical channels, the abstraction is worth it.
Centralized vs per-product platforms
A single notifications platform serving all product teams gives consistency, shared deliverability reputation, and shared compliance posture. Cost: shared platform must accommodate all teams' needs. Most companies start per-product and consolidate.
Be ready for at least three of these. The first one is almost always asked.
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
Start an AI mock interview →