Design a Notifications / Pub-Sub System (Kafka / SNS / SQS)
Fan-out at write vs read, at-least-once vs exactly-once, dead-letter queues, and the multi-channel delivery problem - one message, ten failure modes.
The problem
Design a notifications platform that ingests events from many producers and delivers them to many consumers across multiple channels (in-app, push, email, SMS, webhook). Producers don't know who the consumers are; consumers don't know who the producers are. The platform handles delivery guarantees, retries, user preferences, throttling, and back-pressure invisibly.
This problem has two layers: the pub-sub messaging substrate (Kafka / SNS / SQS) and the notifications product on top (per-user fan-out, channel selection, preferences, deliverability). Strong candidates separate these concerns explicitly and discuss the trade-offs at each layer.
Clarifying questions
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
- What's the volume - events ingested/sec, notifications delivered/sec at peak?
- What channels - in-app only, or also push (APNs/FCM), email (SES/SendGrid), SMS (Twilio), webhooks?
- What delivery guarantees - at-most-once, at-least-once, exactly-once?
- How are subscriptions defined - per-user preferences, per-topic, server-side rules engine?
- What's the latency expectation - sub-second for in-app, minutes for digest emails?
- Multi-tenant - one notification platform serving many product teams?
- Compliance constraints - email unsubscribe (CAN-SPAM), SMS opt-in (TCPA), GDPR consent?
- What's the fanout factor - 1:1 (transactional), 1:N (one event → many users), or 1:millions (broadcast)?
Requirements
Functional requirements
- POST /events - producer publishes an event with a topic + payload
- Subscribe consumers (user_id + channels + filters) to topics
- Deliver events to subscribed consumers' chosen channels
- Per-user preferences (mute topic, quiet hours, frequency caps)
- Retry failed deliveries with exponential backoff
- Dead-letter queue for permanently undeliverable messages
- Delivery audit log (when was the event sent, did it succeed, did the user see it)
Non-functional requirements
- Scale
- 1M events ingested/sec at peak. 100M users with avg 10 subscriptions each. Average fanout 100x → 100M deliveries/sec gross; after preference filtering, ~10M deliveries/sec actual. Burstable to 10x for product launches.
- Latency
- In-app push p99 < 5 seconds end-to-end (event ingested → user sees notification on connected device). Email/SMS p99 < 60 seconds (network transit + provider queue). Digest emails: hourly or daily, on schedule.
- Availability
- 99.95% for ingestion (producers should never be blocked). 99.9% for delivery (downstream channels have their own reliability). The platform must absorb provider outages (Twilio down) without losing events.
- Consistency
- At-least-once delivery is the default. Producers and consumers must be idempotent on event_id. Event ordering within a (user, topic) is preserved; cross-topic ordering is not guaranteed.
Capacity estimation
Storage
- Event log: 1M events/sec × 1KB avg payload = 1 GB/sec = 86 TB/day. Retain hot for 7 days: ~600 TB. Cold archive longer (S3): months of history at PB scale.
- Subscription index: 100M users × 10 subs avg × 100 bytes = 100 GB. Fits in DynamoDB / Redis.
- Delivery log: every delivery attempt. ~10M/sec × 200 bytes × 7 days hot = ~1.2 PB/week. Aggressive sampling (log only failures + 1% of successes) cuts this 100x.
Throughput
- Ingestion: 1M events/sec. Kafka tier: ~50 brokers at 20K events/sec/broker. Achievable with commodity hardware.
- Fanout: 1M events × 100 fanout × subs filtering = ~10M deliveries/sec across all channels.
- Per-channel: in-app via WebSockets ~5M/sec, push via APNs/FCM ~3M/sec, email via SES ~2M/sec, SMS via Twilio ~500K/sec, webhooks ~500K/sec.
Network
- Internal: producer → broker → fanout = 3x event volume internally = ~3 GB/s. Inter-AZ traffic dominates the network bill.
- External (channel egress): ~10 GB/s combined (push payloads, SMTP, HTTP webhooks).
Connections (in-app delivery)
- 100M users; ~30M concurrent on average. Each holds a persistent WebSocket. ~5,000 connection-gateway nodes at 6K conns/node.
Provider rate limits
- APNs: 9K-21K notifications/sec/connection (HTTP/2 multiplexing). Need many connections (~500) for 5M/sec push.
- SES: starts at 14/sec; warmup to 50K+/sec sustained. Plan provider warmup as a launch gate.
- Twilio SMS: ~1/sec per long-code; short-codes ($K/month each) handle ~100/sec and are required for meaningful SMS volume.
High-level architecture
The system is a layered pipeline: event ingestion → durable log → fan-out engine → per-channel delivery workers → external providers.
The defining design decision is the fan-out strategy: pre-compute deliveries at write time (fanout-on-write, like a feed), or compute at read time (fanout-on-read, lazy). Notifications usually want fanout-on-write for latency (push notifications can't wait for the user to ask).
The defining operational concern is back-pressure - when a downstream provider degrades (Twilio is rate-limiting us, SES is bouncing), the event log keeps filling up. The architecture must absorb this without losing events or spilling failure into upstream producers.
Event ingestion API
POST /events. Validates schema, assigns event_id (UUID), enriches with source metadata, publishes to the durable log. Returns 200 immediately - delivery is asynchronous.
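A minimal sketch of that path, assuming a confluent-kafka producer; the HTTP framework, schema validation, and auth are omitted, and the topic naming is illustrative.

```python
import json
import time
import uuid

from confluent_kafka import Producer  # assumed client library

producer = Producer({"bootstrap.servers": "kafka:9092", "acks": "all"})

def ingest_event(topic: str, payload: dict, source: str) -> dict:
    """Enrich and durably log one event; delivery happens asynchronously."""
    event = {
        "event_id": str(uuid.uuid4()),   # idempotency key for every downstream consumer
        "topic": topic,
        "payload": payload,
        "source": source,
        "ingested_at": time.time(),
    }
    # Key by the entity the event concerns (usually user_id) to keep per-key ordering.
    key = str(payload.get("user_id", event["event_id"]))
    producer.produce(f"events.{topic}", key=key, value=json.dumps(event))
    producer.poll(0)  # serve delivery callbacks without blocking the request path
    return {"status": "accepted", "event_id": event["event_id"]}
```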
Durable event log (Kafka)
The source of truth. Topic per event-type or per-tenant. Partitioned for parallelism (typically by user_id for ordering within a user). Replicated 3x. Retention: 7 days hot.
Subscription registry
Maps (topic, filter) → list of subscribed user_ids. Updated when users change preferences. Cached aggressively. Also stores per-user channel preferences and quiet hours.
Fan-out service
Consumes the event log. For each event, looks up subscribers, applies preference filters, produces per-(user, channel) delivery tasks onto channel-specific queues.
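A sketch of the fan-out loop, again assuming confluent-kafka; `subscriptions` and `preferences` are stand-ins for the subscription registry and preference service described below.

```python
import json

from confluent_kafka import Consumer, Producer  # assumed client library

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "fanout",
    "enable.auto.commit": False,
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events.social_mention"])
producer = Producer({"bootstrap.servers": "kafka:9092"})

def fan_out_forever(subscriptions, preferences):
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        for user_id in subscriptions.subscribers(event["topic"]):
            for channel in preferences.allowed_channels(user_id, event["topic"]):
                task = {"event_id": event["event_id"], "user_id": user_id,
                        "channel": channel, "payload": event["payload"]}
                # One topic per channel keeps back-pressure channel-local.
                producer.produce(f"deliveries.{channel}", key=str(user_id),
                                 value=json.dumps(task))
        producer.flush()
        consumer.commit(message=msg, asynchronous=False)  # commit only after tasks are queued
```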
Channel delivery queues (SQS / Kafka topics)
One queue per channel (in-app, push, email, SMS, webhook). Decoupling here enables per-channel back-pressure, retry policies, and worker scaling.
Channel workers
Per-channel worker pools. Push workers talk to APNs/FCM. Email workers talk to SES. Each worker handles channel-specific batching, formatting, retry semantics.
Connection gateway (in-app)
Holds persistent WebSocket connections to user devices. Receives in-app delivery tasks, pushes to the right device. Falls back to push notification if disconnected.
Dead-letter queue (DLQ)
Permanently failed deliveries. Reviewed by ops; replayed manually after upstream fixes; analyzed for systemic failure patterns.
Preference + throttling service
Per-user state: subscription list, channel preferences, quiet hours, frequency caps ("no more than 3 marketing emails/day"). Consulted by fan-out before producing deliveries.
Audit / observability pipeline
Async. Tracks delivery state per (event, user, channel). Powers user-facing 'notification history' and ops-facing dashboards (delivery rates, latency, failure types).
Deep dives
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
1. Fanout-on-write vs fanout-on-read
The core architectural decision. It mirrors the news-feed problem but with sharper trade-offs because notifications usually require push (not pull).
Fanout-on-write (push model)
When an event arrives, immediately compute the recipient list and produce a delivery task per recipient. Writes scale with subscriber count.
Pros: low delivery latency (no waiting for the user to poll). Push notifications work (user doesn't need to be online).
Cons: write amplification = avg subscribers per event. A broadcast event (1M subscribers) = 1M deliveries to compute and queue.
Fanout-on-read (pull model)
Store events centrally; each consumer queries on connect for "events I missed since last poll". Writes are constant; reads are expensive.
Pros: minimal write amplification. Cleanly handles late subscribers.
Cons: requires polling. Latency = polling interval. Doesn't support push.
Hybrid (the production answer)
- For low-cardinality fanout (1:1, 1:few): fanout-on-write. Cheap.
- For high-cardinality fanout (broadcast to 1M users): fanout-on-write to a shared in-memory buffer per gateway; gateway pushes to all connected users (one shared event, per-user filter at gateway).
- For very-high-cardinality (broadcast to all 100M users): fanout-on-write to a "broadcast" topic; gateways subscribe to broadcasts and filter at the edge.
- Email digests (low-frequency, batched): fanout-on-read. Hourly job pulls events the user subscribed to, packages a digest.
The broadcast optimization
Naive: 1 broadcast event × 100M users = 100M delivery tasks. Wasted work.
Smart: write the event to a "broadcasts" queue. All gateway nodes subscribe. Each gateway iterates its connected users (~6K each), filters per-user preferences, pushes. Total work: 1 read per gateway × 5K gateways = 5K reads instead of 100M.
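A gateway-side sketch of that broadcast path; `local_connections`, the `preferences` interface, and the payload shape are all hypothetical.

```python
import json

def handle_broadcast(event: dict, local_connections: dict, preferences) -> int:
    """local_connections: {user_id: websocket} held by this gateway node."""
    delivered = 0
    for user_id, socket in local_connections.items():
        # Per-user filtering happens at the edge, not in the fan-out tier.
        if not preferences.allows(user_id, event["topic"], channel="in_app"):
            continue
        socket.send(json.dumps({"event_id": event["event_id"], "payload": event["payload"]}))
        delivered += 1
    return delivered
```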
Late subscribers
A user comes online after the broadcast was sent. They missed it. Solutions:
- Backfill: on connect, gateway checks "did this user miss any in-app notifications?" via fanout-on-read against the durable log.
- Push fallback: if the user is offline at fanout time, also enqueue an OS push (APNs/FCM). When they open the app, in-app notifications are reconciled.
The interview signal: candidates who name both models and explain when each applies score higher than candidates who default to one.
2. Delivery guarantees: at-least-once is the right answer
Distributed messaging promises three flavors of delivery. Only one is practical at scale.
At-most-once
The producer sends; if anything goes wrong, the message is lost. Used by UDP-style protocols where loss is acceptable (logging, metrics). Not suitable for notifications - users miss real events.
At-least-once (the standard)
Every message is delivered one or more times. Duplicates happen on retries (network failures, broker restarts, consumer crashes after processing but before commit). Consumers must be idempotent on event_id.
This is what Kafka, SQS, RabbitMQ, NATS, and every other production messaging system actually delivers. Anyone who claims "exactly-once" without significant qualification is selling something.
Exactly-once (almost a marketing claim)
The message is delivered exactly once. In the general case this requires a distributed transaction spanning producer, broker, and consumer.
Kafka has "exactly-once semantics" (EOS) via idempotent, transactional producers, consumers reading only committed data (read_committed), and offsets committed inside the same transaction as the output. It works only when:
- Both producer and consumer are within the Kafka ecosystem.
- The "side effect" of the consumer is also a Kafka write.
The moment your consumer's side effect is "send an email" or "POST to a webhook", exactly-once is back to at-least-once + idempotency on the consumer.
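A sketch of Kafka EOS with confluent-kafka's transactional API - workable precisely because the consumer's only side effect here is another Kafka write; topic, group, and transactional IDs are illustrative.

```python
from confluent_kafka import Consumer, Producer, TopicPartition

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "transactional.id": "fanout-worker-1",  # stable per worker instance
    "enable.idempotence": True,
})
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "fanout",
    "isolation.level": "read_committed",    # never observe aborted writes
    "enable.auto.commit": False,
})
consumer.subscribe(["events"])
producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    producer.produce("deliveries.in_app", key=msg.key(), value=msg.value())
    # The consumed offsets commit inside the same transaction as the output write.
    producer.send_offsets_to_transaction(
        [TopicPartition(msg.topic(), msg.partition(), msg.offset() + 1)],
        consumer.consumer_group_metadata(),
    )
    producer.commit_transaction()
```

Replace that produce() with a Twilio call and nothing brackets it inside the transaction - you're back to at-least-once plus idempotency.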
Idempotency in practice
Each event has a globally unique event_id (UUID). Consumers maintain a recent-events bloom filter or KV store of "events I've already processed". Duplicates are detected and skipped.
Idempotency window: how far back to deduplicate. 24 hours is typical (matches the worst-case retry tail). Storage at 1M events/sec: ~86B event_ids/day × 16 bytes ≈ 1.4 TB fleet-wide - and each consumer only tracks the partitions it owns, so the per-consumer share stays small.
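A dedup sketch assuming a shared Redis cluster; the key layout, TTL, and `send_email` helper are illustrative.

```python
import redis  # assumed shared KV cluster for the dedup window

r = redis.Redis(host="dedup-cache", port=6379)
DEDUP_TTL_SECONDS = 24 * 3600  # match the worst-case retry tail

def already_processed(dedup_key: str) -> bool:
    """Atomic record-and-check: True if this key has been seen inside the window."""
    # SET NX succeeds only for the first writer; the TTL bounds storage growth.
    first_time = r.set(f"dedup:{dedup_key}", 1, nx=True, ex=DEDUP_TTL_SECONDS)
    return not first_time

def handle_delivery(task: dict) -> None:
    key = f"email:{task['event_id']}:{task['user_id']}"
    if already_processed(key):
        return  # duplicate from a retry, rebalance, or double-read - skip silently
    send_email(task)  # hypothetical channel call
```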
Where duplicates come from
- Retries on transient failures (the message was processed; the ack failed; producer retries).
- Consumer crash after processing but before committing offset.
- Broker rebalance during processing.
- Multiple consumer instances accidentally processing the same partition.
The real-world contract
Kafka guarantees: "at-least-once delivery within retention; ordered within a partition; durable across N replicas". That's the truth. Build idempotent consumers; live happily.
3. Multi-channel delivery: push, email, SMS, webhook, in-app
Each channel has its own protocol, its own provider, its own rate limits, and its own failure modes. The architecture must isolate them.
In-app (WebSocket gateway)
Persistent connection to the user's device. On delivery: lookup connection, write to socket. If user has multiple devices, deliver to all. If no active connection, mark for push fallback.
Latency: sub-second when connected.
Failure mode: disconnected user. Fallback: enqueue push.
Push (APNs / FCM)
HTTP/2 connection to APNs (iOS) or FCM (Android). Per-device tokens stored in the user's device registry.
Latency: typically <5 seconds globally; APNs is fast.
Failure modes:
- InvalidToken (user uninstalled the app): mark token invalid; stop sending.
- Device offline: APNs stores only the most recent notification per device and delivers it when the device reconnects; older undelivered pushes are dropped.
- Quota exceeded: APNs allows ~9K-21K notifications per second per HTTP/2 connection. Open more connections.
Email (SES, SendGrid, Mailgun)
SMTP submission to the provider. Per-recipient.
Latency: 5-60 seconds typically; can be longer if the recipient's mailbox is slow.
Failure modes:
- Hard bounce (invalid address): mark unsubscribed; stop sending. Affects sender reputation if not handled.
- Soft bounce (full mailbox, temporary): retry over hours.
- Spam classification: degrades sender reputation; reduces deliverability for everyone.
- Rate limits: SES starts at 14/sec; warmup to 50K+/sec over weeks. Plan ahead.
SMS (Twilio, AWS SNS)
HTTP API to provider. Per-recipient. Most expensive channel.
Latency: 5-30 seconds typically.
Failure modes:
- Invalid number: mark invalid.
- Carrier filtering (spam-suspected SMS dropped): hard to detect; affects deliverability silently.
- Rate limits: long-code (regular phone number) ~1/sec; toll-free somewhat higher; short-code ~100/sec. Plan based on volume.
- Compliance: SMS is heavily regulated (TCPA in US, similar elsewhere). Opt-in must be explicit; STOP/HELP keywords are mandatory.
Webhook (HTTP POST to merchant endpoint)
For B2B notifications (Stripe-style). HMAC-signed payload.
Latency: depends on merchant endpoint.
Failure modes:
- Endpoint down: retry per schedule (1m, 5m, 30m, ..., 24h); see the sketch after this list.
- Endpoint slow: per-endpoint concurrency limit; circuit breaker.
- Endpoint returns 4xx: retry but flag for ops review (likely merchant misconfiguration).
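A delivery sketch combining the HMAC signing and retry behavior above, assuming the `requests` library; the header names and schedule are illustrative, not a standard.

```python
import hashlib
import hmac
import json

import requests  # assumed HTTP client

RETRY_SCHEDULE_SECONDS = [60, 300, 1800, 7200, 86400]  # illustrative backoff tiers

def sign(body: bytes, secret: str) -> str:
    # The merchant recomputes this over the raw body with the shared secret.
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

def deliver_webhook(url: str, event: dict, secret: str, attempt: int) -> bool:
    body = json.dumps(event).encode()
    headers = {
        "Content-Type": "application/json",
        "X-Signature": sign(body, secret),
        "X-Delivery-Attempt": str(attempt),
    }
    try:
        resp = requests.post(url, data=body, headers=headers, timeout=10)
    except requests.RequestException:
        return False  # network error: re-enqueue per RETRY_SCHEDULE_SECONDS
    if resp.status_code // 100 == 2:
        return True
    return False  # 4xx/5xx: caller retries and flags persistent 4xx for ops review
```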
Per-channel queues
Each channel has its own queue and worker pool. A Twilio outage doesn't back-pressure email delivery. A spike in webhook deliveries doesn't crowd out push.
Channel selection logic
Per-user preferences determine which channels deliver which event types. "Mention" → in-app + push. "Daily digest" → email only. "Critical alert" → all channels. Selection happens in the fan-out stage, before queueing.
4. Back-pressure, dead-letter queues, and graceful degradation
When a downstream channel degrades, the messaging substrate must absorb the failure without losing data or cascading to producers.
The failure scenario
Twilio's API starts rate-limiting us (or returns 5xx). Our SMS workers keep retrying; the queue grows; eventually it overflows.
Back-pressure at the queue layer
Per-channel queues have a configurable max depth (e.g., 1M messages). When approaching the limit:
- Workers prioritize older messages (FIFO with TTL).
- Producers (the fan-out service) slow down or stop producing new SMS deliveries.
- Critically: the durable event log (Kafka) keeps accepting events. Other channels are unaffected.
The event log is the buffer that absorbs the back-pressure. As long as Kafka has retention, no events are lost - SMS deliveries are simply delayed.
Dead-letter queue (DLQ)
A delivery that fails after N retries (e.g., 7 attempts over 24 hours) goes to the DLQ. The message is preserved with the failure reason; it's not lost.
DLQ workflow:
- Alarms fire when DLQ depth exceeds threshold.
- Ops investigate (provider issue? bug? bad data?).
- Fix the root cause.
- Replay from DLQ back to the channel queue.
The DLQ is the safety net. Without it, transient failures permanently lose messages.
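A sketch of the retry-then-DLQ routing for the SMS channel, assuming confluent-kafka; Kafka has no native delayed delivery, so per-attempt retry topics approximate the backoff schedule, and all topic names are illustrative.

```python
import json

from confluent_kafka import Producer  # assumed client library

MAX_ATTEMPTS = 7
producer = Producer({"bootstrap.servers": "kafka:9092"})

def handle_failure(task: dict, error: str) -> None:
    """Re-enqueue with backoff until MAX_ATTEMPTS, then park the task in the DLQ."""
    task["attempts"] = task.get("attempts", 0) + 1
    task["last_error"] = error
    if task["attempts"] >= MAX_ATTEMPTS:
        # Nothing is lost: the full task plus its failure reason waits for ops replay.
        producer.produce("deliveries.sms.dlq", value=json.dumps(task))
    else:
        # One retry topic per attempt tier, each drained after its own delay.
        producer.produce(f"deliveries.sms.retry.{task['attempts']}", value=json.dumps(task))
    producer.flush()
```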
Selective shedding
At extreme load, the platform may need to drop low-priority deliveries to preserve high-priority ones. Examples:
- Marketing emails dropped before transactional ones.
- "FYI" notifications dropped before "critical" ones.
- Free-tier customers' notifications dropped before paid-tier.
Shedding policies live in the fan-out service. They consult per-event priority and current queue depths.
Circuit breakers
Each channel worker pool has a circuit breaker. If the provider's error rate exceeds threshold (e.g., 50% errors in 1 min), open the breaker - stop sending for a cool-off period. Re-test with a small probe; close the breaker when healthy.
This prevents pile-on (sending more requests when the provider is already struggling) and allows the provider to recover.
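A minimal in-process breaker sketch; the thresholds and half-open probe policy are assumptions to tune per provider.

```python
import time

class CircuitBreaker:
    """Trip open on a high recent error rate; let a probe through after a cool-off."""

    def __init__(self, error_threshold=0.5, window=60, cooloff=30, min_calls=20):
        self.error_threshold = error_threshold
        self.window = window        # seconds of history considered
        self.cooloff = cooloff      # seconds to stay open before probing
        self.min_calls = min_calls  # don't trip on a tiny sample
        self.calls = []             # (timestamp, ok) pairs inside the window
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the cool-off, a probe request is allowed through.
        return time.time() - self.opened_at >= self.cooloff

    def record(self, ok: bool) -> None:
        now = time.time()
        self.calls = [(t, o) for t, o in self.calls if now - t < self.window]
        self.calls.append((now, ok))
        if ok and self.opened_at is not None:
            self.opened_at = None   # probe succeeded: close the breaker
            return
        errors = sum(1 for _, o in self.calls if not o)
        if len(self.calls) >= self.min_calls and errors / len(self.calls) >= self.error_threshold:
            self.opened_at = now    # trip open: stop hammering the provider
```

Workers call allow() before each provider request and record() with the outcome.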
Cascading failure prevention
The architecture explicitly does NOT couple producer success to delivery success. Producers see 200 the moment the event is durably logged in Kafka. If every downstream channel is down, producers are unaffected.
This decoupling is non-negotiable. Coupling producer to delivery means a Twilio outage takes down our event ingestion.
5. User preferences, quiet hours, and frequency capping
Users want control over what they receive. The platform must enforce preferences before delivery, not after - sending and then asking forgiveness damages the brand and risks regulatory exposure.
Preference model
Per (user, topic, channel): allow / deny.
user_42:
topic="security_alert", channel="push": allow
topic="security_alert", channel="email": allow
topic="marketing", channel="email": allow
topic="marketing", channel="push": deny
topic="social_mention", channel="*": allow
Stored in DynamoDB / Redis. Looked up in the fan-out path.
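A lookup sketch over that model; the resolution order (exact channel, then wildcard, then category default) is an assumption about how ties are broken.

```python
# Illustrative in-memory stand-in for the DynamoDB/Redis-backed preference store.
PREFS = {
    ("user_42", "marketing", "email"): "allow",
    ("user_42", "marketing", "push"): "deny",
    ("user_42", "social_mention", "*"): "allow",
}
CATEGORY_DEFAULTS = {"security_alert": "allow", "marketing": "deny", "social_mention": "allow"}

def allowed(user_id: str, topic: str, channel: str) -> bool:
    # Most specific rule wins: exact channel, then the "*" wildcard, then the default.
    for key in ((user_id, topic, channel), (user_id, topic, "*")):
        if key in PREFS:
            return PREFS[key] == "allow"
    return CATEGORY_DEFAULTS.get(topic, "deny") == "allow"
```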
Default preferences
New users get sensible defaults: transactional notifications all-channels, marketing email-only, social in-app-only. Users opt out of categories; rarely opt in to ones they don't get by default.
Quiet hours
"Don't send push notifications between 10pm-7am in the user's timezone". Implemented by checking the user's local time at delivery; defer to the next open window.
Edge case: a "critical" event (security alert) overrides quiet hours. The product team decides which categories are override-eligible.
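A deferral sketch for quiet hours using zoneinfo; the 10pm-7am window and the critical-override flag follow the text above, everything else is an assumption.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

QUIET_START, QUIET_END = 22, 7  # quiet from 10pm until 7am local time

def defer_until(user_tz: str, now_utc: datetime, critical: bool) -> datetime | None:
    """Return None to send now, or the UTC time of the next open window.

    now_utc must be timezone-aware.
    """
    if critical:
        return None  # override-eligible categories skip quiet hours entirely
    local = now_utc.astimezone(ZoneInfo(user_tz))
    if QUIET_END <= local.hour < QUIET_START:
        return None  # inside the open window: deliver immediately
    next_open = local.replace(hour=QUIET_END, minute=0, second=0, microsecond=0)
    if local.hour >= QUIET_START:
        next_open += timedelta(days=1)  # late evening: the open window is tomorrow morning
    return next_open.astimezone(ZoneInfo("UTC"))
```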
Frequency caps
"No more than 3 marketing emails per week". Implemented by maintaining a per-user, per-category counter with a sliding window. At fan-out: check counter; if over cap, drop or queue for next window.
This is a mini rate-limiter built into the notifications path.
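A counter sketch assuming Redis; this uses a fixed-window approximation of the sliding window described above, which is usually good enough for marketing caps.

```python
import time

import redis  # assumed shared counter store

r = redis.Redis(host="prefs-cache", port=6379)

def within_cap(user_id: str, category: str, cap: int, window_seconds: int) -> bool:
    """True if this delivery still fits under the per-user, per-category cap."""
    bucket = int(time.time()) // window_seconds
    key = f"freqcap:{user_id}:{category}:{bucket}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_seconds * 2)  # keep the counter only as long as needed
    return count <= cap
```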
Bundling / digest mode
Some users prefer "send me a summary at 5pm" instead of real-time. The fan-out service writes events to a per-user pending-bundle table. A scheduled job (cron at 5pm per timezone) compiles the bundle and sends one email.
Compliance hooks
- Email: every email includes a one-click unsubscribe link (CAN-SPAM Section 5(a)(5)). Unsubscribe must take effect within 10 business days; in practice, immediately.
- SMS: STOP keyword unsubscribes; HELP keyword returns identification. Carrier-enforced.
- GDPR: per-user consent records with timestamp + source. Audit trail for consent withdrawal.
The preference service owns this state and is authoritative. Other services consult it; never bypass.
Subscription management UI
Users access /settings/notifications. Listed by topic; each topic has a per-channel toggle. Saved settings flow to the preference service via API; immediately effective on next event.
Enterprise / admin overrides
For B2B products, administrators may force certain notifications regardless of user preferences (security alerts, billing notices). The preference service has a "mandatory" flag per (org, topic) that overrides user toggles for org-internal events.
6. Ordering, partitioning, and the consumer scaling problem
Kafka's central trick: partitions provide parallelism for free, but ordering is preserved only within a partition. The choice of partition key shapes the entire system.
Partition by what?
- By user_id: all events for one user land on one partition. Per-user ordering preserved (a 'follow' event before a 'message' event is processed in order). Cost: hot users create hot partitions.
- By event_type: skewed - a high-volume event type (page_view) overwhelms its partition.
- By tenant_id: per-tenant ordering; balanced if tenant size is uniform.
- By random hash: perfect load balancing; no ordering guarantees.
For notifications, partition by user_id is the standard. Per-user ordering matters (notifications about a single conversation should arrive in order); cross-user ordering doesn't.
Hot partition mitigation
A celebrity user (millions of incoming notifications) creates a hot partition. Mitigation:
- Sub-partition celebrity users (user_id || sub_partition) for parallelism within the user, accepting a slight ordering loss; see the key sketch after this list.
- Promote celebrity events to a dedicated topic with more partitions.
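A key-derivation sketch for the first mitigation; the celebrity flag and sub-partition count are assumptions.

```python
import hashlib

CELEBRITY_SUB_PARTITIONS = 8  # illustrative; tune to the hot user's volume

def partition_key(user_id: str, event_id: str, is_celebrity: bool) -> str:
    if not is_celebrity:
        return user_id  # normal users keep strict per-user ordering
    # Celebrities spread across N sub-keys: parallelism within the user,
    # at the cost of ordering across sub-partitions.
    shard = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % CELEBRITY_SUB_PARTITIONS
    return f"{user_id}:{shard}"
```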
Consumer parallelism
A topic with 100 partitions can be consumed by up to 100 parallel consumers. Adding more consumers doesn't help (idle consumers). Setting partition count requires forecasting peak parallelism needed; it's painful to change after the fact.
Rule of thumb: 2-3x more partitions than expected peak consumer count. A topic expected to peak at 50 consumers gets 100-150 partitions.
Consumer groups and offset management
Multiple workers form a consumer group; Kafka assigns partitions to workers within the group. Each worker tracks its committed offset.
Failure recovery: a worker dies; Kafka reassigns its partitions to surviving workers; they resume from the last committed offset. Some duplicates possible (events processed but not committed) - hence at-least-once + idempotency.
Rebalance pauses
When the consumer group changes (worker added/removed), Kafka rebalances - all consumers pause briefly while reassignment happens. Modern Kafka (cooperative rebalancing) reduces pause to single-partition impact instead of group-wide.
Lag monitoring
The KPI for any pub-sub system is "consumer lag" - how far behind real-time the consumer is. Alert at >1 minute lag for in-app channels, >10 minutes for email.
Lag is the integral over time of (production rate - consumption rate). When production exceeds consumption, lag grows linearly. Alarm immediately - by the time you notice manually, you're hours behind.
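A lag-measurement sketch with confluent-kafka: per-partition lag is the log-end (high watermark) offset minus the group's committed offset. In practice this usually comes from Burrow, the kafka-consumer-groups CLI, or broker metrics rather than hand-rolled code.

```python
from confluent_kafka import Consumer, TopicPartition  # assumed client library

def consumer_lag(consumer: Consumer, topic: str, partitions: list[int]) -> dict[int, int]:
    """Per-partition lag = high watermark - committed offset for this consumer group."""
    tps = [TopicPartition(topic, p) for p in partitions]
    committed = consumer.committed(tps, timeout=10)
    lag = {}
    for tp in committed:
        _, high = consumer.get_watermark_offsets(tp, timeout=10)
        current = tp.offset if tp.offset >= 0 else 0  # negative means no commit yet
        lag[tp.partition] = max(0, high - current)
    return lag
```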
Consumer scaling
Lag growing → add consumers (up to partition count). If at partition count, the only options are:
- Horizontal: add partitions (Kafka can do this online, but the key→partition mapping changes, breaking per-key ordering during the transition; repartitioning existing data means a new topic and a migration).
- Vertical: faster consumer instances.
- Reduce work per event: simplify processing, push expensive work async.
Trade-offs
Fanout-on-write vs fanout-on-read
Fanout-on-write is the default for low-latency push. Fanout-on-read is appropriate for digest emails and broadcast scenarios. Real systems use both; the boundary is set by fanout cardinality.
At-least-once vs exactly-once
Exactly-once is mostly marketing. Build at-least-once + idempotent consumers. The cost is per-consumer dedup (cheap); the benefit is operational simplicity.
Per-channel queues vs unified queue
Per-channel queues isolate failures and enable per-channel scaling. Unified queue is simpler but creates correlated failures. Production answer: per-channel.
Kafka vs SNS+SQS vs custom
Kafka: high throughput, low latency, durable retention - the dominant choice at scale. SNS+SQS: managed, simpler, lower max throughput, no replay. Custom: rarely worth it. AWS shops often start with SNS+SQS and graduate to Kafka (MSK or self-managed) as scale grows.
Push vs pull subscription model
Push (broker pushes to consumers) has lower latency but consumers can be overwhelmed. Pull (consumers fetch when ready) has natural back-pressure. Kafka and most production systems are pull-based for this reason.
Synchronous fan-out vs async
Synchronous (compute fanout in the producer's API call) couples producer latency to subscriber count. Async (queue the event, fanout in the background) decouples but adds latency. Always async for non-trivial fanout; synchronous only for 1:1 or 1:few cases where the latency cost is acceptable.
Strict ordering vs throughput
Strict per-(user, topic) ordering requires per-user partitioning, which limits parallelism. If ordering doesn't matter (independent events), random partitioning gives more throughput. Be explicit at interview time about which events need ordering.
Provider lock-in vs portability
Building on Kafka + SES + Twilio + APNs/FCM ties you to those providers. Abstracting (Twilio behind a 'SMS provider' interface) costs engineering effort but enables provider failover. For business-critical channels, the abstraction is worth it.
Centralized vs per-product platforms
A single notifications platform serving all product teams gives consistency, shared deliverability reputation, and shared compliance posture. Cost: shared platform must accommodate all teams' needs. Most companies start per-product and consolidate.
Common follow-up questions
Be ready for at least three of these. The first one is almost always asked.
- How do you handle a region-wide outage of your push notification provider?
- What changes if a single broadcast event must reach all 100M users in under 10 seconds?
- How do you implement notification preferences that respect GDPR consent records?
- What's your strategy for warming SES sending reputation when launching a new domain?
- How would you support delayed scheduled notifications ("send this at 9am tomorrow")?
- How do you A/B test notification content (subject lines, send times) at scale?
- What's your deduplication story when the same logical event can be triggered by multiple producers?
- How would you migrate from at-least-once to exactly-once (or convince the user it's not needed)?
Practice in interview format
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.