Design a Chat System (WhatsApp / Slack)
Long-lived connections, ordering guarantees, presence, and the difference between 1:1 chat and a 50K-member group.
The problem
Design the backend for a real-time chat product. Users send messages to each other (1:1 and groups), see who's online, get read receipts and typing indicators, receive push notifications when offline, and have access to message history.
This is one of the most layered system design questions because it spans long-lived connection management (WebSockets), ordering and consistency (per-conversation), presence (a stateful global service), push notifications (third-party integrations), and storage (hot vs cold tiering).
Clarifying questions
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
- Are we designing 1:1 chat only, group chat, or both? If groups, what's the max group size (10? 1K? 200K)?
- Are messages end-to-end encrypted (Signal-protocol style) or server-readable?
- What features beyond text - media, reactions, replies, threads, voice messages?
- What's the latency budget for message delivery - <100ms "typing feel" or <1s acceptable?
- How long is message history retained? Is full history searchable?
- What's the scale - DAU, messages/sec at peak, max concurrent connections?
- Do we need cross-device sync (a message read on phone shows as read on desktop)?
- Do we need delivery guarantees (exactly-once vs at-least-once)?
Requirements
Functional requirements
- Send 1:1 and group messages (text + media references)
- Real-time delivery to online recipients via persistent connection
- Push notifications to offline recipients (APNs / FCM)
- Read receipts and typing indicators
- Presence (online / offline / last-seen)
- Message history with cursor-based pagination
- Cross-device sync (phone, desktop, web)
Non-functional requirements
- Scale: 2B users, 500M concurrent connections at peak, 100B messages/day, max group size 200K members. 2-year hot retention; cold archive forever.
- Latency: Message delivery to online recipient p99 < 200ms (network round-trip is most of the budget). Send-to-recipient-screen p99 < 500ms.
- Availability: 99.99% message delivery. A message accepted by the server must never be lost. Brief connection outages are recoverable - clients reconnect and replay.
- Consistency: Per-conversation message order is strict (every participant sees the same order). Read receipts and typing indicators are best-effort (drops are acceptable). Cross-device sync is eventually consistent within seconds.
Capacity estimation
Storage
- Messages: 100B/day × 365 × 2y hot = 73 trillion messages. ~200 bytes/message = 14.6 PB. With 3x replication: ~44 PB hot. Cold archive: PB-scale, S3 / Glacier / proprietary cold tier.
- Per-user metadata (last-read pointers, conversation list): ~1KB × 2B users = 2 TB. Trivial.
- Group membership: a 200K-person group has 200K rows in the membership table. 1% of users in groups of 1000+: 20M users × 1KB = 20 GB. Also trivial.
Connections
- 500M concurrent persistent connections at peak. At 1KB memory overhead/connection (TCP buffer + WS state) = 500 GB across the connection-gateway fleet. With 100K connections/node, that's 5,000 nodes globally.
Throughput
- Sends: 100B/day = ~1.2M messages/sec average, peak ~5M messages/sec.
- Receives (after fanout): 1:1 traffic is 1.2M deliveries/sec. Group fanout depends on average group size - if 10% of messages go to 100-person groups, that's 12M deliveries/sec on that path.
Bandwidth
- Persistent connection keepalives: 500M conns × 1 keepalive/30s × 50B = ~800 MB/s globally. Trivial per node.
- Active messaging: 1.2M msg/sec × 200B = 240 MB/s globally. The data plane is small; the metadata + control plane (presence, read receipts) dominates.
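A quick way to sanity-check these numbers is to re-run the arithmetic; a sketch that just encodes the assumptions stated above:

```python
# Back-of-envelope check for the capacity numbers above.
MSGS_PER_DAY = 100e9          # 100B messages/day
MSG_BYTES = 200               # average message size
HOT_RETENTION_DAYS = 2 * 365  # 2-year hot tier
REPLICATION = 3
CONCURRENT_CONNS = 500e6
CONN_OVERHEAD_BYTES = 1_000   # TCP + WebSocket state per connection
CONNS_PER_NODE = 100_000

hot_messages = MSGS_PER_DAY * HOT_RETENTION_DAYS          # ~73 trillion
hot_bytes = hot_messages * MSG_BYTES                       # ~14.6 PB raw
hot_bytes_replicated = hot_bytes * REPLICATION             # ~44 PB

sends_per_sec = MSGS_PER_DAY / 86_400                      # ~1.2M msg/s average
gateway_memory = CONCURRENT_CONNS * CONN_OVERHEAD_BYTES    # ~500 GB fleet-wide
gateway_nodes = CONCURRENT_CONNS / CONNS_PER_NODE          # ~5,000 nodes

print(f"hot store: {hot_bytes_replicated / 1e15:.0f} PB with replication")
print(f"avg sends: {sends_per_sec / 1e6:.1f}M msg/s, gateways: {gateway_nodes:,.0f} nodes")
```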
High-level architecture
The system has three tiers that must be reasoned about separately:
- Connection tier - manages 500M persistent WebSocket connections. Stateful (each connection terminates on a specific node). Handles fanout to online users.
- Messaging tier - stateless. Validates, orders, persists messages. Publishes delivery events.
- Storage + supporting services - durable message store, group membership, presence, push notifications.
A message's life: client sends over WebSocket → connection node forwards to message service → message service assigns ID and persists → publishes "delivered" event → connection nodes for online recipients push to their sockets → push notification service handles offline recipients.
WebSocket gateway (connection tier)
Terminates client persistent connections. Routes inbound messages to the message service. Receives delivery events and pushes to connected sockets. Stateful - each connection terminates on a specific node; a user's connection lands on whichever node has capacity, and the connection registry records where.
Connection registry
Maps user_id → (gateway_node, connection_id). Updated on connect/disconnect. Used by message service to find the right gateway when delivering.
Message service
Stateless. Assigns conversation-scoped message IDs (Snowflake or per-conversation counter), persists, publishes 'message_sent' event.
Message store
Time-series storage partitioned by conversation_id. Recent (hot) in fast storage; older (cold) tiered. Cassandra / HBase / proprietary.
Group membership service
Stores group → members and member → groups indexes. Both directions matter: fanout needs members(group); login needs groups(user).
Presence service
Tracks online status. Updated on connection events. Published as a stream so clients can subscribe to status changes for relevant contacts.
Push notification service
Bridges offline message events to APNs (iOS) / FCM (Android). Handles batching, deduplication, badge count.
Media service
Out of band. Client uploads media to object storage (signed URL); message contains the URL. Decoupled from the messaging path.
Deep dives
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
1. Connection management at 500M concurrent
WebSockets are the dominant choice. SSE doesn't support client-to-server messages over the same channel; long-poll has too much per-message overhead at this scale.
Per-node connection capacity
A well-tuned Linux box (jumbo NIC, tuned TCP, kernel with epoll) can handle 100K-500K idle WebSocket connections per process. Per-process memory is dominated by TCP send/receive buffers (~5KB each, default). Tune buffers down for chat (small messages → small buffers).
Sharding strategy
Don't shard by user_id - it locks a user to a single failed node. Instead: client connects to the regional load balancer; LB routes to any healthy node with capacity. Once connected, the connection registry records (user_id → node_id, connection_id) so other services can deliver to it.
Reconnection storm
If a node fails, 100K-500K clients reconnect simultaneously. Mitigation: clients use exponential backoff with jitter on reconnect. Server side: ELB/ALB limits new conns/sec per IP and per node.
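Client-side, that backoff is only a few lines; a minimal "full jitter" sketch (the base and cap are illustrative):

```python
import random

def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2^attempt)].

    Spreads a failed node's 100K+ reconnecting clients across the window
    instead of stampeding the load balancer in the same second.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# attempt 0 -> up to 1s, attempt 3 -> up to 8s, attempt 10 -> capped at 60s.
```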
Connection registry
Hot KV store (Redis Cluster). On connect: SET user_id:device_id → node_id, TTL 60s. Heartbeat every 30s refreshes TTL. On graceful disconnect: DEL. On node failure: TTL expires, the user is implicitly offline. This trades a 60s "last-seen" lag for not needing distributed consensus on connection state.
Multi-device
A user can have 4 devices (phone, tablet, laptop, web). Each device gets its own connection. Connection registry is keyed by (user_id, device_id). Delivery iterates over all devices.
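A sketch of that registry contract with redis-py; the key names, the 60s TTL, and the assumption that the delivery path already knows the user's registered device list are all illustrative:

```python
import redis

r = redis.Redis()  # Redis Cluster in production
TTL_SECONDS = 60   # missed heartbeats -> key expires -> device is implicitly offline

def register(user_id: str, device_id: str, node_id: str) -> None:
    # Called by the gateway node on connect and again on every 30s heartbeat.
    r.set(f"conn:{user_id}:{device_id}", node_id, ex=TTL_SECONDS)

def unregister(user_id: str, device_id: str) -> None:
    # Graceful disconnect; on node crash the TTL expiry does this for us.
    r.delete(f"conn:{user_id}:{device_id}")

def online_devices(user_id: str, device_ids: list[str]) -> dict[str, str]:
    # Delivery path: which of the user's known devices are connected, and on which node?
    nodes = r.mget([f"conn:{user_id}:{d}" for d in device_ids])
    return {d: n.decode() for d, n in zip(device_ids, nodes) if n is not None}
```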
2. Message ordering and IDs
Per-conversation order is strict; cross-conversation order is irrelevant.
ID scheme
Per-conversation Lamport-style counter. The message service maintains (conversation_id → next_message_id) in a fast KV with strong consistency (DynamoDB conditional update, Spanner, or Raft-coordinated). On send: atomic increment, attach ID, persist message, publish event.
Why per-conversation rather than global? Two reasons. (1) Global Snowflake IDs are time-ordered but not strictly increasing across machines under clock skew - in a chat where order matters, that's a defect. (2) Coordination is cheap when scoped to a single conversation (one row in the KV), but expensive globally.
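One way to express that atomic increment is DynamoDB's update API; a sketch where the table and attribute names are assumptions (a Spanner row or a Raft-backed counter fills the same role):

```python
import boto3

counters = boto3.resource("dynamodb").Table("conversation_counters")

def next_message_id(conversation_id: str) -> int:
    # Atomic read-modify-write on a single item: one coordination point per
    # conversation, no global lock, strictly increasing IDs within the conversation.
    resp = counters.update_item(
        Key={"conversation_id": conversation_id},
        UpdateExpression="ADD next_message_id :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    return int(resp["Attributes"]["next_message_id"])
```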
Sender clocks vs server clocks
Don't trust the client's clock. Order by server-assigned ID, not client timestamp. Show the server timestamp on the receiver's screen.
Out-of-order delivery
Even with strict server ordering, network reordering can deliver message 5 before message 4 to a recipient. The recipient client reorders by ID before display. A client-side buffer of 1-2 seconds is usually enough to absorb reordering.
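On the receiving client, that buffer is a small min-heap keyed by message ID; a sketch of one way to do it, with the hold window as an illustrative constant:

```python
import heapq
import itertools
import time

class ReorderBuffer:
    """Hold just-arrived messages briefly, then hand them to the UI in ID order."""

    def __init__(self, hold_seconds: float = 1.5):
        self.hold = hold_seconds
        self._seq = itertools.count()                  # heap tie-breaker
        self.pending: list[tuple[int, int, float, dict]] = []

    def receive(self, message: dict) -> None:
        entry = (message["id"], next(self._seq), time.monotonic(), message)
        heapq.heappush(self.pending, entry)

    def ready(self) -> list[dict]:
        # Called on a UI tick. Release, in ID order, every message that has waited
        # long enough for lower-ID stragglers to arrive (the 1-2s absorption window).
        cutoff = time.monotonic() - self.hold
        out = []
        while self.pending and self.pending[0][2] <= cutoff:
            out.append(heapq.heappop(self.pending)[3])
        return out
```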
Idempotency
Client retries on send (network flake). Each retry must produce the same server-side ID, not duplicate. Solution: client includes a UUID (client_message_id) on send. The server: "if (conversation_id, client_message_id) already exists, return existing ID; else assign new". Idempotent within the conversation.
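The dedup can be a conditional write on (conversation_id, client_message_id); a sketch, again on DynamoDB, where the send_dedup table, its key schema, and the attribute names are assumptions:

```python
import boto3

dedup = boto3.resource("dynamodb").Table("send_dedup")
already_exists = dedup.meta.client.exceptions.ConditionalCheckFailedException

def record_send(conversation_id: str, client_message_id: str, message_id: int) -> int:
    """Return the canonical server ID for this client send, existing or new."""
    try:
        dedup.put_item(
            Item={
                "conversation_id": conversation_id,
                "client_message_id": client_message_id,
                "message_id": message_id,
            },
            # Succeeds only the first time this (conversation, client id) pair is seen.
            ConditionExpression="attribute_not_exists(client_message_id)",
        )
        return message_id
    except already_exists:
        # Retry of a send we already processed: hand back the original ID.
        prior = dedup.get_item(Key={
            "conversation_id": conversation_id,
            "client_message_id": client_message_id,
        })
        return int(prior["Item"]["message_id"])
```

Assigning the ID before the dedup check wastes an increment on retries; that's fine, since per-conversation IDs only need to be increasing, not gap-free.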
Deletion / edit semantics
Tombstone the message (mark deleted=true, content=null). Don't actually delete - downstream consumers (search index, archive) need the event. Show "this message was deleted" in the UI.
3. Presence: the stateful global problem
Presence ("is X online right now?") is deceptively hard at 500M concurrent users.
Naive approach
Connection registry has user_id → online; every contact polls it. This fails: a user with 500 contacts triggers 500 lookups every few seconds.
Subscription approach
Clients subscribe to presence updates for users in their contact list. Presence service publishes connect/disconnect events to a per-user pub/sub topic. Subscribers receive deltas only.
At 2B users × 500 contacts each = 1T subscriptions. Doesn't fit in any single system. Shard the presence service by the observed user (the one whose status is published), and clients open subscriptions to all shards their contacts live on. In practice: ~100 presence shards globally, each carrying ~10B subscriptions.
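Shard routing can be a plain hash of the observed user's ID; a sketch where the shard count and the topic-naming scheme are assumptions:

```python
import hashlib

NUM_PRESENCE_SHARDS = 100

def presence_shard(user_id: str) -> int:
    # The user whose status is being observed determines the shard, so all of
    # that user's watchers subscribe in the same place.
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PRESENCE_SHARDS

def presence_topic(user_id: str) -> str:
    return f"presence.shard-{presence_shard(user_id)}.{user_id}"

# A client with contacts A, B, C subscribes to presence_topic(A), (B), (C) --
# ideally lazily, only when it opens those conversations (see "scale relief" below).
```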
Last-seen vs online
"Online" is binary (have an active connection). "Last-seen" is a timestamp updated on disconnect. Both are derived from the connection registry.
Privacy
Most products let users hide presence. The presence service consults a privacy table before publishing. Hidden users still appear "online" to themselves and to anyone on an explicit allowlist (e.g., favorites).
Scale relief
Most contacts a user has are dormant. Don't subscribe to all 500 contacts on connect - subscribe lazily as the user opens conversations. This reduces active subscriptions by 10-100x.
WhatsApp's actual approach: presence is best-effort, eventually consistent, with privacy controls. Presence drops during cell tower handoffs; users are familiar with the model. Don't over-engineer.
4. Group messages and large-group fanout
1:1 is easy: one sender → one recipient. Groups are a fanout problem that scales with group size.
Small groups (≤100 members)
Synchronous fanout. Message service writes the message once to the message store, then publishes one delivery event per member. Members' connection nodes push to their sockets. Total work: O(group_size) per message. Trivial.
Medium groups (100 - 10K)
Async fanout via a dedicated queue. Avoid blocking the sender's send-ack on the full fanout. Sender sees "delivered to server"; recipients see message within a couple of seconds.
Large groups (10K - 200K)
Pure fanout-on-send is expensive: a 100-message burst in a 200K group is 20M deliveries.
Strategies for large groups
- Hierarchical fanout: split the member list into chunks (1K each); fanout workers process chunks in parallel (see the sketch after this list).
- Pull-based reads for inactive members: only fan out to currently online members. Inactive members' clients pull on next open via a "messages since last_read" API.
- Mute / read-only optimizations: members who muted notifications don't get push - skip the push notification path.
- Read receipts compression: don't fanout per-member read receipts to the entire group. Aggregate into "X people have read" indicators.
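A sketch of the chunked, online-only fanout described above; registry and gateways are injected dependencies whose method names are assumptions, and the chunk size is illustrative:

```python
from collections import defaultdict
from itertools import islice

CHUNK_SIZE = 1_000

def chunks(members: list[str], size: int = CHUNK_SIZE):
    # Split the (possibly 200K-long) member list into fixed-size chunks.
    it = iter(members)
    while batch := list(islice(it, size)):
        yield batch

def fanout_chunk(message: dict, member_chunk: list[str], registry, gateways) -> None:
    # Group online recipients by the gateway node that owns their connection,
    # so each node gets one batched push instead of one call per recipient.
    by_node: dict[str, list[str]] = defaultdict(list)
    for user_id in member_chunk:
        # registry.online_devices wraps the Redis lookup sketched earlier,
        # returning {device_id: node_id} for the user's connected devices.
        for device_id, node_id in registry.online_devices(user_id).items():
            by_node[node_id].append(f"{user_id}:{device_id}")
    for node_id, targets in by_node.items():
        gateways.push_batch(node_id, targets, message)   # hypothetical gateway RPC
    # Offline members are skipped: they pull via "messages since last_read" on
    # next open, and the push-notification path covers their OS alerts.

def fanout(message: dict, member_ids: list[str], registry, gateways) -> None:
    for chunk in chunks(member_ids):
        fanout_chunk(message, chunk, registry, gateways)  # parallelized across workers
```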
Group membership lookup
A 200K-member group has a 200K-row membership list. Cache it in Redis keyed by group_id. Update incrementally on join/leave. Eviction policy: LRU; cold groups rebuild from the source DB.
The very large channel (Slack #general at a 50K-person company)
Treat as a fanout-on-read product. Members pull on open; only mentions trigger fanout-on-send.
5. Push notifications and the offline path
When a recipient is offline (no active connection), the message must reach them via OS push (APNs / FCM).
Trigger
Connection registry says recipient is offline → message service publishes a 'push' event to the push pipeline.
Push pipeline
- Lookup device tokens for the user (a user can have multiple devices).
- Apply per-device preferences (DND windows, mute settings, group-level mutes).
- Format payload (sender, message snippet, badge count).
- Send to APNs/FCM via batched HTTP/2.
- Track success/failure (retry transient failures; mark token invalid if APNs returns InvalidToken).
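Most of this pipeline is integration plumbing; the part worth sketching is the per-device filtering. A sketch with the token and preference lookups left as injected dependencies - store, prefs, and their methods are assumptions, not a real APNs/FCM client:

```python
from datetime import datetime

def build_pushes(user_id: str, message: dict, store, now: datetime) -> list[dict]:
    """Decide which of the user's device tokens should receive an OS push."""
    payloads = []
    for device in store.device_tokens(user_id):            # a user has multiple devices
        prefs = store.preferences(user_id, device["id"])
        if prefs.in_dnd_window(now):
            continue                                        # quiet hours
        if prefs.is_muted(message["conversation_id"]):
            continue                                        # group-level mute
        payloads.append({
            "token": device["token"],
            "title": message["sender_name"],
            "body": message.get("snippet", "New message"),  # generic text under E2EE
            "badge": store.unread_total(user_id),           # server-side unread count
        })
    return payloads  # handed to a batched HTTP/2 sender for APNs / FCM
```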
Dedup across re-arrival
A user comes online while a push was in-flight. They get both an in-app notification and an OS push. Dedup is impossible without coordination - mitigation: clients suppress the notification banner if the conversation is currently focused.
Badge count
Source of truth: count of unread messages per user. Maintained server-side. Sent on each push. Resync on app open.
End-to-end encryption interaction
With E2EE, the server can't read message content. The push payload contains only metadata (sender, conversation_id), and the APNs alert text is generic ("New message"). The push wakes the app, which fetches the encrypted message over its connection and decrypts it locally. UX is degraded, but that's the price of privacy.
Failure modes
APNs/FCM down → backoff and retry; messages still sit in the message store, available on next online connection. Don't block the messaging path on push success.
6. Storage tiering and history
100B messages/day × 200 bytes × 365 × 2y = 14.6 PB hot. Much of this is rarely read.
Hot tier (last 30 days)
Cassandra / HBase. Partition key = conversation_id, clustering key = message_id (descending). Range scan on partition for "fetch last N messages" is the dominant query. Sub-100ms.
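A sketch of that layout and the dominant query using the Python cassandra-driver; the keyspace, table, and column names are illustrative:

```python
from cassandra.cluster import Cluster

session = Cluster(["cassandra-hot"]).connect("chat")

# Partition by conversation, cluster by message_id descending, so
# "last N messages" is a single-partition, already-sorted range scan.
session.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        conversation_id text,
        message_id      bigint,
        sender_id       text,
        body            blob,
        PRIMARY KEY ((conversation_id), message_id)
    ) WITH CLUSTERING ORDER BY (message_id DESC)
""")

def last_messages(conversation_id: str, before_id: int, limit: int = 50):
    # Cursor-based pagination: before_id is the oldest message the client already has.
    return session.execute(
        "SELECT message_id, sender_id, body FROM messages "
        "WHERE conversation_id = %s AND message_id < %s "
        f"LIMIT {int(limit)}",
        (conversation_id, before_id),
    )
```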
Warm tier (30 days - 1 year)
Same store, different cluster, smaller compute. Slower reads acceptable.
Cold archive (>1 year)
S3 / Glacier / proprietary cold storage. Indexed by (conversation_id, time_range). Restored to warm on demand for full-history searches.
Per-user inbox vs per-conversation log
Two valid models.
- Per-user: each user has a list of messages they received. Easy "all my unread" query. Bad for groups (write amplification = group size per message).
- Per-conversation: messages stored once per conversation. Each user has a (conversation_id → last_read_message_id) cursor. Reads are conversation-scoped. Writes are constant.
The per-conversation model wins at scale and is what WhatsApp/Signal/Telegram use.
Read pointers
Per (user, conversation): last_read_message_id. Updated on read. Used for unread counts and read receipts. Tiny rows; updated frequently. Hot KV store, eventually consistent.
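A sketch of the read-pointer store on a Redis hash (key names are assumptions); the unread count falls out of the per-conversation counter IDs:

```python
import redis

r = redis.Redis()

def mark_read(user_id: str, conversation_id: str, message_id: int) -> None:
    # One hash per user: field = conversation, value = last read message ID.
    r.hset(f"readptr:{user_id}", conversation_id, message_id)

def unread_count(user_id: str, conversation_id: str, latest_message_id: int) -> int:
    # Works because message IDs are a per-conversation counter; small gaps from
    # retried sends make this a slight upper bound, which is acceptable for a badge.
    last_read = r.hget(f"readptr:{user_id}", conversation_id)
    return latest_message_id - (int(last_read) if last_read else 0)
```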
Search
Out of band. An async indexer reads message events from Kafka and writes to Elasticsearch keyed by (user_id, conversation_id, message_id). Search is per-user (privacy) and conversation-scoped. Cold-tier messages need re-indexing on first search.
Trade-offs
Server-readable vs end-to-end encrypted
E2EE (Signal protocol) eliminates server-side search, server-side spam filtering, server-side analytics, and easy backup. It's the right call for privacy-first products (WhatsApp, Signal). It's the wrong call for productivity tools where compliance and search are required (Slack). Mention which model you're designing for early.
Per-conversation counter vs Snowflake
Per-conversation counter gives strict ordering at the cost of one coordination point per conversation. Snowflake is wait-free but allows tiny ordering violations under clock skew. Chat needs strict order - take the counter.
At-least-once vs exactly-once delivery
Exactly-once is impossible without coordination on the receiver side. At-least-once + idempotent message IDs (client_message_id) gives you the practical equivalent: duplicate messages on retry are deduped at the receiver. This is the standard pattern.
Connection-tier statefulness
You can't make connection nodes stateless without losing the persistent connection. Accept the statefulness; design around it (sharding, registry lookups, graceful failover). A common interview anti-pattern is to over-design connection statelessness.
Group size limits
Telegram supports 200K-member groups; Discord supports millions in some channels. The architecture changes meaningfully above ~10K. Be explicit at interview time about what bound you're designing for and what changes at the next tier up.
Storage cost vs full history
Storing full history for 2B users gets expensive. WhatsApp's solution: full history lives on-device, the server is a transit relay (24-hour buffer). Telegram's solution: cloud storage, full history searchable. Different products, different trade-offs - call it out.
Common follow-up questions
Be ready for at least three of these. The first one is almost always asked.
- How would you support voice/video calls on the same connection layer?
- What changes if all messages must be end-to-end encrypted?
- How do you scale to a 1M-member group (e.g., Telegram broadcast channel)?
- How do you implement message editing and deletion with read receipts in flight?
- How would you sync messages across devices (multi-device with E2EE)?
- What's your plan for handling abusive users at scale (block, report, ban)?
- How do you handle a region-level outage of the connection tier?
- What does the message search architecture look like, and what's the privacy story?
Practice in interview format
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
Start an AI mock interview →