Long-lived connections, ordering guarantees, presence, and the difference between 1:1 chat and a 50K-member group.
Design the backend for a real-time chat product. Users send messages to each other (1:1 and groups), see who's online, get read receipts and typing indicators, receive push notifications when offline, and have access to message history.
This is one of the most layered system design questions because it spans long-lived connection management (WebSockets), ordering and consistency (per-conversation), presence (a stateful global service), push notifications (third-party integrations), and storage (hot vs cold tiering).
Asking clarifying questions before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
Capacity estimates to work through: storage, connections, throughput, bandwidth.
The system has three tiers that must be reasoned about separately:
A message's life: client sends over WebSocket → connection node forwards to message service → message service assigns ID and persists → publishes "delivered" event → connection nodes for online recipients push to their sockets → push notification service handles offline recipients.
Connection gateway (connection nodes)
Terminates persistent client connections. Routes inbound messages to the message service. Receives delivery events and pushes them to connected sockets. Stateful, but not sharded by user_id: a user's connection lands on whichever node has capacity, and the connection registry records where it landed.
Connection registry
Maps user_id → (gateway_node, connection_id). Updated on connect/disconnect. Used by the message service to find the right gateway when delivering.
Message service
Stateless. Assigns conversation-scoped message IDs (Snowflake or per-conversation counter), persists the message, and publishes a 'message_sent' event.
Message store
Time-series storage partitioned by conversation_id. Recent (hot) data in fast storage; older (cold) data tiered. Cassandra / HBase / proprietary.
Group membership service
Stores group → members and member → groups indexes. Both directions matter: fanout needs members(group); login needs groups(user).
Presence service
Tracks online status. Updated on connection events. Published as a stream so clients can subscribe to status changes for relevant contacts.
Push notification service
Bridges offline message events to APNs (iOS) / FCM (Android). Handles batching, deduplication, and badge count.
Media
Out of band. The client uploads media to object storage (signed URL); the message contains the URL. Decoupled from the messaging path.
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
WebSockets are the dominant choice. SSE doesn't support client-to-server messages over the same channel; long-poll has too much per-message overhead at this scale.
Per-node connection capacity
A well-tuned Linux box (fast NIC, tuned TCP stack, epoll-based event loop) can handle 100K-500K idle WebSocket connections per process. Per-process memory is dominated by TCP send/receive buffers (~5KB each in this estimate). Tune buffers down for chat (small messages → small buffers).
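As a rough back-of-the-envelope check on the capacity range above, the sketch below multiplies the connection count by per-connection buffer memory. It assumes ~5 KB for each of the send and receive buffers (one reading of the figure in the text); the function name is illustrative, and real kernels size buffers dynamically, so treat the result as an order-of-magnitude estimate only.

```python
def buffer_memory_gib(connections: int,
                      send_buf_bytes: int = 5 * 1024,
                      recv_buf_bytes: int = 5 * 1024) -> float:
    """Approximate socket-buffer memory for idle connections, in GiB.

    Assumption: every connection holds one send and one receive buffer
    of the given sizes. Real kernels allocate on demand.
    """
    total_bytes = connections * (send_buf_bytes + recv_buf_bytes)
    return total_bytes / 2**30


if __name__ == "__main__":
    for conns in (100_000, 500_000):
        print(f"{conns:>7} connections -> ~{buffer_memory_gib(conns):.2f} GiB")
```

Under these assumptions the buffers alone stay in the low single-digit GiB range even at 500K connections, which is why shrinking them is worthwhile but not the only limit on per-node capacity.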
Sharding strategy
Don't shard by user_id - it pins each user to a single node, so when that node fails the user has nowhere else to connect. Instead: the client connects to the regional load balancer; the LB routes to any healthy node with capacity. Once connected, the connection registry records (user_id → node_id, connection_id) so other services can deliver to it.
Reconnection storm
If a node fails, 100K-500K clients reconnect simultaneously. Mitigation on the client side: exponential backoff with jitter on reconnect. On the server side: the load balancer tier (e.g., ELB/ALB plus rate limiting) caps new connections per second, per IP and per node.
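A minimal sketch of the client-side mitigation, assuming the "full jitter" variant of exponential backoff (the text does not say which variant to use). The function name, the 1-second base, and the 60-second cap are illustrative choices, not values from the source.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                    rng: Optional[random.Random] = None) -> float:
    """Return how long to wait before reconnect attempt `attempt` (0-based).

    Full jitter: pick uniformly in [0, min(cap, base * 2**attempt)] so that
    clients dropped at the same instant spread their reconnects out.
    """
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    for attempt in range(8):
        print(f"attempt {attempt}: wait {reconnect_delay(attempt, rng=rng):.2f}s")
```

The randomization matters more than the exponent: without it, every client from the failed node retries on the same schedule and the storm simply repeats at each interval.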
Connection registry
Hot KV store (Redis Cluster). On connect: SET user_id:device_id → node_id, TTL 60s. Heartbeat every 30s refreshes TTL. On graceful disconnect: DEL. On node failure: TTL expires, the user is implicitly offline. This trades a 60s "last-seen" lag for not needing distributed consensus on connection state.
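The TTL-and-heartbeat behaviour described above can be sketched with an in-memory stand-in. This is not Redis client code: the class, its method names, and the injectable clock are all hypothetical, chosen so the expiry logic can be exercised without a running store.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str]  # (user_id, device_id)


class ConnectionRegistry:
    """In-memory stand-in for a TTL'd KV store such as Redis."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (node_id, expires_at)
        self._entries: Dict[Key, Tuple[str, float]] = {}

    def connect(self, user_id: str, device_id: str, node_id: str) -> None:
        """Record the connection with a fresh TTL (the SET ... TTL step)."""
        self._entries[(user_id, device_id)] = (node_id, self._clock() + self._ttl)

    def heartbeat(self, user_id: str, device_id: str) -> bool:
        """Refresh the TTL; returns False if the entry already expired."""
        node_id = self.lookup(user_id, device_id)
        if node_id is None:
            return False
        self._entries[(user_id, device_id)] = (node_id, self._clock() + self._ttl)
        return True

    def disconnect(self, user_id: str, device_id: str) -> None:
        """Graceful disconnect (the DEL step)."""
        self._entries.pop((user_id, device_id), None)

    def lookup(self, user_id: str, device_id: str) -> Optional[str]:
        """Return the node holding the connection, or None if offline/expired."""
        key = (user_id, device_id)
        entry = self._entries.get(key)
        if entry is None:
            return None
        node_id, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazy expiry, mimicking TTL eviction
            return None
        return node_id
```

If a node dies without sending disconnects, no cleanup code runs: heartbeats stop, the entries age out, and lookups start returning None within one TTL.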
Multi-device
A user may have several devices (e.g., phone, tablet, laptop, web). Each device gets its own connection. The connection registry is keyed by (user_id, device_id). Delivery iterates over all of a user's devices.
Per-conversation order is strict; cross-conversation order is irrelevant.
ID scheme
Per-conversation Lamport-style counter. The message service maintains (conversation_id → next_message_id) in a fast KV with strong consistency (DynamoDB conditional update, Spanner, or Raft-coordinated). On send: atomic increment, attach ID, persist message, publish event.
Why per-conversation rather than global? Two reasons. (1) Global Snowflake IDs are time-ordered but not strictly increasing across machines under clock skew - in a chat where order matters, that's a defect. (2) Coordination is cheap when scoped to a single conversation (one row in the KV), but expensive globally.
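To make the per-conversation counter concrete, here is a single-process sketch. It assumes a lock-protected in-memory map in place of the strongly consistent KV store named above, so it illustrates the atomic-increment contract rather than a distributed implementation. The class and method names are hypothetical.

```python
import threading
from collections import defaultdict
from typing import DefaultDict


class ConversationSequencer:
    """Hands out strictly increasing message IDs, scoped per conversation."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last_id: DefaultDict[str, int] = defaultdict(int)

    def next_id(self, conversation_id: str) -> int:
        # The lock stands in for the store's atomic increment
        # (e.g., a conditional update); IDs start at 1.
        with self._lock:
            self._last_id[conversation_id] += 1
            return self._last_id[conversation_id]


if __name__ == "__main__":
    seq = ConversationSequencer()
    print([seq.next_id("conv-a") for _ in range(3)])  # IDs for conv-a
    print(seq.next_id("conv-b"))                      # conv-b has its own sequence
```

Each conversation touches only its own counter, so two busy conversations never contend with each other. That is the "coordination is cheap when scoped" argument in code form.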
Sender clocks vs server clocks
Don't trust the client's clock. Order by server-assigned ID, not client timestamp. Show the server timestamp on the receiver's screen.
Out-of-order delivery
Even with strict server ordering, network reordering can deliver message 5 before message 4 to a recipient. The recipient client reorders by ID before display. Client-side buffer of 1-2 seconds is usually enough to absorb reordering.
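One way the client-side reordering could look, as a sketch under stated assumptions: message IDs are dense per conversation (a per-conversation counter), the caller passes in the current time, and a gap older than `max_wait` seconds is skipped rather than waited on forever. The class name and the skip-ahead policy are illustrative, not taken from the source.

```python
from typing import Any, Dict, List, Tuple


class ReorderBuffer:
    """Releases messages in ID order, holding early arrivals briefly."""

    def __init__(self, next_expected: int = 1, max_wait: float = 2.0) -> None:
        self.next_expected = next_expected
        self.max_wait = max_wait
        # message_id -> (payload, arrival_time)
        self._pending: Dict[int, Tuple[Any, float]] = {}

    def receive(self, message_id: int, payload: Any, now: float) -> List[Tuple[int, Any]]:
        """Accept one message; return whatever is now ready to display."""
        if message_id >= self.next_expected:       # older IDs are duplicates
            self._pending.setdefault(message_id, (payload, now))
        return self.flush(now)

    def flush(self, now: float) -> List[Tuple[int, Any]]:
        """Drain in-order messages; also call this from a periodic timer."""
        ready: List[Tuple[int, Any]] = []
        while self._pending:
            if self.next_expected in self._pending:
                payload, _ = self._pending.pop(self.next_expected)
                ready.append((self.next_expected, payload))
                self.next_expected += 1
                continue
            oldest_id = min(self._pending)
            _, arrived = self._pending[oldest_id]
            if now - arrived >= self.max_wait:
                # Waited long enough for the gap; skip ahead. A real client
                # would likely also refetch the missing range from history.
                self.next_expected = oldest_id
                continue
            break
        return ready
```

For example, if message 2 arrives before message 1, `receive(2, ...)` returns an empty list and the later `receive(1, ...)` returns both messages in order.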
Idempotency
The client retries on send (network flake). Each retry must resolve to the same server-side ID rather than create a duplicate message. Solution: the client includes a UUID (client_message_id) on send. The server: "if (conversation_id, client_message_id) already exists, return the existing ID; else assign a new one". Idempotent within the conversation.
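The server-side rule quoted above can be sketched as follows. Storage is an in-memory dict purely for illustration; in a real system the existence check and the insert would need to be a single atomic operation in the message store. All names other than `conversation_id` and `client_message_id` are hypothetical.

```python
import uuid
from typing import Dict, List, Tuple


class IdempotentSender:
    """Assigns message IDs so that retries of the same send return the same ID."""

    def __init__(self) -> None:
        self._last_id: Dict[str, int] = {}
        # (conversation_id, client_message_id) -> assigned message_id
        self._seen: Dict[Tuple[str, str], int] = {}
        self.log: Dict[str, List[Tuple[int, str]]] = {}

    def send(self, conversation_id: str, client_message_id: str, body: str) -> int:
        key = (conversation_id, client_message_id)
        existing = self._seen.get(key)
        if existing is not None:
            return existing                      # retry: no new message is written
        message_id = self._last_id.get(conversation_id, 0) + 1
        self._last_id[conversation_id] = message_id
        self._seen[key] = message_id
        self.log.setdefault(conversation_id, []).append((message_id, body))
        return message_id


if __name__ == "__main__":
    svc = IdempotentSender()
    cmid = str(uuid.uuid4())                     # generated once, reused on retry
    first = svc.send("conv-1", cmid, "hello")
    retry = svc.send("conv-1", cmid, "hello")
    print(first == retry, len(svc.log["conv-1"]))
```

The client must generate the UUID once per logical message and reuse it on every retry; generating a fresh one per attempt defeats the scheme.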
Deletion / edit semantics
Tombstone the message (mark deleted=true, content=null). Don't actually delete - downstream consumers (search index, archive) need the event. Show "this message was deleted" in the UI.
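A small sketch of the tombstone approach, assuming a simplified message record; the field names beyond `deleted` and `content` are illustrative.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Message:
    message_id: int
    sender_id: str
    content: Optional[str]
    deleted: bool = False


def tombstone(message: Message) -> Message:
    """Return a copy with the content cleared and the deleted flag set.

    The row (and its ID) survives, so ordering and downstream consumers
    such as a search indexer still see a record for this message.
    """
    return replace(message, content=None, deleted=True)


def render(message: Message) -> str:
    """What a client might display for the message."""
    if message.deleted:
        return "This message was deleted"
    return message.content or ""


if __name__ == "__main__":
    original = Message(message_id=7, sender_id="alice", content="oops")
    print(render(tombstone(original)))
```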
Presence ("is X online right now?") is deceptively hard at 500M concurrent users.
Naive approach
Connection registry has user_id → online. Every contact polls. Fails: a user with 500 contacts triggers 500 lookups every few seconds.
Subscription approach
Clients subscribe to presence updates for users in their contact list. Presence service publishes connect/disconnect events to a per-user pub/sub topic. Subscribers receive deltas only.
2B users × 500 contacts each = 1T potential subscriptions, which doesn't fit in any single system. Shard the presence service by the observed user (the one whose status is published); clients open subscriptions to every shard their contacts live on. In practice: ~100 presence shards globally, each holding up to ~10B subscriptions (1T / 100).
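Sharding by the observed user can be sketched as a stable hash of that user's ID. The hash choice (SHA-256), the modulo placement, and the function names are assumptions for illustration; a production system would more likely use consistent hashing so that shards can be added without remapping everyone.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List

NUM_PRESENCE_SHARDS = 100  # the shard count used in the text


def presence_shard(observed_user_id: str,
                   num_shards: int = NUM_PRESENCE_SHARDS) -> int:
    """Map the user whose status is published to a shard, deterministically."""
    digest = hashlib.sha256(observed_user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def subscriptions_by_shard(contact_ids: Iterable[str],
                           num_shards: int = NUM_PRESENCE_SHARDS) -> Dict[int, List[str]]:
    """Group a client's contacts by the shard each one lives on."""
    plan: Dict[int, List[str]] = defaultdict(list)
    for contact_id in contact_ids:
        plan[presence_shard(contact_id, num_shards)].append(contact_id)
    return dict(plan)


if __name__ == "__main__":
    contacts = [f"user-{i}" for i in range(500)]
    plan = subscriptions_by_shard(contacts)
    print(f"{len(contacts)} contacts spread over {len(plan)} shards")
```

A status change for one user is published on exactly one shard, while each subscriber talks to as many shards as its contacts span.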
Last-seen vs online
"Online" is binary (have an active connection). "Last-seen" is a timestamp updated on disconnect. Both are derived from the connection registry.
Privacy
Most products let users hide presence. The presence service consults a privacy table before publishing. Hidden users still appear "online" to themselves and to an explicit allowlist (e.g., favorites).
Scale relief
Most contacts a user has are dormant. Don't subscribe to all 500 contacts on connect - subscribe lazily as the user opens conversations. This reduces active subscriptions by 10-100x.
WhatsApp's actual approach: presence is best-effort, eventually consistent, with privacy controls. Presence drops during cell tower handoffs; users are familiar with the model. Don't over-engineer.
1:1 is easy: one sender → one recipient. Groups are a fanout problem that scales with group size.
Small groups (≤100 members)
Synchronous fanout. Message service writes the message once to the message store, then publishes one delivery event per member. Members' connection nodes push to their sockets. Total work: O(group_size) per message. Trivial.
Medium groups (100 - 10K)
Async fanout via a dedicated queue. Avoid blocking the sender's send-ack on the full fanout. Sender sees "delivered to server"; recipients see message within a couple of seconds.
Large groups (10K - 200K)
Pure fanout-on-send is expensive. A 100-message-burst in a 200K group = 20M deliveries.
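The size tiers above and the burst arithmetic can be written down as a small sketch. The thresholds come from the text; the strategy labels are hypothetical, and mapping everything above 10K to fanout-on-read is an assumption for illustration.

```python
SMALL_GROUP_MAX = 100       # synchronous fanout up to this size
MEDIUM_GROUP_MAX = 10_000   # async queue fanout up to this size


def fanout_strategy(group_size: int) -> str:
    """Pick a delivery strategy label from the group size."""
    if group_size <= SMALL_GROUP_MAX:
        return "sync_fanout"
    if group_size <= MEDIUM_GROUP_MAX:
        return "async_queue_fanout"
    # Assumption: large groups lean on pull-based delivery.
    return "fanout_on_read"


def deliveries(messages: int, group_size: int) -> int:
    """Deliveries generated by pure fanout-on-send: one per member per message."""
    return messages * group_size


if __name__ == "__main__":
    for size in (8, 2_500, 200_000):
        print(f"{size:>7} members -> {fanout_strategy(size)}")
    print(f"100-message burst in a 200K group: {deliveries(100, 200_000):,} deliveries")
```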
Strategies for large groups
Group membership lookup
A 200K-member group has a 200K-row membership list. Cache it in Redis keyed by group_id. Update incrementally on join/leave. Eviction policy: LRU; cold groups rebuild from the source DB.
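A sketch of the membership cache behaviour described above, using an in-process LRU in place of Redis. The class, its method names, and the `load_members` callback (standing in for the source DB query) are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Iterable, Set


class MembershipCache:
    """LRU cache of group_id -> member set, rebuilt from the source on a miss."""

    def __init__(self, load_members: Callable[[str], Iterable[str]],
                 capacity: int = 1024) -> None:
        self._load = load_members
        self._capacity = capacity
        self._cache: "OrderedDict[str, Set[str]]" = OrderedDict()

    def members(self, group_id: str) -> Set[str]:
        if group_id in self._cache:
            self._cache.move_to_end(group_id)      # mark as recently used
            return self._cache[group_id]
        members = set(self._load(group_id))        # cold group: rebuild from DB
        self._cache[group_id] = members
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)        # evict least recently used
        return members

    def on_join(self, group_id: str, user_id: str) -> None:
        """Incremental update; assumes the source DB was already written."""
        if group_id in self._cache:
            self._cache[group_id].add(user_id)

    def on_leave(self, group_id: str, user_id: str) -> None:
        if group_id in self._cache:
            self._cache[group_id].discard(user_id)
```

Joins and leaves only touch groups that are already cached; a cold group picks up the change the next time it is rebuilt from the source.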
The very large channel (Slack #general at a 50K-person company)
Treat as a fanout-on-read product. Members pull on open; only mentions trigger fanout-on-send.
When a recipient is offline (no active connection), the message must reach them via OS push (APNs / FCM).
Trigger
Connection registry says recipient is offline → message service publishes a 'push' event to the push pipeline.
Push pipeline
Dedup across re-arrival
A user comes online while a push is in flight, so they get both an in-app notification and an OS push. Fully deduplicating the two requires coordination between the delivery path and the push path. The cheap mitigation is client-side: suppress the notification banner if the conversation is currently focused.
Badge count
Source of truth: count of unread messages per user. Maintained server-side. Sent on each push. Resync on app open.
End-to-end encryption interaction
With E2EE, the server can't read message content. The push payload contains only metadata (sender, conversation_id), and the APNs alert text is generic ("New message"). After the push arrives, the client fetches the ciphertext over its WebSocket connection and decrypts it locally. UX is degraded, but that's the price of privacy.
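As an illustration of a metadata-only push, the sketch below builds an APNs-style JSON payload. The `aps` dictionary with `alert` and `badge` is the standard APNs shape; the top-level `sender_id` and `conversation_id` keys and the function name are hypothetical custom fields, not something the source specifies.

```python
import json
from typing import Any, Dict


def build_push_payload(sender_id: str, conversation_id: str,
                       badge_count: int) -> Dict[str, Any]:
    """Build a push payload that carries no message content."""
    return {
        "aps": {
            "alert": "New message",   # generic text: the server cannot read the body
            "badge": badge_count,     # server-maintained unread count
        },
        # Custom keys (hypothetical names) so the client knows what to fetch.
        "sender_id": sender_id,
        "conversation_id": conversation_id,
    }


if __name__ == "__main__":
    payload = build_push_payload("alice", "conv-42", badge_count=3)
    print(json.dumps(payload, indent=2))
```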
Failure modes
APNs/FCM down → backoff and retry; messages still sit in the message store, available on next online connection. Don't block the messaging path on push success.
100B messages/day × 200 bytes × 365 days × 2 years = 14.6 PB of raw message data. Much of this is rarely read, which is why it is tiered.
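The storage estimate can be checked with a few lines of arithmetic. The inputs are the figures from the text; the 30-day slice is an extra calculation added here to show how small the most recent window is relative to the total. Decimal units (1 PB = 10^15 bytes) and no replication overhead are assumed.

```python
MESSAGES_PER_DAY = 100_000_000_000   # 100B
BYTES_PER_MESSAGE = 200
DAYS_PER_YEAR = 365
RETENTION_YEARS = 2

TB = 10**12
PB = 10**15

bytes_per_day = MESSAGES_PER_DAY * BYTES_PER_MESSAGE
total_bytes = bytes_per_day * DAYS_PER_YEAR * RETENTION_YEARS
last_30_days_bytes = bytes_per_day * 30

if __name__ == "__main__":
    print(f"per day:      {bytes_per_day / TB:.0f} TB")
    print(f"two years:    {total_bytes / PB:.1f} PB")
    print(f"last 30 days: {last_30_days_bytes / TB:.0f} TB")
```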
Hot tier (last 30 days)
Cassandra / HBase. Partition key = conversation_id, clustering key = message_id (descending). Range scan on partition for "fetch last N messages" is the dominant query. Sub-100ms.
Warm tier (30 days - 1 year)
Same store, different cluster, smaller compute. Slower reads acceptable.
Cold archive (>1 year)
S3 / Glacier / proprietary cold storage. Indexed by (conversation_id, time_range). Restored to warm on demand for full-history searches.
Per-user inbox vs per-conversation log
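Two valid models: a per-user inbox, where each recipient gets their own copy of every message, and a per-conversation log, where each message is stored once and readers fetch it by conversation.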
The per-conversation model wins at scale and is what WhatsApp/Signal/Telegram use.
Read pointers
Per (user, conversation): last_read_message_id. Updated on read. Used for unread counts and read receipts. Tiny rows; updated frequently. Hot KV store, eventually consistent.
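A sketch of read pointers and the unread count derived from them. It assumes dense per-conversation message IDs (so the count is a simple subtraction) and ignores details such as excluding the user's own messages. The class and method names are hypothetical.

```python
from typing import Dict, Tuple


class ReadPointers:
    """Tracks last_read_message_id per (user, conversation)."""

    def __init__(self) -> None:
        self._last_read: Dict[Tuple[str, str], int] = {}

    def mark_read(self, user_id: str, conversation_id: str, message_id: int) -> None:
        key = (user_id, conversation_id)
        # Keep the maximum so late or out-of-order updates (the store is
        # eventually consistent) never move the pointer backwards.
        self._last_read[key] = max(self._last_read.get(key, 0), message_id)

    def last_read(self, user_id: str, conversation_id: str) -> int:
        return self._last_read.get((user_id, conversation_id), 0)

    def unread_count(self, user_id: str, conversation_id: str,
                     latest_message_id: int) -> int:
        """Unread = latest ID in the conversation minus the read pointer."""
        return max(0, latest_message_id - self.last_read(user_id, conversation_id))


if __name__ == "__main__":
    pointers = ReadPointers()
    pointers.mark_read("bob", "conv-1", 40)
    pointers.mark_read("bob", "conv-1", 37)   # stale update is ignored
    print(pointers.unread_count("bob", "conv-1", latest_message_id=45))
```

A read receipt for a message falls out of the same data: it has been read by a user once that user's pointer is at or beyond its ID.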
Search
Out of band. An async indexer reads message events from Kafka and writes to Elasticsearch keyed by (user_id, conversation_id, message_id). Search is per-user (privacy) and conversation-scoped. Cold-tier messages need re-indexing on first search.
Server-readable vs end-to-end encrypted
E2EE (Signal protocol) eliminates server-side search, server-side spam filtering, server-side analytics, and easy backup. It's the right call for privacy-first products (WhatsApp, Signal). It's the wrong call for productivity tools where compliance and search are required (Slack). Mention which model you're designing for early.
Per-conversation counter vs Snowflake
Per-conversation counter gives strict ordering at the cost of one coordination point per conversation. Snowflake is wait-free but allows tiny ordering violations under clock skew. Chat needs strict order - take the counter.
At-least-once vs exactly-once delivery
Exactly-once is impossible without coordination on the receiver side. At-least-once + idempotent message IDs (client_message_id) gives you the practical equivalent: duplicate messages on retry are deduped at the receiver. This is the standard pattern.
Connection-tier statefulness
You can't make connection nodes stateless without losing the persistent connection. Accept the statefulness; design around it (sharding, registry lookups, graceful failover). A common interview anti-pattern is to over-design connection statelessness.
Group size limits
Telegram supports 200K-member groups; Discord supports millions in some channels. The architecture changes meaningfully above ~10K. Be explicit at interview time about what bound you're designing for and what changes at the next tier up.
Storage cost vs full history
Storing full history for 2B users gets expensive. WhatsApp's solution: full history lives on-device, and the server is a transit relay that buffers undelivered messages only for a limited window (WhatsApp documents up to 30 days). Telegram's solution: cloud storage, full history searchable. Different products, different trade-offs - call it out.
Be ready for at least three of these. The first one is almost always asked.