
Design a Real-Time Collaboration System (Figma / Google Docs)


CRDTs vs OT, presence, cursor broadcasting, and conflict-free merging when 50 people edit the same doc at once.

The problem

Design a collaborative editor where multiple users see each other's edits in real time, without conflicts, without losing work, and without a central lock. Users can be on flaky networks, can go offline mid-edit, can rejoin hours later, and the system must merge their work coherently.

This is the canonical "no easy answer" system design problem. It forces a candidate to pick between two famously hard families - operational transformation (OT) and conflict-free replicated data types (CRDTs) - and defend the choice. Strong candidates explain why a naive last-write-wins approach loses data and walk through a concrete merge example.

Clarifying questions

Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.

  • What's the document model - linear text (Docs), tree (Figma layers), spreadsheet cells, or rich structured?
  • How many concurrent editors per document - 5, 50, 500? The protocol changes sharply above ~50.
  • Latency target - sub-100ms cursor + edit propagation, or eventual sync?
  • Offline editing required, or always-online?
  • Permissions model - read-only viewers, comment-only, full edit, owner?
  • History / undo - per-user undo or global undo? How far back?
  • Storage durability - every keystroke captured or coarser snapshots?
  • Multi-device for the same user (laptop + phone editing simultaneously)?

Requirements

Functional requirements

  • Open document - load current state + subscribe to live updates
  • Apply edits locally and broadcast to all collaborators within ~100ms
  • Merge concurrent edits without losing or duplicating user intent
  • Show presence (who's in the doc) and live cursors with selection range
  • Persistent undo / redo respecting other users' edits
  • Offline edits queued locally; sync and merge on reconnect
  • Document history with point-in-time recovery

Non-functional requirements

Scale
100M documents, 10M concurrent active editors globally. Median doc has 1-3 active editors; tail has 50+. 100K edits/sec aggregate at peak. Per doc: up to 500 ops/sec sustained during heavy editing.
Latency
Edit propagation p99 < 100ms within a region, < 250ms cross-region. Cursor updates p99 < 50ms (humans notice cursor lag faster than text lag). Document open p99 < 500ms cold.
Availability
99.95% for the edit path. Editors who lose connection should keep editing locally; reconnect must be transparent. Read-only fallback must be available even if the realtime layer is degraded.
Consistency
Strong eventual consistency: all replicas that have seen the same set of operations converge to the same document state. No global linear order is required, but causal order within a session must be preserved.

Capacity estimation

Storage

  • Documents: 100M docs × 100KB avg = 10 TB raw. Operation log: 10x doc size for active docs = ~50 TB hot.
  • Snapshots: every N ops or M minutes, write a compacted snapshot. Reduces replay cost on cold open.
  • History retention: 90 days hot, 1 year warm, infinite cold (S3 + lifecycle).

Throughput

  • Edits: 100K ops/sec aggregate. Each op replicates to ~3 collaborators on average → 300K outbound deliveries/sec.
  • Cursors: ~5x edit rate at peak (users move cursors more than they type). 500K cursor messages/sec - cheap, ephemeral, drop-OK.
  • Snapshots: every 100 ops or 5 min per active doc = ~10K snapshot writes/sec.

Connections

  • 10M concurrent editors holding WebSockets. ~6K conns/gateway → ~1,700 gateway nodes.
  • Per-doc fan-out lives at the gateway: when a gateway holds N collaborators on the same doc, broadcast is in-process (no network hop per recipient).

Memory

  • Active doc state cached at coordinators: 1M active docs × 500KB CRDT state = 500 GB. Sharded across edit servers.

Network

  • Average edit op = ~200 bytes (CRDT op with vector clock); cursor update = ~80 bytes. At ~300K edit deliveries/sec plus cursor traffic, total broadcast bandwidth is on the order of 100-200 MB/s steady, with peaks several times higher.

High-level architecture

The system layers on top of a durable operation log per document. Clients send ops to the nearest edit server; the edit server orders ops within the doc, writes them to the log, and fans them out to all subscribed clients.

The defining decision is the merge algorithm. CRDTs make the merge associative, commutative, idempotent - any client can apply ops in any order and converge. OT requires the server to transform ops against concurrent ops before sending. CRDTs win at scale because the server is no longer a transform bottleneck; OT wins for compactness on linear text.

The defining operational problem is reconnection. A client offline for an hour comes back with 200 local ops; the server has 500 ops it missed; both must converge without conflict and without showing the user a "your edits were lost" dialog.

Edit gateway (WebSocket)

Holds persistent connections from clients. One gateway typically owns all collaborators on a given doc to make fan-out in-process. Routes ops from clients to the doc's edit server; broadcasts ops back.

Doc edit server

One logical owner per active doc (sharded by doc_id). Orders ops, applies CRDT merge, writes to op log, publishes to gateways. Stateless across docs but stateful per doc.

Op log (Kafka / append-only store)

Durable record of every operation. Source of truth for replay. Partitioned by doc_id for ordering within a doc.

Snapshot store

Periodic compacted state per doc. Lets cold opens skip replaying millions of ops. Stored in object storage with metadata in DynamoDB.

Presence service

Tracks who's in each doc (user_id, cursor position, selection, idle state). Ephemeral - lives in Redis with short TTL. Broadcast over the same WebSocket as edits.

Auth / permissions service

Resolves (user, doc) → role (owner / editor / commenter / viewer / none). Consulted on connect; cached per session. Enforces edit gating before ops are accepted.

Sync coordinator

Handles offline reconnect. Client sends "I have ops up to vector clock V"; server computes the diff and replays missing ops + accepts client's queued ops via CRDT merge.

Document history service

Async. Builds named versions, branch / restore. Reads from snapshots + op log. Powers user-facing 'version history' UI.

Deep dives

The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.

1. CRDTs vs OT: the core merge decision

Two families solve the same problem with very different trade-offs.

Operational Transformation (OT)
Each op is transformed against concurrent ops before being applied. If user A inserts 'X' at position 5 and user B concurrently inserts 'Y' at position 5, the server transforms B's op against A's so they don't both target the same index.

Pros: ops are tiny (a position + a char). Linear text editing is well-studied. Google Docs uses OT.
Cons: the transform function is famously hard to get right (the OT community has published incorrect algorithms for decades). Requires a central server to compute transforms - hard to scale horizontally per doc. Hostile to peer-to-peer.

Conflict-free Replicated Data Types (CRDTs)
Ops are designed so that applying them in any order yields the same result. Each character has a unique, immutable ID (often a fractional position or a Lamport timestamp); inserts and deletes commute by construction.

Pros: no central transform server. Peer-to-peer friendly. Reconnection is just "merge my ops into yours and vice versa" - guaranteed to converge. Figma, Linear, Notion-ish products use CRDTs.
Cons: ops carry more metadata (unique IDs, vector clocks). Tombstones for deletes accumulate - need garbage collection. Tree/structured CRDTs are still an active research area.

Recommendation for the interview
For new systems: pick a CRDT (Yjs, Automerge, or a custom one for your data model). Justify with: scale-out story, offline-first support, simpler reconnection logic.

For "we already have OT": don't switch unless you must. The OT codebase encodes years of corner-case fixes.

Concrete example - text inserts
State: "AC". User 1 inserts 'B' between A and C at position 1. Concurrently, user 2 inserts 'X' between A and C at position 1.

OT: server receives op1 (insert 'B' at 1), then op2 (insert 'X' at 1). Server transforms op2 against op1 → "insert 'X' at 2". Result: "ABXC".

CRDT: each char has a unique ID derived from position + replica ID. 'B' gets id (1, replica1); 'X' gets id (1, replica2). Tie-broken by replica ID. Both replicas converge to "ABXC" (or "AXBC" - both are valid as long as everyone agrees). The order is deterministic from the IDs.
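A minimal TypeScript sketch of the CRDT side of this example - not a production algorithm; the type names, fractional positions, and tie-break rule are illustrative. The point is that inserts placed by immutable ID converge regardless of arrival order:

```typescript
// Each char gets an immutable ID: a fractional position plus a replica ID
// used as a deterministic tie-breaker.
type CharId = { pos: number; replica: string };
type Char = { id: CharId; value: string; deleted: boolean };

function compareIds(a: CharId, b: CharId): number {
  if (a.pos !== b.pos) return a.pos - b.pos;
  return a.replica < b.replica ? -1 : a.replica > b.replica ? 1 : 0;
}

class TextCrdt {
  private chars: Char[] = [];

  // Applying an insert is just "place by ID"; arrival order doesn't matter.
  applyInsert(c: Char): void {
    if (this.chars.some((x) => compareIds(x.id, c.id) === 0)) return; // idempotent
    this.chars.push(c);
    this.chars.sort((x, y) => compareIds(x.id, y.id));
  }

  applyDelete(id: CharId): void {
    const target = this.chars.find((x) => compareIds(x.id, id) === 0);
    if (target) target.deleted = true; // tombstone, not removal
  }

  toString(): string {
    return this.chars.filter((c) => !c.deleted).map((c) => c.value).join("");
  }
}

// Base state "AC"; replica 1 inserts 'B' between A and C, replica 2 concurrently inserts 'X'.
const base: Char[] = [
  { id: { pos: 1.0, replica: "r0" }, value: "A", deleted: false },
  { id: { pos: 2.0, replica: "r0" }, value: "C", deleted: false },
];
const opB: Char = { id: { pos: 1.5, replica: "r1" }, value: "B", deleted: false };
const opX: Char = { id: { pos: 1.5, replica: "r2" }, value: "X", deleted: false };

const replica1 = new TextCrdt();
const replica2 = new TextCrdt();
[...base, opB, opX].forEach((c) => replica1.applyInsert(c)); // sees B first
[...base, opX, opB].forEach((c) => replica2.applyInsert(c)); // sees X first
console.log(replica1.toString(), replica2.toString()); // "ABXC" "ABXC" - converged
```

Both replicas print "ABXC" even though they applied the concurrent inserts in opposite orders - the order is fully determined by the IDs, not by arrival time.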

2. Presence, cursors, and ephemeral state

Edits and presence travel together but have different durability requirements.

Edits: durable, must never be lost, replayable from log.
Presence: ephemeral, dropping a cursor update is fine, must be cheap.

Treating them the same wastes resources. Treating them too differently makes the protocol confusing.

The split

  • Edits flow through the durable op log. Every op is journaled before broadcast.
  • Presence (cursor, selection, "is typing", color) flows over the same WebSocket but is never persisted. Lost messages are replaced by the next update.

Cursor message rate
Naive: send a message on every cursor move (every keystroke). At 5 keystrokes/sec × 50 collaborators × 50 viewers = 12,500 messages/sec for a single hot doc. Untenable.

Optimizations:

  • Coalesce on the sender: send at most every 50ms (20Hz). The eye doesn't notice faster (see the sketch after this list).
  • Drop intermediate updates on the broadcast side: if 5 cursor updates queue for the same user before flushing, send only the latest.
  • Don't broadcast cursor updates to disconnected viewers (obvious but easy to miss).
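A sketch of the sender-side coalescing from the first bullet, assuming a send(msg) transport; the message shape and 50ms interval are illustrative:

```typescript
// At most one cursor message every 50ms; if the cursor moves again before the
// window closes, only the latest position is sent (latest-wins).
type CursorUpdate = { docId: string; userId: string; charId: string; ts: number };

class CursorThrottle {
  private pending: CursorUpdate | null = null;
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private send: (u: CursorUpdate) => void,
    private intervalMs = 50, // ~20Hz; faster is imperceptible
  ) {}

  onCursorMove(update: CursorUpdate): void {
    this.pending = update; // overwrite any queued update
    if (this.timer === null) {
      this.timer = setTimeout(() => this.flush(), this.intervalMs);
    }
  }

  private flush(): void {
    if (this.pending) this.send(this.pending);
    this.pending = null;
    this.timer = null;
  }
}
```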

Selection ranges
A selection has a start and end. CRDTs reference these by character ID, not position - otherwise a concurrent edit shifts the selection visibly mid-action. The selection's start/end IDs survive other users' edits gracefully.

Awareness vs presence
Awareness = the broader state of "what is each user doing right now" (cursor, selection, scroll position, current tool, color). Presence = "who's here". Yjs models awareness as a CRDT-like ephemeral state per user, gossiped periodically.

Reaping disconnected users
WebSocket close events are unreliable on mobile (users walk into elevators). Each user's awareness has a TTL (~30 seconds); stale entries are reaped. Other clients see them disappear without explicit logout.
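A sketch of the TTL-based reaper, assuming each awareness entry carries a lastSeenMs heartbeat timestamp; the 30-second TTL matches the figure above:

```typescript
// Entries older than the TTL are dropped even if no WebSocket close event ever arrived.
const AWARENESS_TTL_MS = 30_000;

type Awareness = { userId: string; cursorCharId: string; lastSeenMs: number };

function reapStale(entries: Map<string, Awareness>, nowMs = Date.now()): string[] {
  const reaped: string[] = [];
  for (const [userId, entry] of entries) {
    if (nowMs - entry.lastSeenMs > AWARENESS_TTL_MS) {
      entries.delete(userId);
      reaped.push(userId); // broadcast "left" to the remaining clients
    }
  }
  return reaped;
}
```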

3. Offline edits and reconnection

The offline case is what separates production-grade collab from demoware. A user opens the doc, edits for 20 minutes on a plane, comes back online. Their edits must merge with the 200 ops their teammates added while they were gone.

Local-first model
Every client maintains a local CRDT replica. Edits apply to the local replica immediately - no server round trip. Edits are also queued for broadcast. The UI never blocks on the network.

Outbound queue
Pending ops sit in IndexedDB with the client's local vector clock. On reconnect, the client sends them in order. The server merges them into the canonical state and broadcasts to other clients.

Inbound replay
On reconnect, the client tells the server "my last seen vector clock is V". The server returns all ops since V. Client merges them into local state. Because CRDT ops commute, order doesn't matter, but causal order is preserved by vector clocks.
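A sketch of that handshake, assuming per-replica sequence numbers summarized in a vector clock; the exchange endpoint and names are illustrative:

```typescript
type VectorClock = Record<string, number>; // replicaId -> highest seq seen
type Op = { replicaId: string; seq: number; payload: unknown };

// Server side: return every logged op the client hasn't seen yet.
function opsMissingFrom(log: Op[], clientClock: VectorClock): Op[] {
  return log.filter((op) => op.seq > (clientClock[op.replicaId] ?? 0));
}

// Client side on reconnect: push the queued local ops, then merge the server's diff.
async function resync(
  localClock: VectorClock,
  outboundQueue: Op[],
  server: { exchange(clock: VectorClock, ops: Op[]): Promise<Op[]> },
  applyLocally: (op: Op) => void,
): Promise<void> {
  const missing = await server.exchange(localClock, outboundQueue);
  for (const op of missing) {
    applyLocally(op); // CRDT merge: order doesn't matter, result converges
    localClock[op.replicaId] = Math.max(localClock[op.replicaId] ?? 0, op.seq);
  }
}
```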

Conflict cases

  • User A deletes a paragraph offline. User B edits a sentence in that paragraph. On merge, the deletion wins (the paragraph is gone); B's edits become orphaned ops, either discarded silently or surfaced in version history so the text can be recovered.
  • User A renames a layer. User B moves the same layer's position. Both ops apply (different fields) - clean merge.
  • User A and user B both rename the same layer concurrently. Last-writer-wins by Lamport timestamp - but the loser's name is preserved in history so they can recover.

Long-offline scenarios
A user offline for a week comes back with thousands of local ops. Replaying them all on the server is O(N+M) where N is local and M is remote. For very long offlines, the client may need to download a snapshot first and apply local ops on top.

Snapshot-based reconnection
If the server's op log past V has been compacted away (e.g., V is older than retention), the server returns "go fetch snapshot at version W, then apply ops from W onward". The client downloads the snapshot, replays its local ops on top, and proceeds.

The "your changes were saved" UX contract
The user's mental model is "if I can see it on screen, it's saved". The reality is "it's saved when the server has acked it". The gap is bridged by:

  • Persistent local queue (IndexedDB) so ops survive browser close.
  • Visible sync indicator ("Saving...", "All changes saved").
  • On reconnect, the client doesn't show success until the server confirms.

4. Sharding documents and the hot-doc problem

One logical owner per active doc keeps merging simple, but uneven traffic creates hotspots.

Default sharding
Hash by doc_id → assign to one of N edit servers. Each server holds the active state for the docs it owns. Ops for a doc always route to its owner.

Pros: per-doc serialization is trivial (single owner, single thread per doc). No distributed coordination per op.
Cons: a hot doc (e.g., a Figma file open during a 50-person design review) sits on one server.
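One way to implement the doc_id → owner assignment is rendezvous (highest-random-weight) hashing, sketched below with an illustrative hash function; when a server leaves, only the docs it owned move:

```typescript
// FNV-1a string hash, used here only for illustration.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Every gateway computes the same owner for a doc without coordination.
function ownerFor(docId: string, editServers: string[]): string {
  let best = editServers[0];
  let bestScore = -1;
  for (const server of editServers) {
    const score = fnv1a(`${docId}:${server}`);
    if (score > bestScore) {
      bestScore = score;
      best = server;
    }
  }
  return best;
}

// Example: route a doc to one of three edit servers.
console.log(ownerFor("doc-42", ["edit-1", "edit-2", "edit-3"]));
```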

Hot-doc mitigation

  • Vertical: give the doc a beefier server (move it to a dedicated tier). Works up to a single-machine ceiling.
  • Per-doc fan-out gateways: the edit server stays the merge owner but offloads broadcast to N gateways, each handling a subset of viewers. The edit server publishes ops to a per-doc topic; gateways subscribe.
  • Read-only replicas: viewers (read-only mode) are served by replicas that lag the leader by ms. Editors stay on the leader.

Failover
If an edit server crashes, its docs need a new owner fast. Approaches:

  • Lease-based: each server holds a lease on its doc set. On crash, the lease expires and another server picks up (sketched below).
  • Coordination service (etcd, ZooKeeper) tracks doc → owner mapping; clients re-route on owner change.

Recovery time: target sub-second. The op log persists every op before ack, so no data loss; only a brief unavailability while the new owner reads the doc's state.
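A minimal sketch of the lease-based approach, assuming a shared store with an atomic compare-and-set (e.g., a conditional write); the TTL and names are illustrative:

```typescript
const LEASE_TTL_MS = 2_000;

type Lease = { owner: string; expiresAtMs: number };

interface LeaseStore {
  get(docId: string): Promise<Lease | null>;
  // Succeeds only if the stored value still equals `expected` (or is absent).
  compareAndSet(docId: string, expected: Lease | null, next: Lease): Promise<boolean>;
}

// A server calls this periodically for docs it wants to own (or keep owning).
async function tryAcquire(store: LeaseStore, docId: string, me: string): Promise<boolean> {
  const now = Date.now();
  const current = await store.get(docId);
  const free = current === null || current.expiresAtMs < now || current.owner === me;
  if (!free) return false;
  return store.compareAndSet(docId, current, { owner: me, expiresAtMs: now + LEASE_TTL_MS });
}
```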

Cold doc lifecycle
Doc with no active editors for >5 minutes is unloaded. Last snapshot is finalized; in-memory state freed. On next open, server reads snapshot + tail of op log, rebuilds state. Cold open p99 < 500ms is the target.

Cross-region collab
A team distributed across continents needs all members to see each other's edits with low latency. Two patterns:

  • Single global owner per doc - one region holds the leader; cross-region clients pay one extra round trip.
  • Multi-leader CRDT - each region has a leader; they reconcile asynchronously. Adds complexity but cuts cross-region edit latency.

Most products do single global owner with the leader placed near the doc's most-active region. The few % of cross-region collaborators pay the cost.

5. Permissions, sharing, and the link-share trap

Edit permissions need to be enforced server-side on every op. Client-side checks are decorative.

Roles
Owner / Editor / Commenter / Viewer / None. Resolved per (user, doc).

Resolution path
On connect, the auth service computes the user's role for the doc. The edit server caches it for the session. Every op carries an implicit "I claim role X" check - server verifies.

Role changes mid-session: the auth service publishes invalidation events. The edit server re-checks on the next op.
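A sketch of the per-op gate, assuming a cached role plus a staleness flag flipped by the auth service's invalidation events; the role names and interfaces are illustrative:

```typescript
type Role = "owner" | "editor" | "commenter" | "viewer" | "none";

interface Session {
  userId: string;
  docId: string;
  role: Role;
  roleStale: boolean; // set true when the auth service publishes an invalidation
}

const CAN_EDIT: ReadonlySet<Role> = new Set(["owner", "editor"]);

async function acceptOp(
  session: Session,
  op: { kind: "edit" | "comment" },
  auth: { resolveRole(userId: string, docId: string): Promise<Role> },
): Promise<boolean> {
  // Re-check on the next op after a role-change event, not on every op.
  if (session.roleStale) {
    session.role = await auth.resolveRole(session.userId, session.docId);
    session.roleStale = false;
  }
  if (op.kind === "edit") return CAN_EDIT.has(session.role);
  if (op.kind === "comment") return session.role !== "viewer" && session.role !== "none";
  return false;
}
```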

Link sharing
"Anyone with the link can edit" is a sharing-mode bit on the doc. Clients without an explicit user-doc grant pass an opaque link token; auth service maps the token to a role.

Trap: link-shared docs can leak via accidental forwards or screenshots of URLs. Enterprise tier usually disables this and requires named sharing.

Comment-only mode
A specialized role: can read everything, can attach comments anchored to characters/regions, cannot mutate the document tree. Comments are a parallel CRDT (or a side-table) anchored by character ID so they survive edits.

Branching / forking
"Make a copy I can edit" is a doc-level operation: clone the snapshot at version V into a new doc with the user as owner. The two docs diverge from then on; there's no automatic re-merge.

Audit
Every op records (user_id, timestamp, op). The audit log is the legal record of who changed what. Stored append-only, retained per compliance requirements (often 7+ years).

6. Garbage collection: tombstones, snapshots, op log compaction

CRDTs accumulate metadata. Without GC, every doc grows linearly forever.

Tombstones
A delete op in a CRDT can't actually remove the underlying record - other replicas may have ops referencing it. Instead, the record is marked deleted (tombstoned). Tombstones must persist until everyone has seen the delete.

Strategy: when all replicas have advanced their vector clock past the delete op, the tombstone is GC-eligible. Implementing this requires a global "everyone has seen up to vector clock X" signal - usually approximated by "X is older than the longest plausible offline window" (e.g., 30 days).
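A sketch of that approximation - a tombstone becomes collectible once it's older than the maximum plausible offline window, since any replica offline longer must resync from a snapshot anyway:

```typescript
const MAX_OFFLINE_MS = 30 * 24 * 60 * 60 * 1000; // 30-day offline window, as above

type Tombstone = { charId: string; deletedAtMs: number };

function gcEligible(t: Tombstone, nowMs: number): boolean {
  // Replicas offline longer than the window resync from a snapshot, so they
  // can no longer send ops that reference this character.
  return nowMs - t.deletedAtMs > MAX_OFFLINE_MS;
}

function compact(tombstones: Tombstone[], nowMs = Date.now()): Tombstone[] {
  return tombstones.filter((t) => !gcEligible(t, nowMs));
}
```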

Op log compaction
The log grows ~1 op per keystroke. A doc edited daily for years has tens of millions of ops. Replay is O(N) on cold open - eventually unbearable.

Mitigation: periodic snapshots. Every K ops or T minutes, take a compacted state snapshot. Cold open reads the latest snapshot and replays only the tail.
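A sketch of the trigger, using the "100 ops or 5 minutes" figures from the capacity estimates; the caller writes the compacted snapshot whenever it returns true:

```typescript
const SNAPSHOT_EVERY_OPS = 100;
const SNAPSHOT_EVERY_MS = 5 * 60 * 1000;

class SnapshotPolicy {
  private opsSinceSnapshot = 0;
  private lastSnapshotMs = Date.now();

  // Called by the edit server after each applied op.
  onOpApplied(nowMs = Date.now()): boolean {
    this.opsSinceSnapshot++;
    const due =
      this.opsSinceSnapshot >= SNAPSHOT_EVERY_OPS ||
      nowMs - this.lastSnapshotMs >= SNAPSHOT_EVERY_MS;
    if (due) {
      this.opsSinceSnapshot = 0;
      this.lastSnapshotMs = nowMs;
    }
    return due;
  }
}
```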

Snapshot retention

  • Latest 1-3 snapshots: hot, on SSD. Sub-second open.
  • Hourly snapshots: 90 days, warm storage. For point-in-time recovery.
  • Daily snapshots: 1+ year, cold storage. For audit.

Op log truncation
Once a snapshot covers up to op N, ops 0..N-1 in the live op log can be moved to cold storage. The hot log shrinks. Audit/history queries fall back to cold.

Per-user undo
Undo across users is intrinsically hard: undoing your op may conflict with someone else's later op. Most systems implement per-user "undo my last op" which works as long as the op is invertible and no later op depends on it. Beyond that, undo is best-effort or disabled.

The forever-open doc
A pathological case: a doc that's been open continuously for years with no idle window for cleanup. Two mitigations:

  • Force a periodic "compaction window" (e.g., during low-traffic hours, briefly pause new ops, snapshot, resume).
  • Online compaction: build the snapshot in the background while ops continue; atomically swap. More complex; standard for production CRDT libraries.

Trade-offs

CRDT vs OT
CRDTs win for new systems: scale-out, peer-to-peer, offline-first. OT wins where the codebase already exists and the use case is linear text. Pick CRDT and own the metadata cost.

Single owner per doc vs multi-leader
Single owner is simpler and serves >95% of products. Multi-leader is for cross-region collab where the latency win justifies the merge complexity.

Persistent vs ephemeral presence
Edits durable, presence ephemeral - this split saves ~10x on storage with no user-visible cost. Mixing them up is a common newbie trap.

Eager vs lazy snapshotting
Eager (every K ops) keeps cold open fast at the cost of write amplification. Lazy (only when the doc closes) saves writes but punishes cold readers. Production systems do eager + amortized.

Tombstone retention vs storage cost
Long retention (30+ days) makes offline reconnect bulletproof but doubles storage. Short retention (1 day) reduces storage but breaks long-offline merge. Pick based on the offline use case (mobile apps need long; web app users rarely need it).

Per-keystroke ops vs batched ops
Per-keystroke gives the most accurate undo and history but multiplies traffic. Batching (group keystrokes within 100ms into one op) cuts traffic 5-10x at minor history fidelity loss. Most products batch.

Strict ACL enforcement vs permissive UX
Server enforces every op. Some products optimistically apply edits client-side and roll back on rejection - UX feels snappier but flickers when permissions change mid-edit. Pick based on the rate of permission changes in your product.

Client-only CRDT vs server-mediated
Client-only (true peer-to-peer) is elegant but discovery, NAT traversal, and abuse make it hard for consumer products. Server-mediated CRDTs (the Figma model) keep the convergence guarantees while letting the server enforce policy.

Common follow-up questions

Be ready for at least three of these. The first one is almost always asked.

  • How would you support a 200-person live editing session without melting the edit server?
  • What's your protocol for a user who's been offline for 30 days?
  • How would you implement "named versions" the user can restore from?
  • What changes if the data model is a tree (Figma layers) instead of linear text?
  • How do you keep cursor latency under 50ms for cross-region collaborators?
  • How would you A/B test a change to the merge algorithm without corrupting documents?
  • What's your story for a user editing the same doc from laptop and phone simultaneously?
  • How do you handle a malicious client sending invalid ops at high rate?


Practice in interview format

Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.

Start an AI mock interview →