Design Distributed Transactions (2PC, Saga, TCC)
Two-phase commit, sagas (choreography vs orchestration), TCC, idempotency keys, and the compensation logic that turns multi-service writes into something a customer-support agent can untangle.
The problem
Design a transaction protocol that spans multiple independent services - payments, inventory, shipping, notifications - where each owns its own database and you cannot use a single ACID transaction across them. The end-state must be consistent: either every service commits its piece, or every service rolls back. Failures (network blips, service crashes, duplicate retries) must not leave the system half-committed.
This is the question that separates "I read the saga blog post" from "I have run this in production". Strong candidates explain why 2PC is rare in practice, walk through a concrete saga (orchestration vs choreography), justify their idempotency story, and own the failure modes - what happens when a compensating action itself fails.
Clarifying questions
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
- →How many services participate in a typical transaction - 2, 5, 20?
- →Are the participating services all owned by us, or do some belong to third parties (Stripe, shipping carriers)?
- →What's the latency budget for the end-to-end transaction - 100ms, 1s, 30s?
- →Is partial visibility acceptable mid-transaction (e.g., order created but not yet paid), or must intermediate states be hidden?
- →Throughput - hundreds, thousands, or tens of thousands of transactions/sec?
- →Are compensations always possible, or are there irreversible steps (sending an SMS, charging a card)?
- →Strong serializability across services required, or is read-your-writes within a service enough?
- →What's the operator's escape hatch - a back-office tool to manually resolve stuck transactions?
Requirements
Functional requirements
- ·Begin a transaction spanning N services with a unique transaction ID
- ·Each step records its outcome (committed, failed, compensated) durably
- ·On any step failure, run the compensating action for every previously-committed step in reverse order
- ·Idempotent retries: a duplicate request for any step is a no-op, not a double-commit
- ·Visibility: observability of in-flight transactions, with operator tools for stuck ones
- ·Timeout policy: long-running transactions are detected and either compensated or escalated
- ·Audit log of every step transition for compliance and debugging
Non-functional requirements
- Scale
- 10K transactions/sec aggregate. Median txn touches 3-5 services; tail touches 10+. 99% complete in <1s; long-tail (waiting for human / external API) can run for hours.
- Latency
- Orchestrated saga p99 < 500ms for a 5-step happy path. Compensation chain p99 < 2s. 2PC commit phase p99 < 50ms intra-DC.
- Availability
- 99.9% for the coordinator. A coordinator outage stalls in-flight transactions but does not corrupt them - on recovery, the durable log resumes them.
- Consistency
- Eventual atomicity: every transaction reaches a terminal state (committed or fully compensated) within bounded time. No silent half-commits. Read isolation is per-service - cross-service consistency is best-effort during the saga window.
Capacity estimation
Transaction log
- 10K txn/sec × ~10 events/txn (start, step start, step ack, ...) × 1KB = 100 MB/s write to the txn log. ~8 TB/day uncompressed. Tier hot 7 days, warm 90 days, cold for compliance retention.
Coordinator state
- In-flight transactions: 10K/sec × 1s avg = 10K active. With long-tail (some run for hours), plan for ~100K-1M concurrent. Each ~5KB of state in memory + checkpointed to durable store.
Idempotency cache
- Each step request carries an idempotency key (UUID + step ID). Cache hit returns the prior outcome without re-executing.
- 10K txn/sec × 5 steps × 2 retries avg = 100K key lookups/sec. Cache: 30-day window, ~250B/entry × 26B requests = ~6 TB. Sharded Redis or DynamoDB with TTL.
Compensation rate
- Healthy systems compensate 0.1-1% of transactions. At 10K/sec, that's 10-100 compensation chains/sec. Each chain fans out to 1-N services - manageable on the same infra.
Coordinator throughput per node
- A single orchestrator node handles ~1K saga executions/sec when steps are RPCs (network-bound) and ~10K/sec for in-process state machines. Shard by transaction ID hash for horizontal scale.
Latency budget for a 5-step orchestrated saga
- 5 steps × (50ms RPC + 5ms log write + 5ms state advance) = 300ms steady-state happy path. Failure cases add the compensation chain (≤300ms more).
High-level architecture
The system layers a transaction coordinator on top of N participant services. The coordinator owns the durable transaction log; each participant owns its own database and exposes idempotent step + compensation APIs. The coordinator drives the state machine: start → step 1 → step 2 → ... → commit (or compensate-from-N → ... → compensate-from-1 → aborted).
The defining decisions: (1) coordinated (orchestration with a central saga executor) vs choreographed (each service publishes events and reacts), (2) compensating actions vs reservations (TCC), (3) where the idempotency boundary lives (coordinator-side keys vs participant-side dedup table), (4) what to do when a compensating action itself fails (escalation queue + human resolution).
The defining operational property: every transaction must reach a terminal state. A coordinator crash mid-saga is recoverable from the log; an unbounded "stuck" transaction is a bug, not a feature.
Saga coordinator (orchestrator)
Stateless workers driving each transaction's state machine. Reads next step from the log, calls the participant, writes the outcome back. Sharded by transaction ID. The brain of the orchestration model.
Transaction log
Durable append-only record of every state transition (txn started, step started, step committed, compensation triggered). Source of truth for recovery. Kafka, DynamoDB streams, or a relational outbox table.
Participant service
Owns its database. Exposes a step API (do-the-thing) and a compensation API (undo-the-thing). Both must be idempotent on the request key. Examples: payments service, inventory service, shipping service.
Idempotency store
Maps (transaction_id, step_id) → outcome. Consulted on retry; populated on first ack. TTL'd at 30-90 days. Lives inside the participant or as a shared infra service.
Event bus (for choreography)
When using choreography, services publish 'step completed' / 'step failed' events. Other services subscribe and act. Kafka or a managed pub/sub. Optional in pure orchestration setups.
Compensation queue
When a step fails after others committed, the coordinator enqueues compensation jobs in reverse order. Jobs retry with backoff; persistent failures escalate to the human queue.
Stuck-transaction monitor
Background scanner finds transactions that have been in a non-terminal state past their SLA. Alerts on-call and surfaces them in the operator UI for manual resolution.
Operator console
Read-only view of any transaction's state machine + per-step audit. Action buttons for force-compensate, force-commit, manual retry. Every action audit-logged.
Outbox publisher
For services that want atomicity between local DB write and event publish: write event row in same transaction as state change; background publisher reads the outbox and publishes to the bus. Defeats the dual-write problem.
Deep dives
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
1. 2PC vs 3PC: why two-phase commit is rare in practice
Two-phase commit is the textbook answer and almost never the production answer. Understanding why is the interview signal.
2PC mechanics
Phase 1 (prepare): coordinator asks every participant "can you commit?". Each writes its intent durably and votes yes/no.
Phase 2 (commit/abort): if all voted yes, coordinator says "commit". If any voted no (or timeout), "abort". Participants apply and ack.
Why 2PC works on paper
ACID across N participants. Atomic - either all commit or none does. Strong consistency.
Why it falls over in production
- Blocking on coordinator failure. If the coordinator crashes between phase 1 and phase 2, every participant is stuck holding locks waiting for the decision. Recovery requires reading the coordinator's log; if that log is also lost, manual intervention.
- Long lock holds. Each participant locks the affected rows from prepare to commit. Cross-service round-trips push lock duration into the seconds. Throughput collapses under contention.
- Coordinator is a single point of failure. Replicating the coordinator is hard - the coordinator's own state needs consensus.
- Doesn't span heterogeneous DBs. 2PC requires every participant to support the prepare protocol (XA / 2PC interface). Most modern services don't.
- Doesn't span external APIs. Stripe, Twilio, shipping carriers - none of them support being a 2PC participant.
3PC
Adds a "pre-commit" phase to remove the blocking case (in theory). In practice, 3PC requires synchronous network and bounded message delays - assumptions that don't hold on real networks. Almost no real system uses 3PC.
Where 2PC still makes sense
- Same database, multiple shards (e.g., distributed transactions in Spanner, CockroachDB, YugabyteDB) - the DB is the coordinator and participants, lock times are sub-ms, recovery is built-in.
- Within a single trusted datacenter with XA-supporting databases for legacy enterprise integrations.
- Never across services on the open internet, never with third-party APIs in the loop.
The interview answer
"2PC is the right model when you control all participants, they all support prepare/commit, and lock duration is sub-millisecond. For cross-service workflows in microservices or with third-party APIs, sagas with idempotency are the production answer."
2. Saga: choreography vs orchestration
The saga pattern: replace one big distributed transaction with a sequence of local transactions, each followed by a compensating action if a later step fails.
Choreography
Each service listens to events and reacts. No central coordinator. Service A commits step 1 → publishes "step 1 done" → service B sees it, commits step 2 → publishes "step 2 done" → ...
If a step fails, the failing service publishes a "saga failed at step N" event; earlier services subscribed to this event run their compensations.
Pros:
- No central coordinator to scale, deploy, or debug.
- Services are autonomous; new participants join by subscribing.
- Naturally event-driven; fits well with Kafka-centric architectures.
Cons:
- The full saga lives nowhere as code - it's emergent from service subscriptions.
- Hard to reason about: "what happens if step 3 fails?" requires tracing event flows across services.
- Hard to evolve: changing the saga shape means coordinated changes across many services.
- Cyclic event dependencies are easy to introduce and hard to detect.
Best for: <5-step sagas, small set of stable services, strong event-driven culture.
Orchestration
A central orchestrator (the saga executor) drives the state machine. It calls service A, waits for ack, calls service B, etc. On failure, it calls the compensations in reverse order.
Pros:
- The saga is one piece of code. Easy to read, test, version.
- Failure handling lives in one place.
- Easy to add steps, branches, conditionals (skip step 3 if condition X).
- Natural place for timeouts, retries, escalation.
Cons:
- Orchestrator becomes a bottleneck and a deployment chokepoint.
- Risk of "orchestrator becomes a god service" that knows every business rule.
- Coupling: the orchestrator must know all participant APIs.
Best for: 5+ step sagas, complex branching, multi-team environments where one team owns the workflow.
The hybrid
Many production systems use orchestration for the main flow + choreography for side effects. The orchestrator drives the critical path (charge → reserve inventory → ship); a "shipment created" event triggers downstream non-critical reactions (analytics, notifications, recommendations refresh).
Recommendation
Default to orchestration for new systems. The debuggability win dominates the architectural-purity loss. Choose tools (Temporal, AWS Step Functions, Cadence, Camunda) that codify the state machine and persist progress automatically.
3. Compensating actions: writing the undo and what 'undo' even means
Sagas trade ACID rollback for compensating actions. The compensation problem is harder than the commit problem.
Semantic vs literal undo
You don't undo "charge $100" by un-charging - you issue a refund. The refund is a new operation with a new ID. The customer sees a charge and then a refund on their statement. Literal time-reversal isn't available; semantic reversal is.
This affects every step:
- Reserve inventory → release reservation.
- Send confirmation email → send a "we couldn't complete your order" email (not "delete the previous email").
- Create user account → mark account as deleted (or actually delete - business call).
- Allocate a coupon code → return it to the pool.
Steps that have no compensation
Some actions are irreversible: sent SMS, sent push notification, fired-off webhook to a third party that isn't transactional. The saga design must either:
- Move irreversible steps to the end (after the last possible failure point).
- Convert them to two-phase: schedule the SMS, then commit the schedule once the rest succeeds.
- Accept the "best effort" with operator notification when irreversible actions can't be reversed.
Writing the compensation
The compensation is itself a service operation. It must be:
- Idempotent: re-running compensation is a no-op.
- Best-effort durable: persist intent to compensate; retry until success.
- Auditable: log the original action and its compensation linked by transaction ID.
Compensation failures
What if the compensation itself fails (network down, downstream rejecting refund)? Options:
- Retry with backoff: most transient failures resolve within minutes.
- Escalate to human queue: after N retries, flag for operator. Show the original transaction, the failed steps, the failed compensation.
- Forward recovery: instead of reversing, push the transaction forward through an alternative path (e.g., refund failed → issue store credit instead).
The interview signal: candidates who treat "compensation always works" as an axiom score lower than candidates who design the operator queue for the cases where it doesn't.
The classic example - hotel + flight booking
Saga: book flight → book hotel → charge card → confirm.
Step 4 (confirm) fails (e.g., user changes mind, payment processor declines).
Compensations in reverse: refund card → cancel hotel → cancel flight.
Each cancellation may itself fail (hotel cancellation deadline passed, flight non-refundable). Real systems convert "cannot cancel" into a business policy: the customer is on the hook for the non-refundable component, the saga reaches a "partially compensated" terminal state, and a human reviews edge cases.
4. TCC (Try-Confirm-Cancel): reservations as a 2PC alternative
TCC is the saga pattern's pragmatic cousin. Each step splits into two phases: a Try (reserve) and a Confirm/Cancel (commit/release). Looks like 2PC, behaves like a saga.
The pattern
Phase 1 (Try): each service reserves the resource without making it user-visible. Inventory is decremented from "available" but moved to "reserved", not yet "sold". Card is authorized but not captured.
Phase 2 (Confirm or Cancel): once all Try steps succeed, coordinator calls Confirm on each. Reservations become real: reserved inventory becomes sold, auth becomes capture. If any Try fails, coordinator calls Cancel: reservations release, auths void.
Why this is better than naive saga + compensation
- Try is a soft commit - cancellation is cheap and fast (release a reservation, void an auth).
- Confirm is fast (move state from reserved to confirmed - typically a single field flip).
- The window of inconsistency is short and contained. No customer sees an order before Confirm.
Why this is better than 2PC
- No cross-service distributed locks. Each service's local Try is its own local transaction.
- Reservations expire automatically if Confirm doesn't come. No coordinator-crash blocking problem.
- Heterogeneous services can implement Try/Confirm/Cancel without a 2PC interface.
Where TCC fits
- Inventory + payment workflows where reservations are a natural concept.
- Booking systems (hotels, flights, restaurants).
- Resource allocation (compute provisioning, IP allocation, license activation).
Where TCC doesn't fit
- Services without a natural "reservation" concept (e.g., sending an SMS - you can't reserve sending).
- Services owned by third parties that don't expose Try semantics.
Implementation gotchas
- Reservation TTL: every Try writes an expiry. If Confirm doesn't arrive by then, the reservation auto-cancels. This protects against coordinator crashes.
- Idempotent Confirm/Cancel: like all distributed steps, Confirm and Cancel must be idempotent on the reservation ID.
- Hanging operations: a Try succeeds, the coordinator crashes, the Confirm arrives after the TTL expired. Service must reject the Confirm; coordinator must detect and treat as failure.
Comparison summary
- 2PC: synchronous distributed locks. Strong consistency. Doesn't scale, blocks on failures.
- Saga + compensate: each step commits locally, compensate on failure. Scales, but compensations can be expensive or impossible.
- TCC: each step Try-reserves locally, Confirm/Cancel on coordinator decision. Scales, cancellations are cheap, but requires the Try semantics.
Pick TCC when reservations are natural; saga otherwise.
5. Idempotency: keys, dedup tables, and the at-least-once contract
Distributed transactions are at-least-once. Without idempotency, a single transient failure double-commits. Idempotency is not optional.
The contract
Every step request carries an idempotency key. The participant guarantees: for a given key, the operation is performed at most once. Subsequent requests with the same key return the prior result.
Where the key comes from
- Coordinator-generated: orchestrator computes
(transaction_id, step_id)as the key. Stable across retries. - Client-generated: the inbound API request carries an idempotency key, propagated to every downstream call.
Stripe, Square, Adyen all expose Idempotency-Key headers - clients must generate stable keys to avoid double-charging.
Implementation: dedup table
Each participant maintains a table:
PK: idempotency_key
attributes: outcome (success/failure), result_payload, created_at
TTL: 30-90 days
On every request:
- Look up the key. If present, return the cached outcome. (Optionally hash the request body to detect key reuse with different params - reject as a misuse.)
- If absent, execute the operation.
- Atomically write the outcome. If two concurrent requests arrive with the same key, one wins the write; the other returns the winner's result.
The atomic write is critical. Race condition: two requests both find no entry, both execute, both write. Now you've committed twice. Solutions:
- Conditional put (DynamoDB
if_not_exists, PostgresINSERT ... ON CONFLICT DO NOTHING RETURNING). - Distributed lock per key (Redis SETNX) wrapping the execute + write.
- Single-writer per key partition (use the key as a Kafka partition key, one consumer per partition).
TTL choice
The key must outlive the maximum retry window. Stripe defaults to 24h; orchestration platforms typically retry over hours; for human-in-loop workflows that can take days, extend the TTL. Past the TTL, the system reverts to non-idempotent behavior on duplicates - a known and documented limitation.
Idempotency at every hop
The coordinator, the participant, and any internal proxy/queue must all be idempotent on the same key. A queue that delivers at-least-once + a non-idempotent consumer = duplicates. Audit the entire request path.
The "I retried but my server already committed" pattern
Common bug: client times out, retries with the same idempotency key, server returns the cached "success" result from the first attempt. Looks fine. But the client's first attempt actually got a network failure on the response - the server committed, the client never knew. The retry is what surfaces success.
This is exactly the case idempotency is designed for. Clients must be willing to retry timeouts and trust that the idempotency key prevents double-execution.
Side effects in idempotency
Pure operations are easy. Operations with side effects (send email, push notification, write to a third-party API) need careful design:
- Make the side effect itself idempotent if possible (e.g., send the email with a stable Message-ID header).
- Or: separate the side effect from the idempotent record. The idempotent record marks "we sent the email at time T to user X"; replaying retrieves the record without re-sending.
The interview signal: candidates who own the at-least-once delivery model and design idempotency at every layer score higher than candidates who hand-wave with "exactly-once".
6. When to give up: choosing eventual consistency over distributed transactions
The most senior answer is sometimes "we don't need a distributed transaction here". Recognizing when scores higher than implementing one well.
The cost of distributed transactions
- Engineering complexity: orchestration, idempotency, compensation, observability, operator tools. Months of work to build, ongoing cost to maintain.
- Latency: every cross-service hop adds to user-visible latency.
- Failure modes: more parts means more partial-failure cases to design for.
- Debugging: a failed multi-service workflow is harder to investigate than a failed single-service operation.
When the cost is worth it
- Money is at stake and the user expects atomicity (charge + ship, transfer between accounts).
- Compliance demands it (financial settlement, healthcare orders).
- Inconsistent state visible to users would be unrecoverable (user confused, support overwhelmed).
When the cost isn't worth it
- User-visible state can tolerate seconds-to-minutes of inconsistency.
- The "transaction" is actually two loosely-related events (e.g., "user signed up" → "send welcome email" - these don't need atomicity; if email fails, retry later).
- Eventual consistency is the natural model (e.g., search index updates, recommendation refresh, analytics events).
Patterns that avoid distributed transactions
- Outbox: write the local change and an event in the same DB transaction; background publisher emits the event. Downstream consumers eventually update. No coordinator needed.
- Event-driven materialized views: each service owns its data; cross-service views are eventually-consistent denormalizations updated from events.
- Inbox/outbox + idempotent consumers: at-least-once delivery + idempotent processing = effectively-once outcomes without coordinator.
- Reservation systems with timeouts: lightweight TCC where reservations naturally expire (e.g., a 15-minute checkout cart). Doesn't need a coordinator at all.
- Compensation as business policy: rather than auto-compensate on every failure, surface to an operator queue. Cheaper to build, often what users actually want anyway.
The "single service per transaction" refactor
Sometimes the right answer is to consolidate ownership: move two services' state into one service so the multi-service transaction becomes a single-service transaction. This is anathema to microservice purity but often the most reliable fix. Stripe, Square, and Coinbase all have core "ledger" services that own all financial state precisely so they don't need cross-service transactions.
The hybrid in practice
Real systems use distributed transactions for the small core of operations that demand them (checkout, payment, transfer) and eventual consistency for everything else (notifications, analytics, derived views, search indexes). The interview signal: candidates who can articulate the line, not just defend one side.
Trade-offs
2PC vs saga
2PC for shared-DB sharded transactions; saga for cross-service workflows. 2PC across services is a 2010 idea that didn't survive contact with cloud microservices.
Orchestration vs choreography
Orchestration for >3 steps and complex error handling; choreography for short event-driven flows where the "saga" is implicit. Default to orchestration; the debuggability dominates.
TCC vs compensation
TCC where reservations are natural (inventory, bookings, resource allocation); compensation where they aren't (irreversible actions, third-party calls). TCC has cleaner failure semantics; compensation is more flexible.
Coordinator-side vs participant-side idempotency
Coordinator-side keys (the orchestrator generates them) are easier to reason about; participant-side keys force every service to manage its own dedup table. Most systems do both - the coordinator owns the saga key, each participant owns its dedup table for the step.
Synchronous vs async sagas
Synchronous (orchestrator awaits each step) is easier to reason about and gives faster latency for happy paths. Async (events drive each step) scales better and survives orchestrator restarts more gracefully. Mature platforms (Temporal, Step Functions) hide the distinction.
Fast compensation vs forward recovery
Fast compensation (rollback everything on failure) is the default. Forward recovery (re-route through an alternative path) is more user-friendly but requires per-step alternatives - useful for high-value transactions (payment fallback to a secondary processor).
Strict vs lenient timeouts
Strict timeouts (compensate after N seconds) prevent stuck transactions but trigger compensation for transient delays. Lenient timeouts (give the long tail more room) let real failures linger. Match the timeout to user expectations and operator SLA.
Build vs buy a saga platform
Buy: Temporal, AWS Step Functions, Cadence, Camunda - they handle persistence, replay, monitoring, retries. Build: a custom orchestrator over Kafka. Below ~10 sagas, build feels lighter; above, the platform pays for itself in operability.
Eventual consistency vs distributed transaction
Default to eventual consistency. Reach for distributed transactions when money or compliance demands it. The senior signal is recognizing the cases where saga complexity isn't earning its keep.
Common follow-up questions
Be ready for at least three of these. The first one is almost always asked.
- ?What happens when a compensation fails 10 times in a row?
- ?How would you handle a saga that involves a webhook to a third party that doesn't support cancellation?
- ?What's your story for a coordinator deploy mid-saga?
- ?How do you observe and debug a saga stuck at step 3?
- ?How would you implement parallel saga steps (do A and B concurrently, then C)?
- ?What changes if the same transaction can be retried by the user (e.g., 'retry payment' button)?
- ?How would you migrate a 2PC-based legacy workflow to a saga without downtime?
- ?What's your strategy for capacity-planning the orchestrator for a 10x traffic spike?
Related system design topics
Payments
HardIdempotency keys, double-spend prevention, the ledger model, and why eventual consistency is wrong for balances. The interview where ambiguity costs you money.
Consensus
HardRaft leader election, log replication, snapshots - and the CAP theorem in operational practice. The substrate every other distributed system stands on.
Companies that test this topic
Practice in interview format
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
Start an AI mock interview →