
Design a Payments / Checkout System (Stripe-style)


Idempotency keys, double-spend prevention, the ledger model, and why eventual consistency is wrong for balances. The interview where ambiguity costs you money.

The problem

Design a payments backend that accepts charge requests, talks to one or more payment processors (card networks, ACH, wallets), records the result in a durable ledger, and exposes status to callers via API responses and webhooks. Support refunds, partial refunds, retries on transient processor errors, and reconciliation against processor settlements.

This problem is graded on correctness obsession, not throughput. Strong candidates open with the two failure modes that ruin a payments system - duplicate charges and lost charges - and design every component to make each impossible. Weak candidates focus on scale and miss the actual interview signal.

Clarifying questions

Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.

  • Are we the merchant of record (Stripe) or the merchant (Shopify checkout integrating Stripe)? The trust boundary changes everything.
  • Which payment methods - cards only, ACH/SEPA, wallets (Apple/Google Pay), BNPL, crypto? Each has different async vs sync semantics.
  • What's the scale - charges/sec, peak Black-Friday burst, total dollar volume?
  • What's the latency budget for a card auth - p99 < 2s typical (network round-trip to Visa/Mastercard dominates).
  • Is full PCI scope acceptable, or must card data never touch our servers (tokenization required)?
  • Multi-currency? Multi-region? Cross-border settlement?
  • What's the dispute / chargeback flow - automated, manual review, or hybrid?
  • How long do we need to store transaction history - 7 years (financial retention) is the usual answer.

Requirements

Functional requirements

  • POST /charges - accept a charge request with idempotency key, return charge_id + status (succeeded / pending / failed)
  • POST /charges/{id}/refunds - full or partial refund of a captured charge
  • GET /charges/{id} - retrieve current status
  • Webhooks delivered to merchant on terminal state changes (charge.succeeded, charge.refunded, charge.disputed)
  • Async settlement: nightly batch reconciles processor reports against our ledger
  • Support for separate auth and capture (two-step) and combined auth+capture (one-step)

Non-functional requirements

Scale
10K charges/sec at steady-state peak, 100K charges/sec Black Friday burst; ~100M charges/day overall. 7-year retention = ~250B charge records. Multi-region active-active for resilience.
Latency
Charge p99 < 3s end-to-end (processor round-trip is 1-2s of this; everything else must fit in <1s). Refund p99 < 2s. Webhook delivery p99 < 30s after the underlying state change.
Availability
99.99% for the charge API. A 30-minute outage during a flash sale costs the merchant millions and ends contracts. The ledger and idempotency store must be more available than the API.
Consistency
Strongly consistent for balance, charge state, and idempotency. A charge either happened or it didn't - no eventually-consistent maybe. Reporting and analytics can be eventually consistent, but the source-of-truth ledger cannot.

Capacity estimation

Storage

  • Ledger entries: each charge produces 2-4 ledger entries (debit + credit, plus fees). 100M charges/day × 4 entries × 500 bytes = 200 GB/day = ~75 TB/year. Over 7 years with replication and indexes: ~2 PB.
  • Idempotency keys: 100M/day × 200 bytes = ~20 GB hot for the 24-hour retention window. Bigger if stored responses are large or retention is longer (Stripe holds keys for 24 hours by default).
  • Webhook delivery log: every attempt (success + retries). ~5x charge volume = 500M events/day × 1KB = 500 GB/day.

Throughput

  • 10K charges/sec steady, 100K peak. Each charge = 1 idempotency check + 1 processor RPC + N ledger writes + 1 webhook enqueue. Net DB ops: ~50K writes/sec steady, ~500K peak.
  • Webhook delivery: 50K/sec steady (charges + state changes). Retries amplify by ~3x worst case.

Network

  • Card auth round-trip to processor: ~1-2 seconds, ~5KB on the wire. 10K/sec × 5KB × 2 (request + response) = 100 MB/s outbound to processors. Per-region.
  • Internal RPCs: dominated by ledger writes. ~50KB/charge of internal traffic = 500 MB/s steady.

Fanout

  • Each charge state change can trigger multiple webhooks (merchant + Connect platform + auditor). Fanout factor: 1-5x. Plan for 5x of charge throughput on the webhook bus.

High-level architecture

The system is built around three invariants that override everything else:

  1. No duplicate charges - the same logical operation, retried by a flaky client, must charge the customer exactly once.
  2. No lost charges - if the customer's card was charged, our ledger must reflect it, even if our process crashed mid-flight.
  3. Reconcilable with the processor - at any point, our ledger must match what the processor says happened, modulo a known reconciliation window.

The architecture splits cleanly along the sync vs async boundary. Sync path: API gateway → idempotency check → ledger pre-write → processor call → ledger finalize → response. Async path: webhook delivery, settlement reconciliation, fraud scoring, notifications.

Critical observation: the processor call is the only step that touches money. Everything before is preparing to call it; everything after is recording what happened. The hardest design problem is recovering when the processor call succeeds but our process dies before recording the result.

API Gateway

Terminates TLS, enforces auth (API key / OAuth), rate limits per merchant, routes to the charge service. Strips and tokenizes raw card data so it never reaches our application logic (PCI scope reduction).

Tokenization service (PCI-isolated)

Lives in a separate, locked-down VPC with PCI compliance scope. Accepts raw card numbers (PAN), returns opaque tokens. Only this service has access to encryption keys for card data. Audited separately, smaller team.

Idempotency service

Stores (idempotency_key, request_hash, response) tuples for 24 hours. Enforces 'same key + same request = same response; same key + different request = error'. Backed by a strongly consistent KV store (DynamoDB conditional write, Spanner).

Charge orchestrator

The state machine. Stateless workers; state lives in the ledger and the idempotency store. Drives a charge through pending → processor_called → succeeded/failed. Owns the recovery logic for in-flight failures.

Ledger service

Source of truth for money. Append-only, double-entry bookkeeping. Every charge produces matched debit/credit entries. Strongly consistent (single-leader Postgres or Spanner). All balance queries derive from the ledger - never store balances separately.

Processor adapter pool

Per-processor adapters (Stripe, Adyen, Braintree, ACH, wallets). Normalizes processor APIs into a common interface. Handles processor-specific retry semantics, error code mapping, and webhook signature verification.

Webhook delivery service

Reliably delivers state-change events to merchants. Exponential backoff retries (1m, 5m, 30m, 2h, 8h, 24h, 72h). Signs payloads (HMAC). Tracks delivery state per merchant endpoint. Dead-letter after N attempts.

Reconciliation pipeline

Async. Daily, ingests processor settlement reports (Stripe balance transactions, Visa BASE II files). Joins against the ledger. Flags discrepancies for ops review. The trust bridge between our ledger and the network.

Fraud scoring service

Synchronous (in the charge path) for high-risk decisions, async for ML-based scoring. Returns approve / review / decline. Decline at this layer never touches the processor (saves processor fees on obvious fraud).

Deep dives

The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.

1. Idempotency: the foundation everything else stands on

Network failures are the default, not the exception. A merchant's server submits a charge, the response times out, the merchant retries. Without idempotency, the customer is charged twice. With idempotency, the second request returns the original response.

Client contract
Every mutating request includes an Idempotency-Key header (UUID). Stripe's contract: 24-hour window, replay returns the original response (including original status code), reusing a key with a different request body returns 409 Conflict.

Server-side flow

  1. Hash the request body. Compute (idempotency_key, request_hash).
  2. Conditional write to the idempotency store: SET key → (request_hash, status="in_progress", started_at) IF NOT EXISTS.
  3. If the conditional write fails (key exists):
    • If existing.request_hash != current.request_hash → 409 Conflict.
    • If existing.status == "completed" → return existing.response.
    • If existing.status == "in_progress" → return 409 with retry-after, OR block-and-poll (trade-offs below).
  4. Otherwise, process the charge. On completion, atomically update: SET key → (request_hash, status="completed", response).
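The four steps above can be sketched in a few lines. This is an illustrative stand-in, not a real API: a dict plus a lock plays the role of the strongly consistent KV store's conditional write, and names like handle_charge and process are assumptions.

```python
import hashlib
import json
import threading

# In-memory stand-in for a strongly consistent KV store
# (e.g. a DynamoDB table written with a conditional PutItem).
_store: dict = {}
_lock = threading.Lock()

def handle_charge(merchant_id: str, idem_key: str, body: dict, process):
    """Steps 1-4 from the flow above, returning (status_code, response)."""
    # Step 1: hash the request body; scope the key per merchant.
    req_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    store_key = (merchant_id, idem_key)

    # Step 2: conditional write - claims the key only if it is absent.
    with _lock:
        existing = _store.get(store_key)
        if existing is None:
            _store[store_key] = {"request_hash": req_hash, "status": "in_progress"}

    # Step 3: key already exists - decide based on its recorded state.
    if existing is not None:
        if existing["request_hash"] != req_hash:
            return 409, {"error": "key reused with different request body"}
        if existing["status"] == "completed":
            return 200, existing["response"]  # replay the original response
        return 409, {"error": "in_progress", "retry_after": 1}

    # Step 4: process the charge, then atomically record the response.
    response = process(body)
    with _lock:
        _store[store_key] = {"request_hash": req_hash, "status": "completed",
                             "response": response}
    return 200, response
```

A retried request with the same key and body replays the stored response; the same key with a different body is rejected.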

The in-progress trap
If our process dies after step 2 but before step 4, the key is stuck "in_progress" forever. Two solutions:

  • TTL on the in-progress state (e.g., 60 seconds). After expiry, it can be reclaimed.
  • Recovery worker: scan for in-progress keys older than threshold, query the processor for actual state, finalize.

The processor query is the safety net: even if our records are wrong, we can ask Visa "did this auth succeed?" and recover ground truth.

Storage choice
DynamoDB conditional writes are perfect here. Spanner if you need cross-region linearizability. Don't use Redis - it's not durable enough for this; a Redis failure should not cause double-charges.

Key scope
Idempotency keys are scoped to (merchant_id, key) - merchants can use the same UUID without collision. Scope is critical for multi-tenant systems.

2. Double-spend prevention and the ledger

The ledger is the source of truth for money. Get this wrong and you cannot recover. The discipline is double-entry bookkeeping borrowed from accounting.

Double-entry model
Every transaction produces matched debits and credits across accounts. A $100 charge:

  • DEBIT customer_account: -$100
  • CREDIT merchant_account: +$97 (after fees)
  • CREDIT platform_revenue: +$3 (fee)

Sum across all accounts is always zero. Balance of any account = sum of entries.
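The $100 example can be modeled directly; the dataclass and field names here are illustrative, not a real schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LedgerEntry:
    account: str
    amount_cents: int  # negative = debit, positive = credit

def charge_entries(amount_cents: int, fee_cents: int) -> list[LedgerEntry]:
    """The $100 charge above (with a $3 fee) as matched entries."""
    return [
        LedgerEntry("customer_account", -amount_cents),
        LedgerEntry("merchant_account", amount_cents - fee_cents),
        LedgerEntry("platform_revenue", fee_cents),
    ]

def balance(entries: list[LedgerEntry], account: str) -> int:
    # Balances are always derived from entries, never stored separately.
    return sum(e.amount_cents for e in entries if e.account == account)

entries = charge_entries(10_000, 300)
# Invariant: entries for any transaction net to zero.
assert sum(e.amount_cents for e in entries) == 0
```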

Why double-entry

  1. Auditability - every cent is accounted for, with two-sided proof.
  2. Reconciliation - matching against external reports is mechanical.
  3. Bug detection - if balances don't sum to zero, you have a bug, and you find it immediately.

Append-only
The ledger is never updated. Refunds, reversals, adjustments are new entries that net against the original. This preserves history for auditors and regulators (mandatory under SOX, PCI, KYC/AML).

Strong consistency required
A user with $100 balance who initiates two simultaneous $80 charges should have one succeed and one fail - not both succeed leaving them at -$60. The naive read-then-write loses to the race; the correct primitive is a conditional update or a transactional ledger write with row-locking.

For Postgres: SELECT FOR UPDATE on the account row, then INSERT into ledger_entries, then UPDATE the cached balance. Wrap in a transaction.

For DynamoDB: optimistic concurrency on the account record (version number), retry on conflict.

For Spanner: just use a transaction; it handles serialization.
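The conditional-update primitive can be demonstrated with SQLite standing in for the real store: the WHERE clause makes check-and-debit a single atomic statement, which is exactly what the naive read-then-write lacks. Table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance_cents INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('cust_1', 10000)")  # $100 balance

def try_charge(account_id: str, amount_cents: int) -> bool:
    # Check-and-debit in one statement: the row is only updated if the
    # balance is sufficient, so two racing charges cannot both pass.
    cur = conn.execute(
        "UPDATE accounts SET balance_cents = balance_cents - ? "
        "WHERE id = ? AND balance_cents >= ?",
        (amount_cents, account_id, amount_cents),
    )
    conn.commit()
    return cur.rowcount == 1  # 1 row updated = charge allowed

first = try_charge("cust_1", 8000)   # $80 - succeeds
second = try_charge("cust_1", 8000)  # $80 - fails, only $20 left
```

This is the two-simultaneous-$80-charges scenario from above: one succeeds, one is rejected, and the balance never goes negative.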

Why not eventually consistent?
Eventual consistency for balances means double-spends are possible during the inconsistency window. Even a 100ms window is enough to be exploited by a malicious user. The cost of strong consistency (10-50ms write latency, geographic constraints) is non-negotiable for the money-touching path. Reporting and analytics can be eventually consistent because they're read-only.

Sharding
Ledger sharded by account_id (or merchant_id). All entries for one account live on one shard - intra-account transactions are single-shard, fast. Cross-account transactions (between two merchants on the platform) need a 2-phase commit or a saga pattern - rare, but the design must accommodate.

3. The split-brain failure: processor said yes, we crashed

The hardest failure mode in payments. We sent the charge to Visa. Visa charged the card. Our process crashed before recording the result. Now we have an unrecorded charge - the customer was charged, our system thinks it never happened.

Why this is unavoidable
The processor call and our local commit cannot be one atomic operation. The network sits between them. Any sufficiently long pause between "charge succeeded" and "wrote to our ledger" is a window for failure.

Mitigation 1: pre-write before processor call
Before calling the processor, write a "pending" entry to the ledger with a unique processor_request_id. After the processor responds, update to "succeeded" or "failed". On crash recovery: scan for "pending" entries older than threshold, query the processor with processor_request_id, reconcile.
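A minimal sketch of the pre-write-then-recover pattern, with an in-memory dict standing in for the ledger and stub callables for the processor; all names are assumptions.

```python
import time
import uuid

ledger: dict[str, dict] = {}  # keyed by processor_request_id

def charge_with_prewrite(amount_cents: int, call_processor) -> str:
    # Generate the processor_request_id BEFORE the call, so recovery can
    # always ask the processor what happened to this exact request.
    req_id = str(uuid.uuid4())
    ledger[req_id] = {"status": "pending", "amount": amount_cents,
                      "started_at": time.time()}
    result = call_processor(req_id, amount_cents)  # process may crash after this
    ledger[req_id]["status"] = result              # "succeeded" or "failed"
    return req_id

def recover_stale_pendings(query_processor, threshold_secs: float = 60.0):
    """Crash recovery: reconcile stale pending entries against the processor."""
    now = time.time()
    for req_id, entry in ledger.items():
        if entry["status"] == "pending" and now - entry["started_at"] > threshold_secs:
            entry["status"] = query_processor(req_id)  # ground truth
```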

Mitigation 2: idempotent processor calls
Modern processors support idempotency keys themselves. Stripe accepts an Idempotency-Key on charge requests; replaying with the same key returns the original result without double-charging. Use this. The processor's idempotency window is your safety net for in-flight retries.

Mitigation 3: reconciliation
Daily reconciliation against processor settlement reports catches anything missed by the above. A charge in the processor's report but not in our ledger triggers an ops alert and a backfill.

Mitigation 4: don't acknowledge to the merchant until ledger is durable
The order is: pre-write pending → call processor → write final state → respond to merchant. If we respond before the final write, the merchant might think the charge succeeded while our system has nothing recorded. The merchant should never see "succeeded" without our ledger reflecting it.

The 0.001% case
Even with all of the above, there's a tiny window where the processor charged the card and we never recovered the response (processor outage during reconciliation, etc.). Final fallback: every charge has a customer-visible receipt path. Customers who notice a charge that doesn't appear in the merchant system file disputes; ops investigates. This is the manual failsafe.

The interview signal: candidates who name this failure mode and discuss the layered defenses score significantly higher than those who treat the processor call as a single atomic operation.

4. PCI scope reduction and tokenization

Card data (PAN, CVV, expiry) is heavily regulated. Any system that touches it falls under PCI-DSS - quarterly audits, network segmentation, encryption everywhere, access logging, key rotation. The audit cost is enormous.

The architecture answer: minimize what touches card data
Most of the system should never see a PAN. Tokenization confines card data to a small, hardened subsystem.

Tokenization flow

  1. Browser collects card data via Stripe.js / Elements (a JavaScript library hosted by the processor). The card data is sent directly from the browser to the processor; our servers never see it.
  2. Browser receives an opaque token (e.g., "tok_visa_4242"). Token is meaningless without processor's keys.
  3. Browser submits the token to our backend along with the order.
  4. Our backend uses the token in subsequent processor calls.

This pushes ~95% of our infrastructure out of PCI scope. Only the processor's frontend SDK and our token-handling code are in scope.

Network-isolated tokenization service (if we are the processor)
If we are Stripe (the processor), we have to handle PANs ourselves. Architecture:

  • Tokenization service in a separate VPC with no outbound internet, no SSH, encrypted disks, key management via HSM.
  • Application services call tokenization to convert PAN → token at write, token → PAN at processor-bound calls (network call to issuing bank).
  • All access to the tokenization service is logged and reviewed.
  • Application services store only tokens, never PANs.

Vault encryption
Data at rest in the vault is encrypted with a per-record DEK (data encryption key) wrapped by a master KMS key. KMS is HSM-backed (AWS KMS, GCP KMS, or HashiCorp Vault with HSM). Rotating master keys requires re-wrapping DEKs; rotating DEKs requires re-encrypting records.

Why this matters at interview time
Interviewers from payments companies probe PCI awareness. Saying "we use Stripe Elements to tokenize on the client" or "we have a network-isolated vault" earns serious credibility. Saying "we encrypt the database column" reveals a candidate who hasn't worked in payments.

5. Webhooks: at-least-once with the merchant's pain in mind

Webhooks are how the merchant learns about state changes (charge.succeeded, refund.completed, dispute.opened). They're an async distributed system in their own right, and a frequent source of merchant pain.

Delivery guarantee: at-least-once
We promise to deliver each event at least once. We do not promise exactly-once - that's the merchant's problem to handle via idempotent event_id processing.

Retry schedule
Standard pattern: exponential backoff with jitter. Stripe's schedule is roughly: immediate, 5 min, 30 min, 2 hr, 12 hr, 24 hr, then 24-hour intervals up to 3 days. Once retries are exhausted, mark the event undeliverable; surface it in the merchant dashboard.
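Backoff with full jitter can be sketched as below; the interval values are assumptions loosely modeled on the schedule described above, not Stripe's exact timings.

```python
import random

# Illustrative schedule: immediate, 5m, 30m, 2h, 12h, 24h (in seconds).
BASE_DELAYS_SECS = [0, 300, 1800, 7200, 43200, 86400]

def next_retry_delay(attempt: int, max_delay: int = 86400) -> float:
    base = BASE_DELAYS_SECS[attempt] if attempt < len(BASE_DELAYS_SECS) else max_delay
    # Full jitter: pick uniformly in [0, base] so a mass failure doesn't
    # produce synchronized retry storms against the merchant's endpoint.
    return random.uniform(0, base) if base else 0.0
```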

Signing
Every webhook request includes an HMAC signature (X-Stripe-Signature header) over the payload + timestamp using a per-merchant signing secret. Merchants verify the signature to ensure the event is genuinely from us, not a forged attack on their endpoint.
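Signing and verification can be sketched with Python's stdlib hmac; the timestamp-prefixed message format and tolerance window below are assumptions modeled on the scheme described, not Stripe's exact wire format.

```python
import hashlib
import hmac
import time

def sign(payload: bytes, timestamp: int, secret: bytes) -> str:
    # Sign timestamp + payload so a captured request can't be replayed later.
    msg = f"{timestamp}.".encode() + payload
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify(payload: bytes, timestamp: int, signature: str, secret: bytes,
           tolerance_secs: int = 300) -> bool:
    if abs(time.time() - timestamp) > tolerance_secs:
        return False  # stale timestamp: reject to block replay attacks
    expected = sign(payload, timestamp, secret)
    # Constant-time comparison prevents timing attacks on the signature.
    return hmac.compare_digest(expected, signature)
```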

Per-endpoint queuing
Each merchant endpoint has its own delivery queue. A slow or failing endpoint can't back-pressure other merchants' deliveries. Within an endpoint, events are delivered in order (per object_id ordering matters for state-machine consumers).

Failure modes the merchant will hit

  1. Their endpoint is slow → our retries pile up. Mitigation: per-endpoint concurrency limit (e.g., 10 in flight); circuit-breaker if error rate spikes.
  2. Their endpoint returns 5xx → retry per schedule.
  3. Their endpoint returns 4xx (other than 429) → still retry; many merchants accidentally return 400 for known events.
  4. Their endpoint is unreachable for 3 days → mark undeliverable. Merchant can replay missed events from the dashboard.

Backfill
Merchants can request a replay of all events for a time window. Critical for new integrations (catch up from when they signed up) and for bug recovery (their system was down; replay to refill state).

Out-of-order delivery
Despite ordered queuing, network reordering means merchants can see refund.created before charge.succeeded if they retry stuck events. Merchants must reconcile against final state, not assume in-order events.

6. Async vs sync flows: cards vs ACH vs BNPL

The "charge" abstraction hides wildly different settlement timelines depending on payment method.

Cards (sync)
Authorization happens in real time (~1-2 seconds). Capture happens immediately or up to 7 days later. Settlement (money actually moves) happens 1-3 business days after capture. The merchant sees "succeeded" within seconds; the money lands days later.

ACH (async)
Authorization is essentially trust-based - we submit a debit request to the bank, and 3-5 business days later we learn whether it cleared. The "charge" succeeds initially as "pending" and only confirms days later. Reversals (NSF, returned ACH) can happen weeks later.

BNPL / installments
Customer is approved by the BNPL provider in real time (sync), but the merchant is paid in full upfront and the customer pays the BNPL provider in installments. From our system's view: sync at charge time, but the customer's actual payments happen over months.

Wallets (Apple Pay, Google Pay)
Sync from the user's perspective (face ID / fingerprint). Underlying mechanism is a tokenized card; same settlement timeline as cards.

Crypto (async, irreversible)
Customer initiates a transfer; we wait for N confirmations (10 min - 1 hour for BTC; less for ETH); then mark succeeded. Once confirmed, irreversible - no chargebacks.

Design implication: state machine per payment method
Each payment method has its own state machine. Charge.status can be pending, processing, requires_action, succeeded, failed, refunded. Different methods take different paths. The orchestrator dispatches to method-specific handlers.
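The per-method state machines can be encoded as transition tables; the states and edges below are simplified assumptions, not a complete lifecycle.

```python
# Allowed transitions per payment method (illustrative, not exhaustive).
TRANSITIONS: dict[str, dict[str, set[str]]] = {
    "card": {"pending": {"requires_action", "succeeded", "failed"},
             "requires_action": {"succeeded", "failed"},  # 3DS challenge
             "succeeded": {"refunded"}},
    "ach":  {"pending": {"processing", "failed"},
             "processing": {"succeeded", "failed"},
             "succeeded": {"refunded", "failed"}},         # late ACH return
}

def advance(method: str, current: str, new: str) -> str:
    """Reject any transition not in the method's table."""
    if new not in TRANSITIONS[method].get(current, set()):
        raise ValueError(f"illegal {method} transition {current} -> {new}")
    return new
```

Note the ACH table allows succeeded → failed: a "successful" ACH debit can still be reversed weeks later, which a card state machine never has to model.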

Webhook implication
Async methods produce many state changes (created → pending → processing → succeeded → potentially_reversed). Each change is a webhook. Merchants need to handle the full lifecycle, not just "succeeded".

3DS / SCA (regulatory async)
European PSD2 requires Strong Customer Authentication for many card transactions. The flow: charge starts → bank requires 3DS challenge → user redirects to bank's auth page → returns to merchant → charge completes. This injects an async step into a previously-sync flow. State: requires_action → user_completed → succeeded.

7. Refunds, disputes, and reversal accounting

Refunds are not just "reverse the charge". They have their own lifecycle, accounting, and edge cases.

Refund types

  • Full refund: returns the full charge amount to the customer's card.
  • Partial refund: returns a portion. Multiple partial refunds can sum to the full amount.
  • Refund of fees: when supported by the processor; usually fees are not refunded to the merchant.

Refund flow

  1. Merchant calls POST /charges/{id}/refunds with amount.
  2. Validate: amount + sum(prior refunds) <= original charge amount.
  3. Write a "pending refund" ledger entry: DEBIT merchant_account, CREDIT customer_account.
  4. Call processor refund API with idempotency key.
  5. On success: update ledger entry to "completed". Webhook charge.refunded.
  6. On failure: update to "failed", reverse the ledger entries. Webhook refund.failed.
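The validation in step 2 is a one-liner worth getting right: multiple partial refunds must never sum past the original charge. A sketch, with assumed names:

```python
def validate_refund(charge_amount_cents: int, prior_refunds_cents: list[int],
                    refund_amount_cents: int) -> None:
    """Raise if this refund would over-refund the original charge."""
    if refund_amount_cents <= 0:
        raise ValueError("refund amount must be positive")
    if refund_amount_cents + sum(prior_refunds_cents) > charge_amount_cents:
        raise ValueError("refunds would exceed original charge amount")
```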

Settlement: the refund amount is deducted from the merchant's next payout. If the merchant's balance is insufficient, they may owe the platform - this triggers a separate collections flow.

Disputes (chargebacks)
The customer claims a charge to their bank ("I didn't authorize this", "product never arrived"). Bank notifies the network (Visa); network notifies us; we notify the merchant. Merchant has ~7 days to provide evidence (receipts, shipping confirmation, customer communication). Network adjudicates.

Outcomes:

  • Merchant wins → funds returned.
  • Merchant loses → customer keeps the funds; merchant also pays a chargeback fee (~$15).

Accounting impact
Disputes create a hold on the merchant's funds for the dispute amount during adjudication. Lost disputes become a permanent debit. High dispute rates trigger network penalties (Visa's VAMP program) which can disable card processing entirely.

Idempotency for refunds
Same as charges - merchant retries must not double-refund. Refund API accepts an idempotency key.

The reconciliation surface
Refunds, disputes, and processor fees all show up in daily settlement reports. Reconciliation must account for all of them - any unreconciled entry is an alert. Stripe's balance_transaction object is the canonical model: every event that moves money is a balance_transaction with a known type (charge, refund, dispute, fee, payout, adjustment).

Trade-offs

Strong consistency vs latency
Money operations require strong consistency on the ledger. The cost is single-leader writes (Postgres primary, Spanner) and ~10-50ms write latency that you cannot trade away. Eventually consistent ledgers are not a real option for the source of truth - they create double-spends. Acknowledge the latency cost; defend the choice.

Build vs buy the processor
If you're a merchant: integrate Stripe / Adyen / Braintree. Building your own card processing requires acquiring bank partnerships, network certifications (Visa, Mastercard), PCI Level 1 compliance, and a team of compliance engineers. Multi-year, multi-million-dollar effort. The interview answer is almost always "buy"; explaining why is the signal.

Sync vs async charge confirmation
Sync (block until processor responds, ~1-2s) is the cleaner UX but couples your latency to the processor. Async (return "pending" immediately, push final state via webhook) decouples but requires merchants to handle async state. Cards are sync by default; high-volume merchants may prefer async for resilience.

Idempotency window length
Longer windows (Stripe: 24h) catch more retries and protect against more failures. Cost: idempotency-store memory and the operational complexity of replaying old responses verbatim. 24 hours is the industry consensus.

Centralized vs sharded ledger
A single global ledger is the simplest model but doesn't scale past ~10K writes/sec. Shard by account_id; cross-account transactions become 2PC or sagas. Most production payment systems shard but try hard to keep transactions intra-shard (e.g., merchant accounts and customer accounts on the same shard for Connect platforms).

PCI scope minimization vs feature surface
Tokenization confines card data but limits what we can do programmatically (we can't show the customer their full card number, can't analyze BIN ranges without a special path). The trade-off is heavily in favor of minimizing scope - PCI audits are expensive and intrusive.

Webhook delivery guarantees
At-least-once is industry standard and the right answer. Exactly-once would require coordination with the merchant (two-phase commit on each delivery), which is impractical. Merchants must build their webhook handlers to be idempotent on event_id - same as we expect callers to do for our charges API.
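The merchant-side idempotency this paragraph asks for can be as small as a dedupe on event_id; this sketch uses an in-memory set where a real handler would use a durable store.

```python
processed_event_ids: set[str] = set()

def handle_webhook(event: dict) -> bool:
    """Return True if newly processed, False if a duplicate delivery."""
    if event["id"] in processed_event_ids:
        return False  # at-least-once delivery: drop the duplicate
    processed_event_ids.add(event["id"])
    # ... apply the state change for event["type"] ...
    return True
```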

Common follow-up questions

Be ready for at least three of these. The first one is almost always asked.

  • How do you handle a processor outage during peak Black Friday traffic?
  • What changes if we need to support 50+ currencies and FX conversion?
  • How do you design the ledger for a Connect-style platform (Stripe Connect, Shopify Payments) with destination charges?
  • What's your fraud detection architecture and how does it integrate with the charge path latency budget?
  • How would you migrate from a single-region ledger to multi-region active-active without losing transactions?
  • How do you handle a customer who claims duplicate charges - what's your audit trail?
  • What's your strategy for processor failover (primary processor down → route to backup)?
  • How would you support subscription billing on top of this charge primitive?

Practice in interview format

Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
