Idempotency keys, double-spend prevention, the ledger model, and why eventual consistency is wrong for balances. The interview where ambiguity costs you money.
Design a payments backend that accepts charge requests, talks to one or more payment processors (card networks, ACH, wallets), records the result in a durable ledger, and exposes status to callers via API responses and webhooks. Support refunds, partial refunds, retries on transient processor errors, and reconciliation against processor settlements.
This problem is graded on correctness obsession, not throughput. Strong candidates open with the two failure modes that ruin a payments system - duplicate charges and lost charges - and design every component to make each impossible. Weak candidates focus on scale and miss the actual interview signal.
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
Storage
Throughput
Network
Fanout
The system is built around three invariants that override everything else:
The architecture splits cleanly along the sync vs async boundary. Sync path: API gateway → idempotency check → ledger pre-write → processor call → ledger finalize → response. Async path: webhook delivery, settlement reconciliation, fraud scoring, notifications.
Critical observation: the processor call is the only step that touches money. Everything before is preparing to call it; everything after is recording what happened. The hardest design problem is recovering when the processor call succeeds but our process dies before recording the result.
Terminates TLS, enforces auth (API key / OAuth), rate limits per merchant, routes to the charge service. Strips and tokenizes raw card data so it never reaches our application logic (PCI scope reduction).
Lives in a separate, locked-down VPC with PCI compliance scope. Accepts raw card numbers (PAN), returns opaque tokens. Only this service has access to encryption keys for card data. Audited separately, smaller team.
Stores (idempotency_key, request_hash, response) tuples for 24 hours. Enforces 'same key + same request = same response; same key + different request = error'. Backed by a strongly consistent KV store (DynamoDB conditional write, Spanner).
The state machine. Stateless workers; state lives in the ledger and the idempotency store. Drives a charge through pending → processor_called → succeeded/failed. Owns the recovery logic for in-flight failures.
Source of truth for money. Append-only, double-entry bookkeeping. Every charge produces matched debit/credit entries. Strongly consistent (single-leader Postgres or Spanner). All balance queries derive from the ledger - never store balances separately.
Per-processor adapters (Stripe, Adyen, Braintree, ACH, wallets). Normalizes processor APIs into a common interface. Handles processor-specific retry semantics, error code mapping, and webhook signature verification.
Reliably delivers state-change events to merchants. Exponential backoff retries (1m, 5m, 30m, 2h, 8h, 24h, 72h). Signs payloads (HMAC). Tracks delivery state per merchant endpoint. Dead-letter after N attempts.
Async. Ingests processor settlement reports daily (Stripe balance transactions, Visa BASE II files), joins them against the ledger, and flags discrepancies for ops review. This is the check that our records and the network's records agree.
Synchronous (in the charge path) for high-risk decisions, async for ML-based scoring. Returns approve / review / decline. Decline at this layer never touches the processor (saves processor fees on obvious fraud).
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
Network failures are the default, not the exception. A merchant's server submits a charge, the response times out, the merchant retries. Without idempotency, the customer is charged twice. With idempotency, the second request returns the original response.
Client contract
Every mutating request includes an Idempotency-Key header (UUID). The contract, modeled on Stripe's: keys are honored for a 24-hour window, a replay returns the original response (including the original status code), and reusing a key with a different request body is rejected with an error (409 Conflict in this design).
Server-side flow
The in-progress trap
If our process dies after step 2 but before step 4, the key is stuck "in_progress" forever. Two solutions:
The processor query is the safety net: even if our records are wrong, we can ask Visa "did this auth succeed?" and recover ground truth.
Storage choice
DynamoDB conditional writes are perfect here. Spanner if you need cross-region linearizability. Don't use Redis - it's not durable enough for this; a Redis failure should not cause double-charges.
Key scope
Idempotency keys are scoped to (merchant_id, key) - merchants can use the same UUID without collision. Scope is critical for multi-tenant systems.
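The contract and key scoping described above can be sketched as follows. This is a minimal in-memory illustration, not a production store: the class name, the return labels ("new", "replay", "in_progress", "stale"), and the 60-second in-progress TTL are all assumptions made for the example, and a real implementation would perform the first write as a single conditional write against a strongly consistent store.

```python
import hashlib
import json
import time


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different request body."""


class IdempotencyStore:
    """In-memory stand-in for a strongly consistent KV store (hypothetical)."""

    def __init__(self, in_progress_ttl=60.0, clock=time.monotonic):
        self._records = {}  # (merchant_id, key) -> record
        self._ttl = in_progress_ttl  # assumed value, tune per system
        self._clock = clock

    @staticmethod
    def request_hash(body):
        # Canonical JSON so that key order does not change the hash
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def begin(self, merchant_id, key, body):
        """Return (status, response) for this request."""
        scoped = (merchant_id, key)  # keys are scoped per merchant
        digest = self.request_hash(body)
        record = self._records.get(scoped)
        if record is None:
            # A real store would do this check-and-insert atomically
            self._records[scoped] = {
                "hash": digest,
                "response": None,
                "started": self._clock(),
            }
            return ("new", None)
        if record["hash"] != digest:
            raise IdempotencyConflict(key)
        if record["response"] is not None:
            return ("replay", record["response"])
        if self._clock() - record["started"] > self._ttl:
            # Owner probably died: caller should query the processor
            return ("stale", None)
        return ("in_progress", None)

    def complete(self, merchant_id, key, response):
        self._records[(merchant_id, key)]["response"] = response
```

A caller that receives "stale" should not blindly re-run the charge; it should first ask the processor what happened to the original attempt.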
The ledger is the source of truth for money. Get this wrong and you cannot recover. The discipline is double-entry bookkeeping borrowed from accounting.
Double-entry model
Every transaction produces matched debits and credits across accounts. A $100 charge:
Sum across all accounts is always zero. Balance of any account = sum of entries.
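A small sketch of the zero-sum rule, assuming signed integer amounts in cents. The account names and the $2.90 fee below are hypothetical and chosen only to make the arithmetic concrete; they are not the breakdown from the original example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    txn_id: str
    account: str
    amount_cents: int  # signed: positive credits the account, negative debits it


class Ledger:
    def __init__(self):
        self._entries = []  # append-only: entries are never edited or removed

    def post(self, txn_id, legs):
        """Record one transaction. legs maps account -> signed amount in cents."""
        if sum(legs.values()) != 0:
            raise ValueError("unbalanced transaction: legs must sum to zero")
        self._entries.extend(Entry(txn_id, acct, amt) for acct, amt in legs.items())

    def balance(self, account):
        # Balance of any account is the sum of its entries
        return sum(e.amount_cents for e in self._entries if e.account == account)

    def total(self):
        # Invariant: always zero across all accounts
        return sum(e.amount_cents for e in self._entries)


ledger = Ledger()
# Hypothetical $100 charge with a $2.90 platform fee
ledger.post("txn_1", {
    "customer_card": -10000,
    "merchant_payable": 9710,
    "platform_fees": 290,
})
```

Because post rejects any set of legs that does not sum to zero, an entry that creates or destroys money cannot be written in the first place.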
Why double-entry
Append-only
The ledger is never updated. Refunds, reversals, adjustments are new entries that net against the original. This preserves history for auditors and regulators (mandatory under SOX, PCI, KYC/AML).
Strong consistency required
A user with $100 balance who initiates two simultaneous $80 charges should have one succeed and one fail - not both succeed leaving them at -$60. The naive read-then-write loses to the race; the correct primitive is a conditional update or a transactional ledger write with row-locking.
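For Postgres: SELECT FOR UPDATE on the account row, then INSERT into ledger_entries, then UPDATE the cached balance, all inside one transaction. The cached balance is a derived value kept in step with the ledger by that transaction; it is never an independent source of truth.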
For DynamoDB: optimistic concurrency on the account record (version number), retry on conflict.
For Spanner: just use a transaction; it handles serialization.
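The conditional-update primitive can be sketched with SQLite from Python's standard library standing in for Postgres. The table names, column names, and try_debit helper are assumptions for illustration, and the example runs the two charges back to back rather than truly concurrently. The point is that the balance check and the debit happen in one statement, so there is no read-then-write gap to race through.

```python
import sqlite3


def try_debit(conn, account_id, amount_cents, txn_id):
    """Debit the account only if funds suffice. Returns True on success."""
    with conn:  # one transaction: commit on normal exit, rollback on exception
        cur = conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents - ? "
            "WHERE id = ? AND balance_cents >= ?",
            (amount_cents, account_id, amount_cents),
        )
        if cur.rowcount != 1:
            return False  # insufficient funds (or unknown account)
        conn.execute(
            "INSERT INTO ledger_entries (txn_id, account_id, amount_cents) "
            "VALUES (?, ?, ?)",
            (txn_id, account_id, -amount_cents),
        )
        return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance_cents INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE ledger_entries (txn_id TEXT, account_id TEXT, amount_cents INTEGER)"
)
conn.execute("INSERT INTO accounts VALUES ('user_1', 10000)")  # $100 balance
conn.commit()

first = try_debit(conn, "user_1", 8000, "txn_a")   # first $80 charge
second = try_debit(conn, "user_1", 8000, "txn_b")  # second $80 charge
```

The second call matches zero rows because the remaining balance no longer satisfies the WHERE clause, so the account can never go negative.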
Why not eventually consistent?
Eventual consistency for balances means double-spends are possible during the inconsistency window. Even a 100ms window is enough to be exploited by a malicious user. The cost of strong consistency (10-50ms write latency, geographic constraints) is non-negotiable for the money-touching path. Reporting and analytics can be eventually consistent because they're read-only.
Sharding
Ledger sharded by account_id (or merchant_id). All entries for one account live on one shard - intra-account transactions are single-shard, fast. Cross-account transactions (between two merchants on the platform) need a 2-phase commit or a saga pattern - rare, but the design must accommodate.
The hardest failure mode in payments. We sent the charge to Visa. Visa charged the card. Our process crashed before recording the result. Now we have an unrecorded charge: the customer was charged, but our system thinks it never happened.
Why this is unavoidable
The processor call and our local commit cannot be one atomic operation. The network sits between them. Any sufficiently long pause between "charge succeeded" and "wrote to our ledger" is a window for failure.
Mitigation 1: pre-write before processor call
Before calling the processor, write a "pending" entry to the ledger with a unique processor_request_id. After the processor responds, update to "succeeded" or "failed". On crash recovery: scan for "pending" entries older than threshold, query the processor with processor_request_id, reconcile.
Mitigation 2: idempotent processor calls
Modern processors support idempotency keys themselves. Stripe accepts an Idempotency-Key on charge requests; replaying with the same key returns the original result without double-charging. Use this. The processor's idempotency window is your safety net for in-flight retries.
Mitigation 3: reconciliation
Daily reconciliation against processor settlement reports catches anything missed by the above. A charge in the processor's report but not in our ledger triggers an ops alert and a backfill.
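At its core, reconciliation is a set comparison between our ledger and the processor's report. The sketch below assumes both sides can be reduced to (processor_request_id, amount_cents) pairs; real settlement files carry many more fields, and the function and key names are hypothetical.

```python
def reconcile(ledger_rows, settlement_rows):
    """Compare two iterables of (processor_request_id, amount_cents) pairs."""
    ours = dict(ledger_rows)
    theirs = dict(settlement_rows)
    shared = set(ours) & set(theirs)
    return {
        # Processor charged, we have no record: alert and backfill
        "missing_in_ledger": sorted(set(theirs) - set(ours)),
        # We recorded a charge the processor never settled
        "missing_at_processor": sorted(set(ours) - set(theirs)),
        # Both sides know the charge but disagree on the amount
        "amount_mismatch": sorted(k for k in shared if ours[k] != theirs[k]),
    }


report = reconcile(
    ledger_rows=[("req_1", 10000), ("req_2", 2500), ("req_3", 400)],
    settlement_rows=[("req_1", 10000), ("req_2", 2000), ("req_4", 999)],
)
```

Any non-empty bucket in the result is something ops needs to look at.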
Mitigation 4: don't acknowledge to the merchant until ledger is durable
The order is: pre-write pending → call processor → write final state → respond to merchant. If we respond before the final write, the merchant might think the charge succeeded while our system has nothing recorded. The merchant should never see "succeeded" without our ledger reflecting it.
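The ordering and the crash-recovery scan can be sketched together. Everything here is a toy: FakeProcessor, ChargeService, and the crash_after_processor flag are hypothetical names used to simulate a crash between the processor call and the final write, and the age threshold on pending entries is omitted for brevity.

```python
import uuid


class FakeProcessor:
    """Hypothetical processor that honors idempotency keys."""

    def __init__(self):
        self.results = {}

    def charge(self, request_id, amount_cents):
        # Replaying the same request_id returns the original result
        return self.results.setdefault(
            request_id, {"status": "succeeded", "amount_cents": amount_cents}
        )

    def lookup(self, request_id):
        return self.results.get(request_id)


class ChargeService:
    def __init__(self, processor):
        self.processor = processor
        self.ledger = {}  # processor_request_id -> row

    def charge(self, amount_cents, crash_after_processor=False):
        request_id = str(uuid.uuid4())
        # 1. Pre-write a pending entry before any money moves
        self.ledger[request_id] = {"state": "pending", "amount_cents": amount_cents}
        # 2. Call the processor (the only step that touches money)
        result = self.processor.charge(request_id, amount_cents)
        if crash_after_processor:
            raise RuntimeError("simulated crash before finalize")
        # 3. Write the final state
        self.ledger[request_id]["state"] = result["status"]
        # 4. Only now respond to the merchant
        return request_id, result["status"]

    def recover(self):
        """Resolve pending entries by asking the processor for ground truth.

        Assumes every pending entry is old enough that no call is in flight.
        """
        recovered = []
        for request_id, row in self.ledger.items():
            if row["state"] != "pending":
                continue
            result = self.processor.lookup(request_id)
            row["state"] = result["status"] if result else "failed"
            recovered.append(request_id)
        return recovered
```

Because the pending row is written first, a crash at any later point leaves a breadcrumb that the recovery scan can follow back to the processor.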
The 0.001% case
Even with all of the above, there's a tiny window where the processor charged the card and we never recovered the response (processor outage during reconciliation, etc.). Final fallback: every charge has a customer_visible receipt path. A customer who notices a charge that doesn't appear in the merchant's system files a dispute, and ops investigates. This is the manual failsafe.
The interview signal: candidates who name this failure mode and discuss the layered defenses score significantly higher than those who treat the processor call as a single atomic operation.
Card data (PAN, CVV, expiry) is heavily regulated. Any system that touches it falls under PCI-DSS - quarterly audits, network segmentation, encryption everywhere, access logging, key rotation. The audit cost is enormous.
The architecture answer: minimize what touches card data
Most of the system should never see a PAN. Tokenization confines card data to a small, hardened subsystem.
Tokenization flow
This pushes ~95% of our infrastructure out of PCI scope. Only the processor's frontend SDK and our token-handling code are in scope.
Network-isolated tokenization service (if we are the processor)
If we are Stripe (the processor), we have to handle PANs ourselves. Architecture:
Vault encryption
Data at rest in the vault is encrypted with a per-record DEK (data encryption key) wrapped by a master KMS key. KMS is HSM-backed (AWS KMS, GCP KMS, or HashiCorp Vault with HSM). Rotating master keys requires re-wrapping DEKs; rotating DEKs requires re-encrypting records.
Why this matters at interview time
Interviewers from payments companies probe PCI awareness. Saying "we use Stripe Elements to tokenize on the client" or "we have a network-isolated vault" earns serious credibility. Saying "we encrypt the database column" reveals a candidate who hasn't worked in payments.
Webhooks are how the merchant learns about state changes (charge.succeeded, refund.completed, dispute.opened). They're an async distributed system in their own right, and a frequent source of merchant pain.
Delivery guarantee: at-least-once
We promise to deliver each event at least once. We do not promise exactly-once - that's the merchant's problem to handle via idempotent event_id processing.
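On the merchant side, handling at-least-once delivery usually comes down to deduplicating on event_id. A minimal sketch, assuming events arrive as dicts with event_id and type fields (the class and field layout are hypothetical):

```python
class WebhookConsumer:
    """Merchant-side handler that tolerates duplicate deliveries."""

    def __init__(self):
        self.seen_event_ids = set()
        self.applied = []  # stand-in for real side effects

    def handle(self, event):
        if event["event_id"] in self.seen_event_ids:
            return "duplicate"  # acknowledge, but do nothing
        self.applied.append(event["type"])
        self.seen_event_ids.add(event["event_id"])
        return "processed"


consumer = WebhookConsumer()
event = {"event_id": "evt_1", "type": "charge.succeeded"}
results = [consumer.handle(event), consumer.handle(event)]
```

In a real handler the dedupe record and the side effect should be committed in the same database transaction, so a crash between the two cannot cause the event to be applied twice or skipped.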
Retry schedule
Standard pattern: exponential backoff with jitter. Stripe's schedule is roughly: immediate, 5 min, 30 min, 2 hr, 12 hr, 24 hr, then 24-hour intervals until the 3-day retry window closes. Once the window is exhausted, mark the event as undeliverable and surface it in the merchant dashboard.
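Exponential backoff with jitter can be sketched in a few lines. The base delay, growth factor, per-attempt cap, and 72-hour window below are assumed parameters for illustration, not any provider's actual schedule; each delay is drawn uniformly between half of the current ceiling and the full ceiling.

```python
import random


def backoff_delays(
    base_seconds=60.0,
    factor=3.0,
    cap_seconds=24 * 3600.0,
    window_seconds=72 * 3600.0,
    rng=None,
):
    """Return retry delays (seconds) that fit inside the retry window."""
    rng = rng or random.Random()
    delays = []
    elapsed = 0.0
    attempt = 0
    while True:
        # Ceiling grows exponentially until it hits the per-attempt cap
        ceiling = min(cap_seconds, base_seconds * factor ** attempt)
        # Jitter spreads retries out so failing endpoints are not hit in lockstep
        delay = rng.uniform(ceiling / 2, ceiling)
        if elapsed + delay > window_seconds:
            break  # window exhausted: caller marks the event undeliverable
        delays.append(delay)
        elapsed += delay
        attempt += 1
    return delays


delays = backoff_delays(rng=random.Random(0))
```

Passing a seeded random.Random makes the schedule reproducible in tests while production code can rely on the default.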
Signing
Every webhook request includes an HMAC signature (Stripe sends it in the Stripe-Signature header) over the payload + timestamp using a per-merchant signing secret. Merchants verify the signature to confirm that the event genuinely came from us and is not a forged request aimed at their endpoint.
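Signing and verification can be sketched with Python's standard hmac module. The "timestamp.payload" layout of the signed string and the 5-minute tolerance are assumptions for this example rather than a specification of any provider's wire format.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, payload: bytes, timestamp: int) -> str:
    """HMAC-SHA256 over "<timestamp>.<payload>", hex-encoded."""
    signed = str(timestamp).encode() + b"." + payload
    return hmac.new(secret, signed, hashlib.sha256).hexdigest()


def verify(secret, payload, timestamp, signature, tolerance_seconds=300, now=None):
    """Return True only for a fresh, correctly signed payload."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance_seconds:
        return False  # too old or too far in the future: possible replay
    expected = sign(secret, payload, timestamp)
    # Constant-time comparison avoids leaking the signature via timing
    return hmac.compare_digest(expected, signature)


secret = b"whsec_example"  # hypothetical per-merchant signing secret
payload = b'{"event_id":"evt_1","type":"charge.succeeded"}'
ts = 1_700_000_000
sig = sign(secret, payload, ts)
```

Including the timestamp in the signed string is what lets the receiver reject replays of an old, otherwise valid request.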
Per-endpoint queuing
Each merchant endpoint has its own delivery queue. A slow or failing endpoint can't back-pressure other merchants' deliveries. Within an endpoint, events are delivered in order (per object_id ordering matters for state-machine consumers).
Failure modes the merchant will hit
Backfill
Merchants can request a replay of all events for a time window. Critical for new integrations (catch up from when they signed up) and for bug recovery (their system was down; replay to refill state).
Out-of-order delivery
Despite ordered queuing, retries of stuck events and network reordering mean a merchant can still see refund.created before charge.succeeded. Merchants must reconcile against final state, not assume in-order events.
The "charge" abstraction hides wildly different settlement timelines depending on payment method.
Cards (sync)
Authorization happens in real time (~1-2 seconds). Capture happens immediately or up to 7 days later. Settlement (money actually moves) happens 1-3 business days after capture. The merchant sees "succeeded" within seconds; the money lands days later.
ACH (async)
Authorization is essentially trust-based - we submit a debit request to the bank, and 3-5 business days later we learn whether it cleared. The "charge" succeeds initially as "pending" and only confirms days later. Reversals (NSF, returned ACH) can happen weeks later.
BNPL / installments
Customer is approved by the BNPL provider in real time (sync), but the merchant is paid in full upfront and the customer pays the BNPL provider in installments. From our system's view: sync at charge time, but the customer's actual payments happen over months.
Wallets (Apple Pay, Google Pay)
Sync from the user's perspective (Face ID / fingerprint). The underlying mechanism is a tokenized card, so the settlement timeline is the same as for cards.
Crypto (async, irreversible)
Customer initiates a transfer; we wait for N confirmations (10 min - 1 hour for BTC; less for ETH); then mark succeeded. Once confirmed, irreversible - no chargebacks.
Design implication: state machine per payment method
Each payment method has its own state machine. Charge.status can be pending, processing, requires_action, succeeded, failed, refunded. Different methods take different paths. The orchestrator dispatches to method-specific handlers.
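One way to express per-method state machines is a transition table keyed by payment method. The specific transitions below are illustrative assumptions built from the status names above, not a complete or authoritative model of any method.

```python
# Allowed transitions per payment method (illustrative, not exhaustive)
TRANSITIONS = {
    "card": {
        "pending": {"requires_action", "succeeded", "failed"},
        "requires_action": {"succeeded", "failed"},
        "succeeded": {"refunded"},
    },
    "ach": {
        "pending": {"processing", "failed"},
        "processing": {"succeeded", "failed"},
        "succeeded": {"refunded"},
    },
}


def advance(method, current, new):
    """Return the new status, or raise if the transition is not allowed."""
    allowed = TRANSITIONS.get(method, {}).get(current, set())
    if new not in allowed:
        raise ValueError(f"{method}: illegal transition {current} -> {new}")
    return new


# A card can succeed straight from pending; ACH must pass through processing
card_status = advance("card", "pending", "succeeded")
ach_status = advance("ach", advance("ach", "pending", "processing"), "succeeded")
```

Keeping the table as data means adding a payment method is a new dictionary entry rather than new branching logic in the orchestrator.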
Webhook implication
Async methods produce many state changes (created → pending → processing → succeeded → potentially_reversed). Each change is a webhook. Merchants need to handle the full lifecycle, not just "succeeded".
3DS / SCA (regulatory async)
European PSD2 requires Strong Customer Authentication for many card transactions. The flow: charge starts → bank requires 3DS challenge → user redirects to bank's auth page → returns to merchant → charge completes. This injects an async step into a previously-sync flow. State: requires_action → user_completed → succeeded.
Refunds are not just "reverse the charge". They have their own lifecycle, accounting, and edge cases.
Refund types
Refund flow
Settlement: the refund amount is deducted from the merchant's next payout. If the merchant's balance is insufficient, they may owe the platform - this triggers a separate collections flow.
Disputes (chargebacks)
The customer disputes a charge with their bank ("I didn't authorize this", "product never arrived"). The bank notifies the network (Visa), the network notifies us, and we notify the merchant. The merchant has ~7 days to provide evidence (receipts, shipping confirmation, customer communication). The network adjudicates.
Outcomes:
Accounting impact
Disputes create a hold on the merchant's funds for the dispute amount during adjudication. Lost disputes become a permanent debit. High dispute rates trigger network penalties (Visa's VAMP), which can disable card processing entirely.
Idempotency for refunds
Same as charges - merchant retries must not double-refund. Refund API accepts an idempotency key.
The reconciliation surface
Refunds, disputes, and processor fees all show up in daily settlement reports. Reconciliation must account for all of them - any unreconciled entry is an alert. Stripe's balance_transaction object is the canonical model: every event that moves money is a balance_transaction with a known type (charge, refund, dispute, fee, payout, adjustment).
Strong consistency vs latency
Money operations require strong consistency on the ledger. The cost is single-leader writes (Postgres primary, Spanner) and ~10-50ms write latency that you cannot trade away. Eventually consistent ledgers are not a real option for the source of truth - they create double-spends. Acknowledge the latency cost; defend the choice.
Build vs buy the processor
If you're a merchant: integrate Stripe / Adyen / Braintree. Building your own card processing requires acquiring bank partnerships, network certifications (Visa, Mastercard), PCI Level 1 compliance, and a team of compliance engineers. Multi-year, multi-million-dollar effort. The interview answer is almost always "buy"; explaining why is the signal.
Sync vs async charge confirmation
Sync (block until processor responds, ~1-2s) is the cleaner UX but couples your latency to the processor. Async (return "pending" immediately, push final state via webhook) decouples but requires merchants to handle async state. Cards are sync by default; high-volume merchants may prefer async for resilience.
Idempotency window length
Longer windows (Stripe: 24h) catch more retries and protect against more failures. Cost: idempotency-store memory and the operational complexity of replaying old responses verbatim. 24 hours is the industry consensus.
Centralized vs sharded ledger
A single global ledger is the simplest model but doesn't scale past ~10K writes/sec. Shard by account_id; cross-account transactions become 2PC or sagas. Most production payment systems shard but try hard to keep transactions intra-shard (e.g., merchant accounts and customer accounts on the same shard for Connect platforms).
PCI scope minimization vs feature surface
Tokenization confines card data but limits what we can do programmatically (we can't show the customer their full card number, can't analyze BIN ranges without a special path). The trade-off is heavily in favor of minimizing scope - PCI audits are expensive and intrusive.
Webhook delivery guarantees
At-least-once is industry standard and the right answer. Exactly-once would require coordination with the merchant (two-phase commit on each delivery), which is impractical. Merchants must build their webhook handlers to be idempotent on event_id - same as we expect callers to do for our charges API.
Be ready for at least three of these. The first one is almost always asked.