gitGood.dev
Systems

Distributed Systems Patterns Cheat Sheet

SystemsFREELast updated: June 2026 · By gitGood Editorial

A reference for the theorems, consistency models, replication and partitioning strategies, delivery guarantees, and resilience patterns that come up in system design interviews.

CAP, PACELC, and consistency

The theorems that frame every distributed design decision.

CAP theorem (and PACELC)

CP (consistency + partition tolerance)

During a network partition the system rejects or blocks requests rather than serve stale or divergent data. You give up availability to stay correct. Examples: a strongly-consistent store, a system requiring quorum.

AP (availability + partition tolerance)

During a partition every node keeps serving, accepting reads/writes that may conflict or be stale, and reconciles later. You give up linearizable consistency to stay up. Examples: Dynamo-style stores, DNS.

PACELC extension

CAP only describes behavior during a Partition. PACELC adds: Else (no partition), you still trade Latency against Consistency. Even a healthy system pays coordination latency to be consistent.

When to choose each

Choose CP when correctness is non-negotiable (payments, inventory, locks). Choose AP when uptime and low latency matter more than freshness (feeds, carts, telemetry). Remember partitions are rare but inevitable, so the everyday choice is usually the PACELC latency-vs-consistency one.

Consistency models (strong to weak)

Linearizable (strong)

Every operation appears to take effect instantaneously at some point between its call and return, in real-time order. Reads always see the latest committed write. Most intuitive, most expensive - needs consensus/coordination.

Sequential

All clients see operations in the same single global order, and each client's own operations keep their program order - but that global order need not match real-time. Slightly weaker than linearizable.

Causal

Operations that are causally related (one could have influenced another) are seen in the same order by everyone; concurrent, unrelated operations may be seen in different orders. A practical sweet spot - preserves 'reply after question' without full coordination.

Eventual

If writes stop, all replicas eventually converge to the same value, but at any moment reads may be stale or out of order. Cheapest and most available. Often paired with conflict resolution (last-write-wins, CRDTs).

When to choose each

Linearizable/sequential for locks, leader election, and money. Causal for collaborative apps, comments, and messaging where order-within-a-thread matters. Eventual for caches, counts, presence, and large-scale feeds.

Quorum math

In a leaderless or quorum-replicated system with N replicas, a write must be acknowledged by W of them and a read must contact R of them. The key inequality is R + W > N: it forces the read set and the write set to overlap by at least one replica, so any read is guaranteed to see at least one copy of the most recent successful write. Common configurations: W = N and R = 1 makes reads fast but writes slow and fragile (any replica down blocks writes); R = N and W = 1 is the reverse; W = R = (N/2) + 1 (a majority quorum) balances both and is what consensus systems use. A second rule, W > N/2, prevents two conflicting writes from both succeeding by ensuring write sets overlap. Quorums tolerate failures: with N = 3 and W = R = 2 you survive one node down. Note that overlap guarantees you read a recent value but not necessarily the absolute newest if writes are concurrent - you may need version vectors or read-repair to reconcile.

Replication strategies

How writes propagate to replicas

Single-leader (leader-follower)

All writes go to one leader, which streams a replication log to read-only followers. Simple, no write conflicts, easy strong reads from the leader. But the leader is a write bottleneck and a failover point; followers can lag (stale reads).

Multi-leader

Several nodes accept writes and replicate to each other (e.g. one leader per region or for offline clients). Better write availability and locality, but concurrent writes to the same key create conflicts you must resolve (last-write-wins, merge functions, CRDTs).

Leaderless (Dynamo-style)

Clients write to and read from multiple replicas directly using quorums (R + W > N), with read-repair and anti-entropy to converge. Highly available and partition-tolerant, but the application must handle stale reads and concurrent-version reconciliation.

When to choose each

Single-leader is the default for most relational/OLTP systems. Multi-leader for multi-region write locality or offline-first clients. Leaderless when availability under partition trumps simplicity.

Delivery semantics

At-most-once

Each message is delivered zero or one time - never duplicated, but may be lost. Fire-and-forget. Acceptable for non-critical metrics where loss is cheaper than duplication.

At-least-once

Each message is delivered one or more times - never lost, but may be duplicated (retries after an un-acked delivery). The common default for queues. Requires idempotent consumers to be safe.

Exactly-once (effectively-once)

Each message takes effect exactly one time. True exactly-once network delivery is impossible; it is achieved as at-least-once delivery plus idempotency/deduplication on the consumer (dedup keys, transactional offsets).

When to choose each

Use at-least-once + idempotency for almost everything that matters (payments, orders). At-most-once only for lossy-tolerant telemetry. Treat 'exactly-once' as a consumer-side property, not a delivery guarantee.

Partitioning (sharding) strategies

How to split data across nodes so it scales horizontally.

Range partitioning
Assign contiguous key ranges to shards (A-F here, G-M there). Great for range scans and ordered queries, but prone to hotspots if keys are skewed (e.g. everyone writing the latest timestamp hits one shard).
Hash partitioning
Shard by hash(key) mod N. Spreads load evenly and avoids hotspots, but destroys key locality (range scans must hit every shard) and resharding when N changes moves almost all keys.
Consistent hashing
Map both keys and nodes onto a hash ring; a key belongs to the next node clockwise. Adding or removing a node only remaps the keys in its arc, not the whole dataset - the standard fix for hash partitioning's reshard problem.
Virtual nodes
Give each physical node many points on the ring so load is smoother and rebalancing on node changes is finer-grained. Used by Dynamo/Cassandra-style systems.
Directory / lookup-based
Keep an explicit map from key (or key-range) to shard in a lookup service. Most flexible (arbitrary placement, easy rebalancing) but the directory is an extra hop and a potential single point of failure.

Consensus and leader election

Consensus is getting a set of nodes to agree on a single value (or an ordered log of values) despite crashes and message delays. It underpins leader election, distributed locks, configuration, and replicated state machines. Paxos is the classic algorithm - provably correct but famously hard to understand and implement. Raft was designed to be understandable: it elects a single leader for a term, the leader appends entries to a replicated log and commits an entry once a majority of followers have acknowledged it, and followers redirect writes to the leader. If the leader fails, followers notice a missing heartbeat, increment the term, and hold an election; a candidate wins by collecting votes from a majority. Both rely on majority quorums, so a cluster of 2F + 1 nodes tolerates F failures (5 nodes survive 2 down) - which is why consensus clusters are sized at odd numbers like 3 or 5. The practical takeaway for interviews: do not roll your own consensus; lean on Raft-based systems (etcd, Consul, ZooKeeper's ZAB variant) to elect leaders and store critical metadata.

Distributed transactions

Saga vs two-phase commit (2PC)

Two-phase commit (2PC)

A coordinator asks every participant to PREPARE (vote and lock resources), then if all vote yes sends COMMIT, else ABORT. Gives atomic all-or-nothing across services with strong consistency, but holds locks across the round trip and blocks if the coordinator dies mid-protocol (a blocking protocol). Poor availability and scalability.

Saga

Model a long transaction as a sequence of local transactions, each with a compensating action that semantically undoes it. If step 4 fails, run the compensations for steps 3, 2, 1 in reverse. No distributed locks, high availability, but only eventual/atomicity-by-compensation - intermediate states are visible and you must design idempotent steps and compensations. Orchestrated (central coordinator) or choreographed (events).

When to choose each

2PC only within a tight trust/latency boundary where strong atomicity is required and the blocking risk is acceptable. Sagas for cross-service business workflows (order -> payment -> shipping) where availability matters and compensations make sense.

Resilience and integration patterns glossary

Outbox pattern
Write the business row and an 'event to publish' row in the same local DB transaction; a separate relay reads the outbox and publishes to the broker. Solves the dual-write problem (DB committed but message lost) and gives at-least-once publishing.
Change data capture (CDC)
Stream a database's commit/replication log (e.g. via Debezium) as a feed of change events. Lets downstream systems, caches, and search indexes react to data changes without the source service publishing them explicitly.
Circuit breaker
Wrap calls to a dependency; after a failure threshold the breaker OPENS and fails fast (no call) for a cooldown, then HALF-OPENS to test recovery. Prevents hammering a sick dependency and cascading failures.
Bulkhead
Isolate resources (thread pools, connection pools) per dependency so a failure or slowdown in one cannot exhaust capacity shared by the others - like watertight compartments in a ship.
Backpressure
Signal upstream producers to slow down when a consumer is saturated (bounded queues, flow-control, rejecting/shedding load) instead of buffering unbounded and crashing.
Idempotency key
A client-supplied unique key per logical operation; the server stores the result keyed by it so retries return the same outcome without re-applying side effects. The practical backbone of at-least-once + exactly-once-effect.
Retry with backoff + jitter
Retry transient failures with exponentially increasing delays plus randomization, so retries do not synchronize into a thundering herd. Pair with a retry budget and idempotency.

Failure-mode gotchas

The assumptions that bite distributed systems.

  • ·The network is not reliable - messages are lost, delayed, duplicated, and reordered; a missing response means 'unknown', not 'failed'. Design for ambiguity.
  • ·There is no global clock - wall-clock skew makes timestamp ordering unsafe; use logical clocks (Lamport), version vectors, or tightly-bounded clocks (TrueTime) for ordering.
  • ·The Fallacies of Distributed Computing - assuming zero latency, infinite bandwidth, a secure/homogeneous network, or zero transport cost will eventually break your design.
  • ·Split-brain - a partition can leave two sides each believing they are the leader and accepting writes; prevent it with quorum/fencing tokens, not just heartbeats.
  • ·Cascading failure - one slow dependency backs up callers' threads/queues until the whole system falls over; contain it with timeouts, circuit breakers, and bulkheads.
  • ·Thundering herd / retry storms - synchronized retries or cache expiries stampede a recovering service; use jittered backoff, request coalescing, and staggered TTLs.
  • ·Stale reads after failover - a promoted follower may be behind the old leader; reads can go backwards in time unless you read from a quorum or pin to the leader.
  • ·Duplicate delivery is normal - at-least-once means handlers WILL see repeats; non-idempotent side effects (charging a card twice) are a design bug, not bad luck.

Other cheat sheets

Big-O Reference

Algorithms

Time and space complexity for the data structures, sorting algorithms, and search routines that show up in coding interviews. Skim the row, remember the row, defend the row in an interview.

Interview Patterns

Patterns

The recurring shapes - sliding window, two pointers, fast/slow, BFS/DFS, backtracking, DP, divide & conquer, binary search variants, union-find, topological sort. Each entry: when to reach for it, the template, complexity, and which classic problems use it.

Design Tradeoffs

Systems

The recurring forks in system design interviews. CAP, PACELC, sync vs async, push vs pull, SQL vs NoSQL, sharding shapes, consistency models, cache strategies, idempotency, and rate limiting. For each, the options and when to choose each.

Unix Essentials

Tools

Filesystem layout, the commands you actually use (find / grep / awk / sed / xargs), processes and signals, networking, permissions, basic shell scripting, and a vi survival kit.

SQL Essentials

Tools

Query clause order, every JOIN type and when to use it, aggregates vs window functions, what indexes actually buy you, transaction isolation levels, and the NULL / WHERE-vs-HAVING / EXISTS-vs-IN gotchas interviewers fish for.

Git Essentials

Tools

The everyday commands, every undo scenario mapped to its fix, rebase vs merge with a side to pick, interactive rebase, bisect, the reflog safety net, stash, and the flags worth aliasing.

Docker & K8s

Tools

The docker and kubectl commands you reach for daily, Dockerfile best practices, how layer caching actually works, the core k8s objects in one screen, requests vs limits, liveness vs readiness, and a step-by-step CrashLoopBackOff debug flow.

REST API Design

Systems

Method semantics and idempotency, the ~15 status codes that matter, resource naming rules, offset vs cursor pagination, versioning and auth tradeoffs, error body conventions, rate-limit headers, and the smells reviewers flag.

STAR Method

Patterns

The STAR structure with timing, what interviewers actually grade, eight question archetypes and how to frame each, the anti-patterns that sink answers (rambling, "we" instead of "I", no metrics), and a 30-second answer skeleton.

Networking

Systems

TCP vs UDP, the TLS and TCP handshakes, HTTP versions, status codes, DNS resolution, the OSI and TCP/IP layer models, and the ports you are expected to know in an interview.

Regex

Tools

Anchors, character classes, quantifiers, groups, alternation, lookarounds, backreferences, and flags - plus practical patterns and the gotchas that trip people up in interviews.

Linux Perf

Tools

The USE method, a first-five-minutes triage runbook, and the CPU, memory, disk, network, and tracing commands you reach for when a Linux box is misbehaving.

Concurrency

Patterns

A fast reference for concurrency primitives, synchronization tradeoffs, the memory model, and the classic bugs that show up in systems interviews and real code.

Practice the patterns

Reading is the floor. The signal in interviews comes from working problems out loud and defending your tradeoffs. Spin up an AI mock interview or run a coding challenge to put these to work.