gitGood.dev

System Design Interview Cheat Sheet: 15 Patterns You Need to Know

Pat
20 min read

There are hundreds of "distributed systems patterns" out there. You don't need to know them all. You need to know the ones that actually come up in interviews - and more importantly, you need to know when and why to reach for each one.

These 15 patterns are the building blocks. They cover maybe 90% of what you'll encounter in a system design round. Some are basic (load balancing), some are more advanced (CQRS, write-ahead logs), but they all share one thing: interviewers expect you to know them and apply them at the right time.

Bookmark this. Come back to it before your interview. Let's go.

1. Load Balancing

A load balancer distributes incoming traffic across multiple servers so no single machine gets overwhelmed. It sits between clients and your application servers, routing requests based on some strategy.

When to use it: Basically always. The moment you have more than one server - which is almost immediately in any system design answer - you need a load balancer.

Key strategies:

  • Round-robin - simple rotation through servers. Works when servers are identical.
  • Least connections - sends traffic to the server handling the fewest requests. Better for uneven workloads.
  • Weighted - assigns more traffic to beefier servers.
  • IP hash - routes the same client IP to the same server. Useful for session affinity.
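As a quick sketch, here's what round-robin and IP-hash selection look like in a few lines of Python (the server names and pool size are made up for illustration):

```python
import hashlib
import itertools

servers = ["app-1:8080", "app-2:8080", "app-3:8080"]  # hypothetical backend pool

_cycle = itertools.cycle(servers)

def round_robin() -> str:
    # Simple rotation: each call hands back the next server in order.
    return next(_cycle)

def ip_hash(client_ip: str) -> str:
    # Same client IP always lands on the same server (session affinity).
    digest = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]
```

Real load balancers layer health checks and connection tracking on top of this, but the core selection logic really is this small.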

Tradeoffs: A load balancer is itself a single point of failure. That's why production systems use multiple load balancers (active-passive or active-active). Also, if you need session stickiness (like keeping a user on the same server), round-robin won't work - you'll need IP hashing or external session storage.

Real-world example: AWS uses Elastic Load Balancers in front of basically everything. Netflix distributes billions of requests per day across server fleets using a combination of DNS-based and application-level load balancing.

2. Caching

Caching stores frequently accessed data in a fast layer (usually in-memory) so you don't have to hit the database every time. It's the single biggest performance lever you have.

When to use it: When you have data that's read far more often than it's written, when database queries are expensive, or when you need to reduce latency for hot data.

Cache invalidation strategies - the hard part:

  • Cache-aside (lazy loading) - app checks cache first, falls back to DB, then populates cache. Simple but can serve stale data.
  • Write-through - writes go to cache and DB simultaneously. Always consistent, but slower writes.
  • Write-behind (write-back) - writes go to cache first, DB gets updated asynchronously. Fast writes, but you risk data loss if the cache crashes.
  • TTL (time-to-live) - data expires after a set time. Simple to implement, but you accept some staleness.
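Cache-aside with a TTL is the combination you'll reach for most often. A minimal sketch (the dict-backed "database" and 5-minute TTL are stand-ins):

```python
import time

db = {"user:1": {"name": "Ada"}}  # stand-in for the real database
cache: dict = {}                  # key -> (expires_at, value)
TTL_SECONDS = 300

def get_user(key):
    entry = cache.get(key)
    if entry and entry[0] > time.time():
        return entry[1]                    # cache hit, not yet expired
    value = db.get(key)                    # miss: fall back to the database
    cache[key] = (time.time() + TTL_SECONDS, value)  # populate for next time
    return value
```

Note the failure mode baked into this simplicity: between expiry and repopulation, every concurrent request falls through to the database, which is exactly the thundering-herd problem discussed below.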

Tradeoffs: Cache invalidation is famously one of the two hard problems in computer science. The moment your data changes, every cached copy is potentially stale. You need to think about: what happens on a cache miss? What if the cache goes down? How do you avoid a thundering herd (thousands of requests hitting the DB at once when the cache expires)?

Real-world example: Twitter uses Redis clusters to cache timelines. Instead of assembling a timeline from scratch on every request (which would mean querying hundreds of users' tweets), they precompute and cache it. Facebook's Memcached deployment handles billions of cache requests per second.

3. Database Sharding / Partitioning

Sharding splits your database across multiple machines, with each machine holding a subset of the data. Instead of one massive database, you have many smaller ones.

When to use it: When a single database can't handle your data volume or throughput. If you have billions of rows or thousands of writes per second, vertical scaling (bigger machine) eventually hits a wall.

Common strategies:

  • Range-based - shard by ranges (users A-M on shard 1, N-Z on shard 2). Simple but can create hot spots.
  • Hash-based - hash the shard key and distribute evenly. Better distribution, harder to do range queries.
  • Geographic - shard by region. Great for latency, complex for global queries.
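Hash-based sharding is often as simple as hashing the shard key modulo the shard count. A toy version (shard count chosen arbitrarily):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(user_id: int) -> int:
    # Hash the shard key so users spread evenly across shards,
    # rather than clustering the way raw IDs or name ranges would.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

The catch: `% NUM_SHARDS` means changing the shard count remaps almost every key, which is precisely the problem consistent hashing (pattern 9) exists to solve.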

Tradeoffs: Sharding adds serious complexity. Cross-shard queries become expensive or impossible. Joins across shards? Forget about it. Rebalancing shards when you add capacity is painful. And picking the wrong shard key is a decision that's very hard to undo.

Real-world example: Instagram shards its PostgreSQL databases by user ID. Each logical shard maps to a physical database, and they use consistent hashing to distribute users across shards. Discord shards by guild (server) ID since most queries are scoped to a single server anyway.

4. Database Replication (Read Replicas)

Replication copies your database to one or more replica servers. The primary handles writes; replicas handle reads. This multiplies your read capacity without touching the write path.

When to use it: When your workload is read-heavy (most apps are). If you have a 10:1 read-to-write ratio, adding 3 read replicas roughly quadruples your read throughput.

Replication types:

  • Synchronous - write isn't confirmed until all replicas have it. Strong consistency, higher latency.
  • Asynchronous - write is confirmed immediately, replicas catch up later. Lower latency, but replicas can be slightly behind.
  • Semi-synchronous - at least one replica must confirm. A middle ground.

Tradeoffs: Replication lag is the big one. If a user writes something and immediately reads it from a replica that hasn't caught up, they'll see stale data. Common fix: read-after-write consistency - route the user's own reads to the primary for a short window after a write.
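The read-after-write fix can be sketched as a small routing rule. This assumes a hypothetical pinning window and an in-memory map of last-write times (in production that state would live with the session or in Redis):

```python
import time

PIN_TO_PRIMARY_SECONDS = 5.0        # hypothetical tuning knob
last_write_at: dict = {}            # user_id -> timestamp of last write

def record_write(user_id: str) -> None:
    last_write_at[user_id] = time.time()

def pick_target(user_id: str) -> str:
    # Route the user's own reads to the primary briefly after a write,
    # so they never read a replica that hasn't caught up yet.
    if time.time() - last_write_at.get(user_id, 0.0) < PIN_TO_PRIMARY_SECONDS:
        return "primary"
    return "replica"
```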

Real-world example: GitHub uses MySQL replication extensively. Primary databases handle writes, and a fleet of read replicas serves the vast majority of page loads. Slack uses PostgreSQL replicas for read-heavy queries like message history.

5. Message Queues / Event-Driven Architecture

A message queue decouples producers (things that create work) from consumers (things that process work). Instead of service A calling service B directly, A drops a message on a queue and B picks it up when it's ready.

When to use it: When you need to decouple services, handle traffic spikes without dropping requests, or process work asynchronously. If the work doesn't need to happen in real-time from the user's perspective, it belongs in a queue.

Key concepts:

  • Point-to-point - one producer, one consumer. Think task queues.
  • Pub/sub - one message, many consumers. Think event notifications.
  • Dead letter queue - where failed messages go so they don't block the pipeline.

Tradeoffs: Queues add eventual consistency - the work will get done, but not instantly. You also need to handle idempotency (what if a message gets processed twice?), message ordering (not always guaranteed), and monitoring (a growing queue means consumers can't keep up).
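Idempotency usually comes down to tracking which message IDs you've already processed. A bare-bones consumer sketch (the in-memory set would be Redis or a database table in a real system):

```python
processed: set = set()  # in production: Redis SET or a unique-keyed DB table

def handle(message: dict) -> bool:
    msg_id = message["id"]
    if msg_id in processed:
        return False         # duplicate delivery: skip, don't double-charge
    # ... do the actual side-effecting work here ...
    processed.add(msg_id)
    return True
```

This is why message producers should attach a stable ID to every message: without one, the consumer has nothing to deduplicate on.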

Real-world example: Uber uses Apache Kafka to process billions of events daily - trip updates, pricing changes, driver locations. LinkedIn (which built Kafka) processes over 7 trillion messages per day through their Kafka clusters. Every time you send a Slack message, it flows through a message queue before being persisted and delivered.

6. CDN (Content Delivery Network)

A CDN is a network of servers distributed around the world that cache and serve content close to users. Instead of every request traveling to your origin server in Virginia, a user in Tokyo gets the response from a server in Tokyo.

When to use it: For any static content (images, CSS, JS, videos) and increasingly for dynamic content too. If your users are geographically distributed, a CDN is non-negotiable.

Types:

  • Pull CDN - CDN fetches from your origin on the first request, then caches it. Simple to set up.
  • Push CDN - you proactively upload content to the CDN. More control, more management overhead.

Tradeoffs: Cache invalidation (again). When you deploy a new version of your app, how do you bust the CDN cache? Common approaches: versioned URLs (style.v2.css), cache-control headers, or explicit purge APIs. Also, CDN costs can surprise you at scale - you're paying for bandwidth from every edge location.

Real-world example: Netflix serves over 100 million hours of video daily through its Open Connect CDN. They actually place custom servers inside ISP networks to be as close to viewers as possible. Cloudflare's CDN handles roughly 20% of all web traffic.

7. API Gateway

An API gateway is a single entry point that sits in front of your backend services. It handles cross-cutting concerns like authentication, rate limiting, routing, and request transformation so your services don't have to.

When to use it: When you have multiple backend services and need a unified API for clients. Especially important in microservices architectures where you don't want clients calling 15 different services directly.

Common responsibilities:

  • Request routing (URL path to specific service)
  • Authentication and authorization
  • Rate limiting and throttling
  • Request/response transformation
  • API versioning
  • Logging and monitoring

Tradeoffs: The gateway becomes a critical chokepoint. If it goes down, everything goes down. It can also become a bottleneck if not scaled properly. There's also the organizational question: who owns the gateway? If every team needs gateway changes to deploy, you've created a coordination bottleneck.

Real-world example: Amazon API Gateway handles the API layer for a huge chunk of AWS services. Kong and NGINX are widely used open-source API gateways. Stripe routes all API traffic through a gateway layer that handles authentication, rate limiting, and idempotency key management.

8. Rate Limiting

Rate limiting restricts how many requests a client can make in a given time window. It protects your services from abuse, prevents cascading failures, and ensures fair resource allocation.

When to use it: On any public-facing API. Always. Also useful internally to prevent one service from overwhelming another.

Common algorithms:

  • Token bucket - tokens accumulate at a fixed rate; each request costs a token. Allows bursts.
  • Sliding window - counts requests in a rolling time window. Smoother than fixed windows.
  • Fixed window - counts requests in fixed intervals (e.g., 100 per minute). Simple but has edge cases at window boundaries.
  • Leaky bucket - processes requests at a fixed rate, queuing excess. Smoothest output.
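Token bucket is the one worth being able to write on a whiteboard. A compact sketch (rate and capacity are illustrative):

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # max burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1              # spend a token for this request
            return True
        return False                      # out of tokens: reject (HTTP 429)
```

Because tokens accumulate while traffic is quiet, a client can burst up to `capacity` requests at once — that's the property that distinguishes token bucket from leaky bucket.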

Tradeoffs: Too aggressive and you block legitimate users. Too lenient and it doesn't protect you. In distributed systems, you need centralized rate limiting (usually Redis-backed) or you'll have inconsistencies across servers. Also decide what to return: a 429 status code with a Retry-After header is the standard approach.

Real-world example: GitHub's API allows 5,000 requests per hour for authenticated users. Stripe rate-limits API calls and returns clear headers showing your remaining quota. Twitter's API has notoriously strict rate limits that any developer who's built a Twitter integration has bumped into.

9. Consistent Hashing

Consistent hashing is a technique for distributing data across nodes so that adding or removing a node only requires remapping a small fraction of keys, not everything. Traditional hashing (key % N) breaks horribly when N changes.

When to use it: When you're distributing data across nodes that can change dynamically - cache clusters, database shards, or any distributed storage where nodes join and leave.

How it works: Imagine a ring (0 to 2^32). Each node is placed on the ring at a hashed position. Each key is placed on the ring and assigned to the next node clockwise. When a node is added or removed, only the keys between it and the previous node need to be remapped.

Tradeoffs: Basic consistent hashing can lead to uneven distribution. The fix is virtual nodes - each physical node gets multiple positions on the ring. More virtual nodes means better distribution but more memory for the routing table.
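The ring with virtual nodes fits in a short class. This sketch uses MD5 positions and `bisect` for the clockwise walk (the vnode count is an arbitrary choice):

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        ring = []
        for node in nodes:
            for i in range(vnodes):        # virtual nodes smooth distribution
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._positions = [pos for pos, _ in ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Find the next node position clockwise; wrap around past the end.
        idx = bisect.bisect(self._positions, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Adding a node means inserting its vnode positions into the ring; only keys that now fall between the new positions and their predecessors move.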

Real-world example: DynamoDB uses consistent hashing to distribute data across its storage nodes. Cassandra uses it for its partitioning strategy. Discord uses consistent hashing to route users to the correct WebSocket server.

10. Microservices vs Monolith

A monolith is a single deployable unit containing all your application logic. Microservices break that into independent, separately deployable services, each owning its own data and business logic.

When to use which:

  • Start monolith when: you're a small team, moving fast, and don't yet know where the service boundaries should be.
  • Move to microservices when: teams are stepping on each other, deployments are risky because one change can break everything, or different components have wildly different scaling needs.

Key principles for microservices:

  • Each service owns its data (no shared databases)
  • Services communicate through well-defined APIs or events
  • Each service can be deployed independently
  • Failure in one service shouldn't cascade

Tradeoffs: Microservices trade code complexity for operational complexity. You now need service discovery, distributed tracing, circuit breakers, and consistent deployment pipelines for dozens of services. Network calls are way slower than function calls. And distributed transactions are genuinely hard.

Real-world example: Amazon famously evolved from a monolith to microservices in the early 2000s - the "two-pizza team" rule came directly from this. Shopify, on the other hand, still runs a large modular monolith and has been very public about why that works for them.

11. CQRS (Command Query Responsibility Segregation)

CQRS separates your read and write models. Instead of one model that handles both reading and writing data, you have two: a write model optimized for updates and a read model optimized for queries.

When to use it: When your read and write patterns are dramatically different. If your writes are complex domain operations but your reads are simple aggregations across many entities, forcing both through the same model creates painful compromises.

How it typically works:

  • Write side: handles commands, validates business rules, persists to a write-optimized store
  • Read side: maintains denormalized views optimized for specific queries
  • An event or sync mechanism keeps the read side updated
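Stripped to its essence, the split looks like this toy banking sketch (the event shapes and projection logic are invented for illustration):

```python
events = []       # write side persists events
read_view = {}    # denormalized read model, eventually consistent

def handle_deposit(account: str, amount: int) -> None:
    # Command side: validate the business rule, then persist an event.
    if amount <= 0:
        raise ValueError("deposit must be positive")
    events.append({"type": "deposited", "account": account, "amount": amount})

def project() -> None:
    # Sync mechanism: fold events into the read-optimized view.
    read_view.clear()
    for e in events:
        read_view[e["account"]] = read_view.get(e["account"], 0) + e["amount"]
```

Queries hit `read_view` directly — no validation, no joins, just a precomputed answer. The lag between appending an event and running the projection is the eventual consistency you sign up for.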

Tradeoffs: You now have two models to maintain, and the read side is eventually consistent with the write side. The complexity is significant - don't use CQRS for simple CRUD apps. It shines in event-sourced systems or domains where reads and writes have fundamentally different shapes.

Real-world example: Many financial systems use CQRS - the write side processes transactions with strict validation, while the read side serves account balances and statements from precomputed views. Microsoft Azure's documentation heavily promotes CQRS for complex domain-driven applications.

12. Circuit Breaker

A circuit breaker monitors calls to a downstream service and "trips" when failures exceed a threshold. Once tripped, it immediately returns an error (or a fallback response) instead of continuing to hammer a service that's clearly struggling.

When to use it: Between any two services where the downstream service might fail. Especially critical for services that have lots of dependencies - without circuit breakers, one slow downstream service can bring down your entire system through thread exhaustion.

The three states:

  • Closed - normal operation, requests flow through. Failures are counted.
  • Open - too many failures, all requests are immediately rejected. A timer starts.
  • Half-open - after the timer, a few test requests are allowed through. If they succeed, the circuit closes. If they fail, it opens again.
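The state machine above translates almost directly into code. A minimal sketch (threshold and timeout values are placeholders you'd tune):

```python
import time

class CircuitBreaker:
    def __init__(self, threshold: int = 3, reset_timeout: float = 30.0):
        self.threshold = threshold        # failures before tripping
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = 0.0
        self.state = "closed"

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # allow a test request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"       # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"             # success resets everything
        return result
```

Libraries like Resilience4j add rolling failure-rate windows and per-endpoint configuration, but this is the core loop.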

Tradeoffs: You need to tune the failure threshold and timeout carefully. Too sensitive and the circuit trips on normal fluctuations. Too lenient and it doesn't protect you. You also need to think about what happens when the circuit is open - show cached data? Return a degraded response? Show an error?

Real-world example: Netflix built Hystrix (now in maintenance mode, replaced by Resilience4j) specifically for this pattern. Their system has thousands of services, and circuit breakers prevent cascading failures across the fleet. When the recommendation service is down, you still see Netflix - just without personalized recommendations.

13. Blob/Object Storage

Blob (Binary Large Object) storage is purpose-built for storing unstructured data - images, videos, documents, backups, logs. It's not a database; it's a flat namespace of objects, each identified by a key.

When to use it: Any time you need to store files. Don't put images in your database. Don't store log files on your application servers. Blob storage is cheaper, scales infinitely, and is built for this exact purpose.

Key properties:

  • Virtually unlimited capacity
  • Built-in redundancy (data is replicated across zones/regions)
  • Tiered storage (hot, warm, cold) for cost optimization
  • HTTP-accessible (direct URLs for content serving)

Tradeoffs: Object storage has higher latency than local disk or in-memory storage. It's eventually consistent in some implementations (though S3 is now strongly consistent for reads after writes). You can't modify objects in place - it's always a full overwrite. And costs can add up with high request volumes on small objects.

Real-world example: Dropbox stores over 500 petabytes of user files on their own custom object storage (they migrated off S3 to save costs). Instagram stores every photo and video in S3. Airbnb uses S3 for listing photos and has tiered storage policies to manage costs.

14. Leader Election

Leader election is a mechanism for a group of nodes to agree on a single "leader" that coordinates work. Only one node handles a particular responsibility at a time, and if it fails, the remaining nodes elect a new leader.

When to use it: When exactly one node needs to perform a task at any given time - cron jobs, write coordination, partition assignment, or any singleton responsibility in a distributed system. Without leader election, you'd either have no node doing the work (after a failure) or multiple nodes doing it (causing conflicts).

Common approaches:

  • Consensus algorithms (Raft, Paxos) - nodes vote and agree on a leader. Correct but complex.
  • Lease-based - a node acquires a time-limited lease (e.g., in ZooKeeper or etcd). If it doesn't renew, another node takes over.
  • Database lock - grab a row-level lock. Whoever gets it is the leader. Simple but depends on the DB.
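The lease-based approach is the easiest to reason about. Here's a sketch with an in-memory stand-in for the coordination service (in reality the compare-and-set would happen inside ZooKeeper, etcd, or a database row):

```python
import time

class LeaseStore:
    """In-memory stand-in for a single lease slot in ZooKeeper/etcd."""
    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id: str, ttl: float) -> bool:
        now = time.monotonic()
        # The lease is free if unheld, expired, or already ours (renewal).
        if self.holder is None or now >= self.expires_at or self.holder == node_id:
            self.holder = node_id
            self.expires_at = now + ttl
            return True
        return False
```

Each candidate node calls `try_acquire` on a timer; whoever holds the lease is the leader, and if it stops renewing, the lease expires and someone else takes over. The TTL is the tradeoff dial: short leases mean fast failover but more renewal traffic.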

Tradeoffs: Leader election introduces a single point of responsibility (though not a single point of failure if election works correctly). There's always a brief period during election where no leader exists - your system needs to handle that gracefully. Split-brain scenarios (two nodes both think they're the leader) are the nightmare case and need careful prevention.

Real-world example: Kafka uses ZooKeeper (and now its own Raft-based protocol, KRaft) for leader election of partition leaders. Elasticsearch elects a master node that manages cluster state. Google's Chubby lock service, which uses Paxos, is the foundation for leader election across many Google systems.

15. Write-Ahead Log (WAL)

A write-ahead log records every change as an append-only log entry before applying it to the actual data store. If the system crashes mid-operation, the WAL lets you replay or undo incomplete changes during recovery.

When to use it: In any system where durability matters - databases, message queues, distributed storage. It's the foundation of crash recovery. If you're designing a system that stores data and someone asks "what happens if the server crashes mid-write?", WAL is your answer.

How it works:

  1. Write the operation to the append-only log (sequential I/O, very fast)
  2. Acknowledge the write to the client
  3. Apply the change to the actual data structure (which might involve random I/O)
  4. Periodically checkpoint - flush the data structure to disk and truncate old log entries
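Steps 1–3 can be demonstrated with a toy key-value store (checkpointing is omitted to keep the sketch short):

```python
import json
import os

class TinyKV:
    """Toy key-value store that logs every write before applying it."""
    def __init__(self, wal_path: str):
        self.wal_path = wal_path
        self.data = {}
        self._replay()  # crash recovery: rebuild state from the log

    def _replay(self) -> None:
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as f:
            for line in f:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def put(self, key: str, value) -> None:
        # 1. Append to the log and fsync (sequential I/O, durable).
        with open(self.wal_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2-3. Only after the log entry is durable, apply the change.
        self.data[key] = value
```

Kill the process after `put` returns and restart it: `_replay` reconstructs the exact same in-memory state from the log. That replay-on-startup behavior is the whole point of the pattern.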

Tradeoffs: WAL adds a write to every operation (though sequential I/O is fast, so the overhead is small). Logs can grow large if checkpointing is infrequent. Recovery time depends on how much of the log needs to be replayed.

Real-world example: PostgreSQL's WAL is central to its durability and replication story - streaming replication literally ships WAL entries to replicas. Cassandra uses a commit log (same concept). Redis has its AOF (Append Only File) persistence mode, which is essentially a WAL.

Quick Reference Table

| Pattern | When to Use | Watch Out For |
| --- | --- | --- |
| Load Balancing | Multiple servers, any scale | Single point of failure, session stickiness |
| Caching | Read-heavy workloads, expensive queries | Cache invalidation, thundering herd |
| Database Sharding | Data too large for one DB, high write throughput | Cross-shard queries, rebalancing, wrong shard key |
| Database Replication | Read-heavy workload, high availability | Replication lag, read-after-write consistency |
| Message Queues | Async processing, decoupling services | Eventual consistency, idempotency, monitoring |
| CDN | Static content, global users | Cache invalidation, cost at scale |
| API Gateway | Multiple backend services, cross-cutting concerns | Single point of failure, coordination bottleneck |
| Rate Limiting | Public APIs, resource protection | Too aggressive limits, distributed consistency |
| Consistent Hashing | Dynamic node membership, distributed caches | Uneven distribution without virtual nodes |
| Microservices | Large teams, different scaling needs | Operational complexity, distributed transactions |
| CQRS | Very different read/write patterns | Eventual consistency, added complexity |
| Circuit Breaker | Service-to-service calls | Tuning thresholds, fallback behavior |
| Blob/Object Storage | Files, images, videos, backups | Latency, no in-place modification |
| Leader Election | Singleton tasks, write coordination | Split-brain, brief leaderless periods |
| Write-Ahead Log | Durability, crash recovery | Log growth, recovery time |

Interview Tips for System Design

Knowing the patterns is only half the battle. Here's how to use them effectively in an actual interview.

Don't just name-drop patterns. Saying "we'll add a cache" isn't impressive. Saying "reads will outnumber writes roughly 100:1 here, so I'd add a Redis cache with a 5-minute TTL in front of the database, using cache-aside so we don't pay the write-through penalty on every update" - that's impressive.

Always explain the tradeoff. Every pattern introduces complexity. Interviewers want to hear you acknowledge that. "Sharding gives us horizontal write scalability, but we lose the ability to do cross-user queries efficiently, so we'd need a separate analytics pipeline for that."

Start simple and evolve. Don't open with "we'll need 47 microservices and a Kafka cluster." Start with the simplest architecture that works, then evolve it as you identify bottlenecks. This mirrors how real systems are built and shows mature engineering judgment.

Know which patterns pair together. Load balancer + read replicas for read scaling. Message queue + circuit breaker for resilient async processing. CDN + blob storage for media serving. CQRS + message queue for event-driven architectures. Showing that you understand how patterns compose is a strong signal.

Quantify when you can. "If we have 10 million daily active users and each user makes 20 requests per day, that's roughly 2,300 requests per second. A single server can handle maybe 1,000 RPS, so we need at least 3 servers behind a load balancer." Back-of-the-envelope math shows you can reason about scale, not just talk about it abstractly.
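That estimate checks out, and it's worth being able to redo it on the spot (the 1,000 RPS per-server figure is the assumption doing the heavy lifting here):

```python
import math

daily_users = 10_000_000
requests_per_user = 20
seconds_per_day = 86_400

avg_rps = daily_users * requests_per_user / seconds_per_day  # ~2315 RPS

server_capacity_rps = 1_000  # assumed per-server throughput
servers_needed = math.ceil(avg_rps / server_capacity_rps)    # 3 servers
```

In an interview, also mention that traffic isn't uniform: peak load is often 2-3x the daily average, so you'd provision above the 3-server floor.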

Have a go-to design for common systems. URL shortener, chat system, news feed, notification service - these come up repeatedly. Practice your designs so the patterns flow naturally instead of feeling forced.

These 15 patterns won't cover every possible interview question. But they'll cover the core of almost every design. Learn them deeply enough that you're not just reciting definitions - you're making real engineering decisions about when to apply each one and what the consequences are.

That's the difference between someone who read a system design blog and someone who can actually design systems.