System design is the round that separates a strong senior candidate from a competent mid-level one. Coding rounds have a right answer; system design does not. It rewards judgment, the ability to reason about trade-offs under ambiguity, and the maturity to say "it depends, and here is what it depends on." That is exactly the skill that scales with seniority, which is why staff and principal loops weight it heavily.
This guide covers the 30 questions that show up most in 2026 loops: a mix of concept questions you should be able to answer in two minutes and classic design prompts that take 35 to 45 minutes to work through. Pair it with our system design walkthroughs for full worked examples, and rehearse out loud against our AI mock interviews, which now include a dedicated System Design mode that drives you through requirements, architecture, scaling, and trade-offs the way a real interviewer does.
The Framework: How to Answer Any System Design Question
Before the questions, internalize the pipeline. Most candidates fail not on knowledge but on structure - they jump straight to drawing boxes. Drive this flow yourself and you control the interview:
- Clarify requirements and scope. Separate functional requirements (what the system does) from non-functional ones (scale, latency, availability, consistency, durability). Pin down read/write ratio, expected DAU, and what is explicitly out of scope. Ten minutes here saves you from designing the wrong system.
- Back-of-envelope capacity estimation. Translate users into numbers: QPS (peak and average), storage per year, bandwidth, memory for a working-set cache. You do not need precision; you need the right order of magnitude so your component choices are justified.
- API design. A handful of endpoints (or a few queue contracts). This forces clarity about what crosses the network and anchors the rest of the conversation.
- High-level architecture. Draw the major components: clients, load balancer, application/service tier, datastores, caches, queues, and any offline/batch path. Keep it coarse first.
- Data model. Pick storage per access pattern, not per fashion. Define the key entities, primary keys, and the access patterns that drive your indexing and partitioning choices.
- Deep dive. The interviewer steers you into one or two components. This is where seniority shows: handle hot keys, consistency, failure modes, and back-pressure.
- Bottlenecks and trade-offs. Name what breaks first at the next order of magnitude and what you would do about it. Always close by stating the trade-offs you made and why.
A good rule: spend the first third on requirements and estimation, the middle third on architecture and data, and the final third on a deep dive plus trade-offs.
Part 1: Core Concept Questions
These are the building blocks. Interviewers use them as warm-ups or as probes during a deep dive. Crisp, trade-off-aware answers here buy you credibility for the design portion.
1. What is load balancing, and what algorithms would you choose?
A load balancer distributes traffic across a pool of servers to maximize throughput, minimize latency, and avoid overloading any one node. Layer 4 (transport) load balancers route on IP and port and are fast and cheap; Layer 7 (application) load balancers inspect HTTP and can route by path, header, or cookie at the cost of more work per request.
Algorithms and their trade-offs:
- Round robin - simple, ignores load; fine when requests are uniform.
- Least connections - better for long-lived or uneven requests.
- Weighted - respects heterogeneous instance sizes.
- Consistent hashing / IP hash - gives session stickiness and is essential for sharded caches.
Mention health checks, connection draining for deploys, and that the LB itself must be redundant (anycast or multiple LBs behind DNS) so it is not a single point of failure.
2. Explain caching strategies and when each applies.
Caching trades freshness for speed. The pattern matters as much as the cache itself:
- Cache-aside (lazy loading) - app reads cache, on miss reads the DB and populates the cache. Most common; resilient because a cold cache still serves from the DB. Risk: stale data and the thundering-herd problem on a popular key expiring.
- Read-through - the cache library handles misses transparently.
- Write-through - write to cache and DB synchronously; the cache never lags the DB, at the cost of higher write latency.
- Write-back (write-behind) - write to cache, flush to DB asynchronously; fast writes but risk of data loss on cache failure.
Always discuss eviction (LRU, LFU, TTL) and cache invalidation, which is genuinely hard. For hot keys, mention request coalescing and short jittered TTLs to avoid synchronized expiry stampedes.
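As a rough illustration of cache-aside with jittered TTLs, here is a minimal sketch. The names `CacheAside` and `load_from_db` are hypothetical, an in-process dict stands in for a real cache such as Redis, and request coalescing is left out to keep it short.

```python
import random
import time


class CacheAside:
    """Cache-aside reads with jittered TTLs (a dict stands in for the cache)."""

    def __init__(self, load_from_db, base_ttl=60.0, jitter=0.2, clock=time.monotonic):
        self._load = load_from_db   # hypothetical loader: key -> value
        self._base_ttl = base_ttl   # seconds
        self._jitter = jitter       # +/- fraction applied to every TTL
        self._clock = clock
        self._store = {}            # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]         # hit: serve from the cache
        value = self._load(key)     # miss or expired: fall through to the DB
        # Spread expiry times so popular keys do not all expire together.
        ttl = self._base_ttl * (1 + random.uniform(-self._jitter, self._jitter))
        self._store[key] = (value, self._clock() + ttl)
        return value

    def invalidate(self, key):
        # Call after writing to the DB so the next read repopulates the cache.
        self._store.pop(key, None)
```

The injectable `clock` is only there so expiry can be exercised without sleeping.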
3. What is a CDN and when do you need one?
A Content Delivery Network is a globally distributed set of edge caches that serve static (and increasingly dynamic) content close to users. It cuts latency, offloads origin bandwidth, and absorbs traffic spikes. Use it for images, video, JS/CSS bundles, and cacheable API responses. Key levers: TTL and cache-control headers, cache keys, origin shielding to protect the origin, and cache invalidation/purge on deploy. In 2026, edge compute (running logic at the CDN node) blurs the line between CDN and application tier for personalization and A/B routing.
4. SQL vs NoSQL - how do you choose?
Choose by access pattern and consistency needs, not by hype:
- Relational (SQL) - strong consistency, ACID transactions, flexible ad-hoc queries and joins, mature tooling. Best when relationships matter and you need transactional integrity (payments, inventory). Scales vertically first, then via read replicas and sharding (more operational effort).
- NoSQL - a family, not one thing. Key-value (DynamoDB, Redis) for simple lookups at massive scale; document (MongoDB) for flexible schemas; wide-column (Cassandra) for write-heavy time-series; graph (Neo4j) for relationship traversal.
The honest senior answer: most systems are polyglot. Use a relational store as the source of truth and add specialized stores (search, cache, analytics, vector) for specific access patterns.
5. Explain database replication.
Replication keeps copies of data on multiple nodes for availability and read scaling.
- Leader-follower (primary-replica) - writes go to the leader, reads can fan out to followers. Asynchronous replication is fast but allows replication lag and stale reads; synchronous is consistent but slower and less available.
- Multi-leader - accepts writes in multiple regions; powerful for geo-distribution but introduces write conflicts you must resolve (last-write-wins, CRDTs, app logic).
- Leaderless (quorum) - Dynamo-style; reads and writes hit a quorum (R + W > N) for tunable consistency.
Discuss failover (promoting a follower), the risk of split-brain, and read-your-own-writes consistency for the user who just posted.
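The quorum condition can be checked by brute force for small clusters. This sketch (the function name `quorums_overlap` is made up for illustration) enumerates every possible read set and write set and confirms they always share a replica exactly when R + W > N.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read set of size r shares a replica with every write set of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


if __name__ == "__main__":
    for n, r, w in [(3, 2, 2), (3, 1, 1), (5, 2, 3)]:
        print(n, r, w, quorums_overlap(n, r, w))
```

The overlap is what guarantees that a read touches at least one replica holding the latest acknowledged write.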
6. Explain database sharding (partitioning).
Sharding splits data horizontally across nodes so no single machine holds everything. Strategies:
- Range-based - simple, supports range scans, but prone to hotspots (think sequential IDs or timestamps).
- Hash-based - even distribution, but kills range queries and makes resharding painful.
- Consistent hashing - minimizes data movement when nodes are added or removed.
- Directory/lookup - a routing table maps keys to shards; flexible but adds a lookup dependency.
The hard parts are choosing a shard key with even distribution and low cross-shard query needs, handling celebrity/hot shards, and resharding. Cross-shard joins and transactions are the tax you pay.
7. Explain the CAP theorem and consistency models.
Under a network Partition, you must choose between Consistency (every read sees the latest write) and Availability (every request gets a non-error response). You cannot have both during a partition. In practice partitions are rare, so the real design knob is the latency-vs-consistency trade-off when the system is healthy (this is the PACELC extension).
Consistency models to name, strongest to weakest: strict/linearizable, sequential, causal, read-your-writes, and eventual. Most large-scale systems pick eventual or causal consistency for availability and bolt on stronger guarantees only where they matter (a checkout, not a like count). The skill is knowing which data needs which model.
8. What are message queues and when do you use async processing?
A message queue decouples producers from consumers, smooths traffic spikes, and enables retries and back-pressure. Use it whenever work can be done out of band: sending email, encoding video, fan-out to followers, analytics.
- Queue (SQS, RabbitMQ) - work distribution; a message is consumed once.
- Log/stream (Kafka, Kinesis) - ordered, replayable, multiple independent consumers; the backbone of event-driven systems.
Discuss delivery semantics (at-least-once is the default, so consumers must be idempotent), ordering guarantees, dead-letter queues for poison messages, and consumer lag as a health metric.
9. How do you design a rate limiter (the algorithm)?
The classic algorithms, from coarse to smooth:
- Fixed window - count per time bucket; simple, but a burst straddling a window boundary can let through up to twice the limit.
- Sliding window log - exact but memory-heavy.
- Sliding window counter - a good approximation that smooths edges cheaply.
- Token bucket - allows controlled bursts, refills at a steady rate; the most common production choice.
- Leaky bucket - enforces a smooth constant outflow.
For distributed enforcement, keep counters in a shared store (Redis) with atomic operations or Lua scripts, and decide fail-open (availability) vs fail-closed (protection) when the store is down. Mention returning 429 with a Retry-After header.
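A single-process token bucket is small enough to sketch in full. This is an illustrative version only: the class name and parameters are assumptions, and a distributed limiter would keep the same state in a shared store and update it atomically.

```python
import time


class TokenBucket:
    """Single-process token bucket; not thread-safe."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock
        self._tokens = capacity     # start full
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # the caller would answer 429 with a Retry-After header
```

With `rate=2` and `capacity=3`, a client can burst three requests immediately and then sustain two per second.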
10. What is idempotency and why does it matter?
An idempotent operation produces the same result whether applied once or many times. It is essential because networks retry: a client that times out will resend, and at-least-once queues will redeliver. Implement it with an idempotency key the client supplies; the server records the key with the result and returns the stored result on a duplicate. This is the standard pattern for payments and any "create" operation. GET, PUT, and DELETE are defined as idempotent by HTTP semantics; POST is not, which is why it needs the key.
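The key-and-stored-result pattern might look like the following sketch. `OrderService` and `create_order` are hypothetical names, and the in-memory dict stands in for what would normally be a durable table with a unique constraint on the key, so that the check and the write are atomic.

```python
class OrderService:
    """Create endpoint made safe to retry with a client-supplied idempotency key."""

    def __init__(self):
        self._responses = {}   # idempotency key -> response already returned
        self.orders = []       # side effects actually performed

    def create_order(self, idempotency_key: str, item: str, quantity: int) -> dict:
        stored = self._responses.get(idempotency_key)
        if stored is not None:
            return stored      # duplicate: replay the stored result, do no new work
        self.orders.append((item, quantity))
        response = {"order_id": len(self.orders), "item": item, "quantity": quantity}
        self._responses[idempotency_key] = response
        return response
```

A retried request with the same key gets the original response back and creates no second order.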
11. Explain consistent hashing.
Consistent hashing maps both keys and nodes onto a hash ring; a key belongs to the next node clockwise. Adding or removing a node only relocates keys in one segment instead of remapping everything, which is what plain modulo hashing does. Use virtual nodes (each physical node placed at many ring positions) to even out distribution and smooth the impact of node changes. It is the foundation of distributed caches, Dynamo-style datastores, and any sharded system that needs cheap rebalancing.
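A hash ring with virtual nodes fits in a few dozen lines. The sketch below is one possible shape, not a reference implementation: `HashRing`, the `node#i` naming for virtual nodes, and the choice of MD5 as the hash are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Any well-distributed hash works; MD5 is used here only for spread.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> physical node
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Place the node at many ring positions to even out its share.
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

Adding a node to this ring only moves keys onto the new node; every other key keeps its owner, which is the property that makes rebalancing cheap.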
12. How does database indexing work, and what are the trade-offs?
An index is a separate data structure (usually a B-tree, or an LSM tree in write-heavy stores) that lets the engine find rows without a full scan. Indexes speed reads but slow writes and consume storage, because every write must update every relevant index. Cover the difference between a clustered index (defines physical row order, one per table) and secondary indexes, composite indexes and column order (leftmost-prefix rule), and when a full scan is actually cheaper. In NoSQL, "indexing" often means modeling a secondary access pattern as another item/partition, since you design tables around queries.
13. How do you do back-of-envelope estimation?
Memorize a few anchors and multiply. Useful constants: roughly 100K seconds in a day (86,400 exactly); 1 million writes/day is about 12 writes/second on average, but assume peak is 2 to 5 times average. Latency ladder: memory access is nanoseconds, SSD reads are tens of microseconds, a cross-country round trip is tens of milliseconds. Walk through: DAU times actions per user gives daily requests; divide by 100K for average QPS; multiply by a peak factor; multiply row size by daily volume by retention for storage. The point is justified component choices, not exact numbers - state your assumptions out loud.
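The walk-through above is plain arithmetic, so it can be written down as a small function. The function name, parameters, and default peak factor of 3 are illustrative assumptions, and the storage figure assumes every request writes one row.

```python
SECONDS_PER_DAY = 86_400  # rounded to ~100K for mental math


def estimate(dau, actions_per_user, bytes_per_row, retention_days, peak_factor=3):
    """Order-of-magnitude capacity numbers from a handful of assumptions."""
    daily_requests = dau * actions_per_user
    avg_qps = daily_requests / SECONDS_PER_DAY
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        # Assumes one stored row per request, kept for the full retention period.
        "storage_bytes": daily_requests * bytes_per_row * retention_days,
    }


if __name__ == "__main__":
    print(estimate(dau=1_000_000, actions_per_user=1, bytes_per_row=1_000, retention_days=365))
```

For 1 million writes a day at 1 KB each, this gives roughly 12 writes/second on average and 365 GB per year of retention.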
14. How do you keep a service highly available?
Availability is a product of eliminating single points of failure and degrading gracefully. Techniques: redundancy at every tier (multiple AZs, multi-region for the most critical paths), health checks with automated failover, load shedding and circuit breakers to contain cascading failures, retries with exponential backoff and jitter, bulkheads to isolate failure domains, and graceful degradation (serve stale cache or a reduced feature set rather than erroring). Talk about it in nines: 99.9% is about 8.7 hours of downtime per year, 99.99% about 52 minutes. Each nine costs real money, so tie the target to the business.
15. What is the difference between latency and throughput, and how do you measure them?
Latency is how long one request takes; throughput is how many requests per second the system handles. They are related but not the same - batching can raise throughput while hurting latency. Always measure latency in percentiles (p50, p95, p99, p99.9), never averages, because tail latency is what users feel and what fans out in microservice chains. Mention the tail-at-scale problem: a request that fans out to 100 services, each with a 1% chance of a slow response, hits at least one slow call about 63% of the time (1 - 0.99^100), which motivates hedged requests and timeouts.
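The tail-at-scale arithmetic is worth being able to reproduce on a whiteboard. This sketch assumes each downstream call is slow independently of the others, which real systems only approximate; the function name is made up for illustration.

```python
def p_any_slow(fanout: int, slow_fraction: float) -> float:
    """Probability that at least one of `fanout` independent calls lands in the slow tail."""
    return 1 - (1 - slow_fraction) ** fanout


if __name__ == "__main__":
    for fanout in (1, 10, 100):
        print(fanout, round(p_any_slow(fanout, 0.01), 3))
```

At a fan-out of 10 the chance is already close to one in ten, which is why a p99 target per service is much weaker than it sounds at the request level.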
Part 2: Classic Design Prompts
These are full 35-to-45-minute prompts. For each, the model answer hits requirements, key components, data model, and the dominant scaling trade-off. Practice driving the framework on each one against our AI mock interviews in System Design mode.
16. Design a URL shortener (TinyURL / bit.ly)
- Requirements: shorten a long URL to a short code, redirect on lookup, optional custom alias and expiry, click analytics. Heavily read-skewed (think 100:1 reads to writes).
- Estimation: hundreds of writes/second, tens of thousands of reads/second; billions of URLs over years drives the key length (7 base-62 characters cover about 3.5 trillion codes).
- Key generation: either a base-62 encoding of a globally unique counter/ID (no collisions, but predictable and sequential) or a hash of the URL truncated with collision retry. A pre-generated key pool (a separate "key generation service") removes write-path contention.
- Data model: a key-value store mapping short code to long URL plus metadata. Reads are a single point lookup, perfect for DynamoDB/Redis.
- Scaling: cache hot redirects aggressively (a small fraction of links get most traffic), use a CDN/edge for redirects, and shard by short code. Trade-off: 301 (permanent, cacheable, loses analytics) vs 302 (temporary, every click hits you, full analytics).
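For the counter-based option, the base-62 step can be sketched as below. The alphabet order is an arbitrary choice made for this example; any fixed ordering of the 62 characters works as long as encode and decode agree.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def encode(n: int) -> str:
    """Turn a non-negative counter value into a short code."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode(code: str) -> int:
    """Inverse of encode; raises ValueError on characters outside the alphabet."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Because the codes follow the counter, they are sequential and guessable, which is the predictability trade-off noted in the key generation bullet.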
17. Design a news feed (Twitter/Instagram timeline)
- Requirements: users follow others; a home timeline shows recent posts from followees, ranked. Read-heavy, latency-sensitive, with extreme fan-out from celebrities.
- Core trade-off - fan-out on write vs read:
  - Fan-out on write (push): when you post, write the post ID into every follower's precomputed feed cache. Fast reads, but a celebrity with 50M followers causes a write storm.
  - Fan-out on read (pull): build the feed at read time by querying followees. Cheap writes, expensive reads.
  - Hybrid (the real answer): push for normal users, pull for celebrities, merge at read time. This is what production systems do.
- Components: post service, fan-out service (async via a queue), feed cache (Redis lists per user), and a ranking service.
- Data model: posts table, follower graph, per-user materialized feed. Mention that ranking now uses ML features, and trim feeds to a bounded length.
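The hybrid approach can be sketched with in-memory dicts standing in for the follower graph and feed cache. Everything named here is hypothetical, the celebrity threshold is tiny so the example is easy to follow, and post IDs are assumed to increase over time so that sorting by ID means sorting by recency.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3   # hypothetical cutoff; real systems use a far larger number
FEED_LIMIT = 100          # bounded feed length

followers = defaultdict(set)         # author -> users following them
following = defaultdict(set)         # user -> authors they follow
posts_by_author = defaultdict(list)  # author -> post IDs, oldest first
feeds = defaultdict(list)            # user -> pushed post IDs, newest first


def is_celebrity(author) -> bool:
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def follow(user, author) -> None:
    followers[author].add(user)
    following[user].add(author)


def publish(author, post_id) -> None:
    posts_by_author[author].append(post_id)
    if is_celebrity(author):
        return  # pull path: followers fetch these posts at read time
    for user in followers[author]:
        feeds[user].insert(0, post_id)   # push path: fan out on write
        del feeds[user][FEED_LIMIT:]     # trim to a bounded length


def timeline(user) -> list:
    # Merge the precomputed feed with posts pulled from followed celebrities.
    pulled = {
        post_id
        for author in following[user]
        if is_celebrity(author)
        for post_id in posts_by_author[author][-FEED_LIMIT:]
    }
    return sorted(pulled | set(feeds[user]), reverse=True)[:FEED_LIMIT]
```

Ranking is left out; a real feed would score the merged candidates rather than sort by ID.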
18. Design a distributed rate limiter (the service)
Building on question 9: a multi-region service that protects APIs.
- Requirements: limit per user/API key/IP, low added latency, accurate enough, highly available, configurable rules.
- Architecture: a rate-limiting middleware or sidecar at the gateway checks a shared Redis cluster using token bucket via atomic Lua scripts. Counters are sharded by limit key.
- Trade-offs: a centralized store is accurate but adds a network hop and is a dependency; local per-node counters with periodic sync are faster but allow overage. Decide fail-open vs fail-closed. For global limits across regions, accept eventual consistency or pin a key's limit to one region.
19. Design a chat / messaging system (WhatsApp / Slack)
- Requirements: 1:1 and group messaging, online presence, delivery/read receipts, message history, offline delivery, ordering within a conversation.
- Connection layer: persistent WebSocket connections; a connection/session service maps user to the gateway node holding their socket (so you can route an inbound message to the right node). Use a pub/sub bus (Redis, Kafka) between gateway nodes.
- Data model: messages partitioned by conversation ID and ordered by a sequence number or snowflake ID (do not trust client clocks). Store recent messages hot, archive cold history.
- Delivery: persist first, then push; if the recipient is offline, queue and deliver on reconnect, with client acks driving the sent/delivered/read state machine.
- Scaling: millions of concurrent sockets means many gateway nodes, each holding only its own connections (routing state lives in the session service); group fan-out is the hot spot, so cap group size or fan out asynchronously.
20. Design a notification system
- Requirements: send notifications over push, SMS, and email; support transactional and bulk; respect user preferences and quiet hours; deduplicate; handle provider failures.
- Architecture: a notification API writes to a queue; per-channel workers (APNs/FCM, an SMS provider, SES) consume and call the third party. A template service renders content; a preferences service gates each send.
- Reliability: idempotency keys to avoid duplicate sends on retry, dead-letter queues, rate limiting per provider, and provider failover. Mention digesting/batching to avoid notification fatigue, and an audit log for "did we send it".
- Scaling trade-off: at-least-once delivery plus dedupe is simpler and safer than chasing exactly-once.
21. Design a ride-sharing service (Uber / Lyft)
- Requirements: riders request rides, match to nearby drivers, live location tracking, pricing, trip lifecycle.
- Geospatial core: drivers stream location updates frequently; index them with a geohash or QuadTree/H3 grid so "find drivers near a point" is a cheap cell lookup rather than a scan. Keep the live location index in memory.
- Matching: a dispatch service queries the spatial index for nearby available drivers and runs a matching/assignment algorithm; handle the race where one driver gets two offers with a lock/claim.
- Data model: ephemeral high-write location data (in-memory/Redis) separated from durable trip records (relational source of truth).
- Scaling: partition by geography (a city is a natural shard), and note surge pricing is a demand/supply signal computed per region. The dominant challenge is the firehose of location writes.
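The "cheap cell lookup" idea can be shown with a fixed-size latitude/longitude grid. This is a deliberate simplification, not geohash or H3: the cell size, class name, and 3x3 neighbor search are assumptions chosen to keep the sketch short.

```python
import math
from collections import defaultdict

CELL_DEG = 0.01  # hypothetical cell size; about 1.1 km of latitude


def cell_of(lat: float, lng: float) -> tuple:
    return (math.floor(lat / CELL_DEG), math.floor(lng / CELL_DEG))


class DriverIndex:
    """In-memory index from grid cell to the drivers currently inside it."""

    def __init__(self):
        self._cells = defaultdict(set)  # cell -> driver IDs
        self._where = {}                # driver ID -> current cell

    def update(self, driver: str, lat: float, lng: float) -> None:
        new = cell_of(lat, lng)
        old = self._where.get(driver)
        if old == new:
            return                      # most updates stay within one cell
        if old is not None:
            self._cells[old].discard(driver)
        self._cells[new].add(driver)
        self._where[driver] = new

    def nearby(self, lat: float, lng: float) -> set:
        # Check the rider's cell and its eight neighbors instead of scanning all drivers.
        cx, cy = cell_of(lat, lng)
        found = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                found |= self._cells.get((cx + dx, cy + dy), set())
        return found
```

A dispatch service would then rank the returned candidates by actual distance or ETA before making offers.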
22. Design a web crawler
- Requirements: crawl a large set of pages, extract links, store content, be polite (respect robots.txt and rate limits), avoid traps and duplicates, stay fresh.
- Components: a URL frontier (priority + politeness queues), fetchers, a parser/extractor, a deduplication layer, and content storage.
- Key problems: deduplicate URLs (a Bloom filter for "seen" at scale) and near-duplicate content (SimHash/MinHash); enforce per-domain politeness so you do not hammer one host; detect crawler traps and infinite spaces.
- Scaling: distributed fetchers partitioned by domain, a durable frontier (often Kafka-backed), and recrawl scheduling weighted by how often a page changes. In 2026 this also feeds search indexes and RAG/LLM training corpora, which raises the bar on freshness and content cleaning.
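A Bloom filter for the "seen URL" check can be sketched as follows. The sizing defaults and the double-hashing scheme over a SHA-256 digest are choices made for this example; a production crawler would size the filter from its expected URL count and target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Set membership with no false negatives and a small false-positive rate."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self._size = size_bits
        self._k = num_hashes
        self._bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self._size for i in range(self._k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self._bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self._bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

A "not in the filter" answer is always correct, so the crawler can skip the expensive lookup for genuinely new URLs; an "in the filter" answer occasionally means a new URL is wrongly treated as seen.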
23. Design a distributed cache (like Redis/Memcached at scale)
- Requirements: low-latency key-value reads/writes, horizontal scale, high hit rate, fault tolerance.
- Sharding: consistent hashing with virtual nodes spreads keys and minimizes movement on membership changes.
- Eviction and memory: LRU/LFU within a node; size the cache to the working set, not the whole dataset.
- Availability: replicas per shard with leader-follower; decide whether a node failure means a cache miss (fall through to DB) or a failover. Beware the thundering herd when a hot key expires - use request coalescing and jittered TTLs.
- Trade-offs: client-side hashing (no proxy hop, but clients must know the topology) vs a proxy/coordinator layer (simpler clients, an extra hop). Write strategy ties back to question 2.
24. Design a typeahead / autocomplete system
- Requirements: suggest top completions for a prefix within tens of milliseconds, ranked by popularity, updated as trends shift.
- Core structure: a trie where each node caches its top-K completions so a lookup is a single traversal, not an aggregation. Shard the trie by prefix.
- Freshness: rankings come from query logs aggregated offline (a streaming/batch pipeline) and pushed to the serving tier periodically; you trade real-time accuracy for serving speed.
- Scaling: cache hot prefixes at the edge, debounce client requests, and cap suggestion length. The trade-off is update latency vs serving simplicity.
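The trie-with-cached-top-K idea might be sketched like this. The class names are hypothetical, and `build` is meant to be called once per offline refresh with aggregated query counts, matching the periodic push described above.

```python
class TrieNode:
    __slots__ = ("children", "top")

    def __init__(self):
        self.children = {}
        self.top = []   # (count, phrase) pairs, best first, at most K after build


class Typeahead:
    def __init__(self, k: int = 3):
        self._k = k
        self._root = TrieNode()

    def build(self, counts: dict) -> None:
        """Offline build: record every phrase at each of its prefix nodes, then trim."""
        for phrase, count in counts.items():
            node = self._root
            for ch in phrase:
                node = node.children.setdefault(ch, TrieNode())
                node.top.append((count, phrase))
        self._trim(self._root)

    def _trim(self, node: TrieNode) -> None:
        # Keep the K most popular completions; break ties alphabetically.
        node.top = sorted(node.top, key=lambda t: (-t[0], t[1]))[: self._k]
        for child in node.children.values():
            self._trim(child)

    def suggest(self, prefix: str) -> list:
        # Serving path: walk the prefix, return the precomputed list, no aggregation.
        node = self._root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [phrase for _, phrase in node.top]
```

All the ranking work happens in `build`, so `suggest` costs one dictionary hop per typed character.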
25. Design a key-value store (Dynamo-style)
- Requirements: get/put by key, always-writable, horizontally scalable, tunable consistency, fault tolerant across nodes.
- Partitioning: consistent hashing distributes keys; each key is replicated to N nodes.
- Consistency: quorum reads/writes (R + W > N) give tunable consistency; on conflicts use vector clocks or last-write-wins, and read repair plus anti-entropy (Merkle trees) to converge replicas.
- Availability: hinted handoff lets a healthy node temporarily accept writes for a down node. This is a leaderless design, so it favors availability and partition tolerance over strict consistency.
- Trade-off: the application must tolerate occasional conflicting versions, because conflict resolution is pushed up to the read path.
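Vector clocks are easier to reason about with a concrete comparison function. In this sketch a clock is a plain dict from node name to counter; the function names and the string return values are illustrative choices.

```python
def compare(a: dict, b: dict) -> str:
    """Order two vector clocks: 'before', 'after', 'equal', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"   # neither version descends from the other: a conflict
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


def merge(a: dict, b: dict) -> dict:
    """Clock for a version that reconciles both inputs (element-wise maximum)."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}
```

A "concurrent" result is the case where the store returns both versions and the reader, or last-write-wins, has to pick or merge.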
26. Design a video streaming platform (YouTube / Netflix)
- Requirements: upload, transcode, store, and stream video at scale; adaptive quality; global low-latency playback; recommendations.
- Ingest pipeline: upload to object storage, then an async transcoding pipeline (fan-out via queue) produces multiple bitrates and segments for adaptive streaming (HLS/DASH).
- Delivery: a CDN is the heart of the system - segments are cached at the edge close to viewers, which is what makes global streaming affordable. Origin storage (S3-class) holds the masters.
- Metadata: relational/NoSQL for video metadata, view counts (approximate, via stream aggregation), and the recommendation feature store.
- Trade-off: precompute and store every bitrate (more storage, instant playback) vs transcode on demand (less storage, higher first-play latency). At scale, precompute wins for popular content.
27. Design a payment / wallet system
- Requirements: move money correctly, exactly once, with a complete audit trail; no double charges; reconciliation.
- Correctness first: use a relational store with ACID transactions as the source of truth; model balances with double-entry ledger entries (every transaction is a balanced pair of debits and credits) rather than mutating a single balance field.
- Idempotency: every charge carries an idempotency key so retries do not double-charge (ties to question 10).
- Async + consistency: integrate external processors (Stripe) via webhooks; use the outbox pattern or a saga to keep your DB and the processor in sync without distributed transactions.
- Trade-off: strong consistency and durability dominate here - this is the one place you do not reach for eventual consistency. Reconciliation jobs catch any drift between your ledger and the processor.
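A double-entry ledger can be illustrated in memory, with the caveat that a real one lives in a relational store and writes both entries inside one ACID transaction. The `Ledger` class, integer cents, and the sign convention (debits negative, credits positive) are assumptions made for this sketch.

```python
class Ledger:
    """Append-only double-entry ledger; balances are derived, never stored."""

    def __init__(self):
        self.entries = []    # (txn_id, account, amount_cents)
        self._seen = set()   # transaction IDs already applied

    def transfer(self, txn_id: str, debit_account: str, credit_account: str,
                 amount_cents: int) -> bool:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        if txn_id in self._seen:
            return False     # retry of an applied transfer: do nothing
        self._seen.add(txn_id)
        # Every transaction is a balanced pair: the two amounts sum to zero.
        self.entries.append((txn_id, debit_account, -amount_cents))
        self.entries.append((txn_id, credit_account, amount_cents))
        return True

    def balance(self, account: str) -> int:
        return sum(amount for _, acct, amount in self.entries if acct == account)
```

Because entries are only ever appended, the list doubles as the audit trail, and the sum over all entries being zero is a cheap invariant for reconciliation jobs to check.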
28. Design a unique ID generator at scale
- Requirements: generate unique, roughly time-sortable, 64-bit IDs across many nodes without coordination on the hot path.
- Approaches:
  - Snowflake - pack a timestamp, a machine/worker ID, and a per-millisecond sequence number into 64 bits (a common layout is 41, 10, and 12 bits). Sortable by time, no central coordinator, the de facto standard.
  - UUID v4 - trivially distributed with a negligible collision probability, but 128 bits and not sortable (bad for index locality); UUID v7 fixes sortability.
  - DB ticket server / segment allocation - hand out ranges of IDs in batches to amortize coordination.
- Trade-off: sortability and compactness (Snowflake) vs zero coordination and simplicity (UUID). Watch for clock skew and the worker-ID assignment problem in Snowflake.
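A Snowflake-style generator might look like the sketch below. The 41/10/12 bit split follows the commonly cited layout, while the custom epoch, class name, and the decision to raise on a backwards clock are assumptions for this example; worker IDs are taken as given rather than assigned.

```python
import threading
import time

EPOCH_MS = 1_704_067_200_000   # 2024-01-01T00:00:00Z, an arbitrary custom epoch
WORKER_BITS = 10
SEQ_BITS = 12


class Snowflake:
    def __init__(self, worker_id: int, clock_ms=lambda: time.time_ns() // 1_000_000):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._worker = worker_id
        self._clock = clock_ms
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock skew: refuse rather than risk issuing a duplicate ID.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._seq = (self._seq + 1) & ((1 << SEQ_BITS) - 1)
                if self._seq == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._seq = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQ_BITS))
                | (self._worker << SEQ_BITS)
                | self._seq
            )
```

IDs from one generator are strictly increasing; IDs from different workers are only roughly time-ordered, to within clock skew between machines.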
29. Design a search / indexing system
- Requirements: full-text search over a large corpus with low latency, ranking by relevance, and near-real-time updates.
- Core: an inverted index (term to posting list of documents) is the heart; queries intersect posting lists and score with TF-IDF or BM25. Shard the index by document and replicate each shard for read scale and availability.
- Pipeline: an ingestion path tokenizes, normalizes, and writes to the index; a query path fans out to all shards, gathers top-K from each, and merges (scatter-gather).
- 2026 reality: pair the lexical inverted index with a vector index (HNSW/IVF over embeddings) for semantic search, and combine the two with hybrid ranking. This is now the backbone of RAG systems feeding LLMs.
- Trade-off: index freshness vs query performance, and recall vs latency in the vector tier.
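The lexical core can be illustrated with a toy inverted index. This sketch covers tokenizing and posting-list intersection only; scoring with TF-IDF or BM25, sharding, and the vector tier are left out, and the class name and tokenizer regex are assumptions.

```python
import re
from collections import defaultdict


class InvertedIndex:
    def __init__(self):
        self._postings = defaultdict(set)   # term -> IDs of documents containing it

    @staticmethod
    def _tokens(text: str) -> list:
        # Normalize to lowercase alphanumeric terms.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id: int, text: str) -> None:
        for term in self._tokens(text):
            self._postings[term].add(doc_id)

    def search(self, query: str) -> set:
        """Documents containing every query term (AND semantics, unranked)."""
        terms = self._tokens(query)
        if not terms:
            return set()
        # Intersect starting from the shortest posting list to keep the work small.
        lists = sorted((self._postings.get(t, set()) for t in terms), key=len)
        result = set(lists[0])
        for postings in lists[1:]:
            result &= postings
        return result
```

In a sharded deployment each shard would run this intersection over its own documents and return its top-K for the scatter-gather merge.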
30. Design an LLM inference / serving gateway
A modern prompt that reflects where infrastructure is going.
- Requirements: route prompts to model backends, stream tokens back, enforce per-user rate and cost limits, cache, and stay available under bursty load.
- Components: an API gateway with auth and rate limiting (questions 9 and 18), a router that picks a model/region, a request queue with back-pressure to protect scarce GPU capacity, and streaming responses over SSE/WebSocket.
- Cost and latency levers: a semantic/prompt cache for repeated queries, batching requests on the GPU to raise throughput (at some latency cost), token-level budgets per user, and graceful fallback to a cheaper/smaller model under load.
- Trade-offs: GPUs are the scarce, expensive resource, so the whole design is about maximizing utilization (batching, queuing) without blowing the latency budget. Observability on tokens, cost, and tail latency is non-negotiable.
How to Practice
Reading these is necessary but not sufficient. System design is a performance skill - it lives in your ability to drive the framework out loud, under questions, while drawing. To prepare effectively:
- Work full examples end to end in our system design walkthroughs, which carry capacity estimation, architecture, deep dives, and trade-offs for each topic.
- Rehearse against our AI mock interviews in System Design mode, which pushes you from requirements through scaling and trade-offs the way a real interviewer does, then gives feedback on what you skipped.
- Shore up the fundamentals these designs lean on with targeted practice questions across distributed systems, databases, and networking.
- If you are aiming at infrastructure or cloud roles specifically, follow the Cloud/Solutions Architect learning path to sequence the underlying concepts before the mock loops.
The candidates who do best are not the ones who memorize the most designs. They are the ones who can take a prompt they have never seen, clarify it, estimate it, and reason about trade-offs calmly. Build that muscle and the specific question stops mattering. Now go run one out loud.