Inside the Google System Design Interview Loop

The Google system design round is the single most important factor in whether you come in at L4 or L5 - and at L5 vs L6, it is usually decisive. Coding can get you to a pass; system design decides your level.

Most candidates over-prepare on coding and under-prepare on system design. They show up having watched ten YouTube mocks, drawn the same Twitter / Uber / WhatsApp diagrams, and assume they will be fine. They will not be fine. Google evaluators are pattern-matching against thousands of past loops and they can tell when someone is reciting versus thinking.

Here is what is actually happening in the room.

How the Loop Is Structured

Google's standard onsite is five rounds: two coding, one system design, one behavioral / Googleyness, and a fifth that varies (sometimes a second SD, sometimes coding, sometimes a domain-specific round).

The number of system design rounds depends on the level you are interviewing for:

L3 (new grad / early career): zero SD rounds. Pure coding + behavioral. If you got asked an SD question at L3, it was a fit thing, not a graded round.
L4 (mid-level, "SWE III"): one SD round, usually 45 minutes. Lower expectations - a simple distributed system (chat app, photo sharing, URL shortener) with reasonable depth in one or two areas.
L5 (Senior): one SD round, sometimes two. 45-60 minutes each. Significantly higher expectations - capacity estimation, multiple deep dives, real tradeoff articulation.
L6 (Staff): two SD rounds, often with a third "advanced" round added. One round usually focuses on a specific subsystem deep dive (e.g., "design the consistency layer for a multi-region database"). Architecture maturity is the bar.
L7+ (Senior Staff and above): three SD rounds, often domain-specific. Coding load drops to one round.

The system design interviewer will be a Senior+ engineer (L5 minimum, often L6+ for Staff loops). They have done dozens of these. They are not impressed by buzzwords.

The Format Inside the Room

Every Google SD round looks roughly the same:

Minutes 0-5: Prompt + clarifying questions.
The interviewer gives you a one-line prompt: "Design YouTube." "Design a distributed cache." "Design the system that serves Google Search autocomplete." Then they shut up. You drive.

Minutes 5-10: Requirements + capacity estimation.
You write functional requirements (what the system does), non-functional requirements (latency, availability, consistency), and rough numbers (DAU, QPS, data size, growth).

Minutes 10-25: High-level architecture.
You draw the boxes. Client, load balancer, service tier, storage, cache, queue. Show data flow for the main use cases.

Minutes 25-40: Deep dives.
The interviewer steers you into 1-3 specific areas. "Tell me more about how the write path works." "What happens when this region goes down?" "Walk me through the schema for X." This is where most of the signal is.

Minutes 40-45: Tradeoffs + wrap.
What did you optimize for? What did you give up? If you had another hour, what would you go deeper on?

If you are blowing past these stages, you are going too fast. If you have not gotten to deep dives by minute 30, you are going too slow.

What Evaluators Actually Score

Google's system design rubric is well-defined internally and reasonably consistent across the company. The categories scored, in roughly the order they come up in the round, with approximate weights:

1. Clarifying questions (15% of signal).
The single biggest separator between L4 and L5+. Strong candidates spend 3-5 minutes pinning down scope. Weak candidates start drawing boxes in minute one.

Good clarifying questions:

"Are we designing the read path, the write path, or both?"
"What is the read:write ratio? Is this a 100:1 read-heavy system or 1:1?"
"What is the latency target? P50 / P99?"
"Single region or multi-region? Active-active or failover?"
"What is the consistency requirement? Linearizable? Eventually consistent? Read-your-writes?"

These are not generic checkbox questions. They actually change your design. If the read:write is 1000:1, you cache aggressively; if it is 1:1, caching is mostly noise. The clarifying questions tell the interviewer that you understand which numbers drive which decisions.

2. Capacity estimation (15%).
Numbers, fast. Daily active users → QPS. QPS × payload size → bandwidth. Storage growth per year. Write amplification.

The trick: round to one significant figure, do the math out loud, and never get stuck. "1B users, 10% are active, 1% post a tweet a day - that is 1M writes a day, ~12 writes a second average, peak maybe 100." If your estimation is off by 2x, no one cares. If you cannot do the estimation at all, you are downleveled.
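The arithmetic in that example can be written down in a few lines of Python. This is a minimal sketch: the function name `estimate_write_qps` and the peak multiplier of 8 are assumptions, chosen only to land near the "peak maybe 100" figure above.

```python
SECONDS_PER_DAY = 86_400


def estimate_write_qps(total_users, active_fraction, posting_fraction,
                       posts_per_poster_per_day=1, peak_multiplier=8):
    """Back-of-envelope write load: returns (writes/day, avg QPS, peak QPS)."""
    daily_active = total_users * active_fraction
    writes_per_day = daily_active * posting_fraction * posts_per_poster_per_day
    avg_qps = writes_per_day / SECONDS_PER_DAY
    # The peak multiplier is an assumed fudge factor, not a measured value.
    peak_qps = avg_qps * peak_multiplier
    return writes_per_day, avg_qps, peak_qps


writes_per_day, avg_qps, peak_qps = estimate_write_qps(
    total_users=1_000_000_000, active_fraction=0.10, posting_fraction=0.01)
print(f"{writes_per_day:,.0f} writes/day, ~{avg_qps:.0f}/s avg, ~{peak_qps:.0f}/s peak")
```

In the room you do this in your head; the point of writing it out once is to see which inputs the result is sensitive to.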

3. API design (10%).
What endpoints does this system expose? What are the request / response payloads? Authentication? Rate limiting?

Most candidates skip this. Google interviewers care - especially for L5+ - because API design is a maturity signal. A clean API tells you the candidate has actually built systems other teams consume; a messy one tells you they have only built backend services that they call from their own frontend.

4. High-level architecture (15%).
The boxes. But the score is not "did you draw the right boxes" - it is "did you justify each box and show the data flow."

Drawing a load balancer in front of your service tier is table stakes; explaining why you put a load balancer there ("we need to terminate TLS, do round-robin distribution, and provide a single DNS entry for clients") is the actual signal.

5. Deep dives (30%).
The biggest single category. The interviewer picks 1-3 areas and you go deep.

Strong deep dives:

Specific data structures with their tradeoffs ("we use a B-tree index here for range queries; we accept the write amplification because reads dominate")
Concrete schemas with column types and partition keys
Specific failure modes and how the system handles them ("if the cache evicts the entry, we fall back to the DB and re-warm; if the DB primary fails, we promote a replica via the consensus layer with ~10s of unavailability for that shard")
Real numbers ("at 1M QPS with 1KB payloads, we need ~1GB/s of bandwidth per region; that is 8Gbps, which fits within the line rate of a single 25Gbps NIC")
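The bandwidth figure in the last item is a unit conversion worth being able to do on the spot. A minimal sketch, assuming decimal units (1 KB = 1,000 bytes) and ignoring protocol overhead; the helper name `required_gbps` is hypothetical.

```python
def required_gbps(qps, payload_bytes):
    """Payload bandwidth in gigabits per second (decimal units, no protocol overhead)."""
    bytes_per_second = qps * payload_bytes
    return bytes_per_second * 8 / 1e9


gbps = required_gbps(qps=1_000_000, payload_bytes=1_000)
nic_gbps = 25
print(f"{gbps:.0f} Gbps needed, {gbps / nic_gbps:.0%} of a {nic_gbps} Gbps NIC")
```

The bytes-to-bits factor of 8 is the step people most often drop. In a real deployment the traffic would be spread across many hosts, so the single-NIC comparison is only a sanity check on the order of magnitude.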

Weak deep dives:

Hand-waving ("we use Redis here for caching")
Naming systems instead of describing behavior ("we use Kafka for the queue")
Recursing into "and then we use [other system]" without ever describing the actual mechanics

6. Tradeoff articulation (15%).
"You picked X over Y - why?" The interviewer is checking that you understand the design space, not just the design.

Good tradeoff answers always name the alternative explicitly. "I picked eventual consistency over strong consistency because the read latency target is 50ms and a quorum round trip across three regions is 80-100ms, so a strongly consistent path cannot meet the target. The tradeoff is that two writes in the same second can be reconciled out of order - I am OK with this for likes / comments, but for ledger entries I would use the strong path."

Google-Specific Quirks

The internal infrastructure does not change the rubric, but it changes the vocabulary you should know.

Spanner. Google's globally-distributed, externally-consistent database. If you propose a "globally consistent SQL store with bounded staleness" you have just described Spanner. Naming it is not required and does not score points, but understanding why it exists - the TrueTime API, the cost of strong consistency at global scale - is L5+ material.

Bigtable / Colossus. Wide-column NoSQL on top of Google's distributed file system. Most read-heavy storage at Google sits on this. Knowing the consistency model (single-row atomic, no cross-row transactions) helps when you are drawing storage tiers.

Borg / Kubernetes. Borg is the cluster manager Kubernetes was modeled on. You will not be asked about Borg specifically, but talking about your service tier in terms of "task counts," "cells," and "auto-scaling on QPS or CPU" lands well.

Stubby / gRPC. Google's internal RPC system; gRPC is the open-source descendant. You should reflexively use RPC (not REST) for service-to-service calls and be able to articulate why - smaller payloads, streaming, better backwards compatibility, codegen across languages.

Pub/Sub vs Spanner change streams. Two ways to do async eventing inside Google. If you are designing something that needs reliable async fanout, knowing the tradeoff is a strong signal: Pub/Sub is cheap and at-least-once by default, so consumers must tolerate duplicates; Spanner change streams emit change records for committed transactions, so the event stream is tied to the write path and cannot diverge from it.

Pivoting to Google scale. When you are drawing a system, the implicit scale is Google's. A "small" service at Google is 10K QPS. A "large" service is 10M+. If you draw a single Postgres instance for the canonical write store, you have just told the interviewer you do not understand the scale they live at.

You do not need to use any of these names. The interviewer will not penalize you for saying "Cassandra" instead of "Bigtable." But the underlying patterns - global consistency, wide-column storage, RPC-based service mesh, fleet management - are how Google engineers actually think, and matching that mental model is most of why "former Googlers do well at Google interviews."

A Worked Example: "Design Google Photos"

Let me walk through one round end to end at the L5 level. The prompt: "Design a system like Google Photos that lets users upload photos, view them, and search them by content."

Clarifying questions (5 minutes)

Scope: upload, viewing, search. Anything else? (Sharing? Editing? Albums?) → Interviewer says: assume single-user library, no sharing, no editing. Search by content (faces, objects, text) is in scope.
Scale: how many users? → Interviewer says: assume 1B users, average 10K photos per user, 100M new uploads per day.
Latency: viewing? → P99 < 500ms for thumbnail, < 2s for full-res.
Consistency: when I upload, do I need to see it in my library immediately? → Yes, read-your-writes for the uploader. Other users do not apply (no sharing).

Capacity estimation (5 minutes)

1B users × 10K photos = 10^13 photos total. Average 3MB per photo (compressed JPEG) = 30 EB of original storage. With thumbnails (10% of original) and a high-res derivative (50% of original) = ~50 EB total.
Writes: 100M/day = ~1200/sec average, peak maybe 5000/sec.
Reads: assume each user opens the app twice a week and views 50 thumbnails per session = 1B × 2 × 50 = 100B thumbnail views/week = ~14B/day = ~165K thumbnail views/sec average, peak maybe 500K.
ML / search index: every upload triggers ~3 ML pipelines (face, object, OCR). At 1200 writes/sec × 3 = ~4000 ML inferences/sec.
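Estimates like these are easy to get wrong by a unit under time pressure, so it is worth checking the conversions once in code. A minimal sketch using the interviewer's inputs from the clarifying questions, assuming decimal units (1 MB = 10^6 bytes, 1 EB = 10^18 bytes); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
EB = 10**18

users = 1_000_000_000
photos_per_user = 10_000
avg_photo_bytes = 3 * 10**6          # 3 MB compressed JPEG
uploads_per_day = 100_000_000

# Storage: originals plus thumbnails (10%) and a high-res derivative (50%).
original_bytes = users * photos_per_user * avg_photo_bytes
total_bytes = original_bytes * (100 + 10 + 50) // 100

# Writes and ML load.
write_qps = uploads_per_day / SECONDS_PER_DAY
ml_inferences_per_sec = write_qps * 3  # face, object, OCR

# Reads: two sessions per user per week, 50 thumbnails per session.
thumb_views_per_week = users * 2 * 50
thumb_views_per_sec = thumb_views_per_week / (7 * SECONDS_PER_DAY)

print(f"originals: {original_bytes / EB:.0f} EB, total: {total_bytes / EB:.0f} EB")
print(f"writes: ~{write_qps:,.0f}/s, ML inferences: ~{ml_inferences_per_sec:,.0f}/s")
print(f"thumbnail views: ~{thumb_views_per_sec:,.0f}/s")
```

Integer arithmetic keeps the storage totals exact. The per-second rates are averages, so peak figures still need a multiplier on top.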

High-level architecture (15 minutes)

Draw out:

Mobile + web client
API gateway (TLS termination, auth, rate limit)
Upload service → object storage (originals) + thumbnail-generation queue
Thumbnail worker → object storage (thumbnails)
Metadata service → wide-column DB (photo metadata: user_id, photo_id, timestamp, location, file refs)
ML pipeline: queue → face / object / OCR workers → search index (per-user inverted index)
Search service: query → search index → metadata service → CDN URLs
CDN in front of object storage for serving
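If the interviewer asks for the metadata schema, one way to make it concrete is to write the record and its row key. The sketch below uses the fields listed for the metadata service; the row-key layout (user first, then a reversed timestamp so newest photos sort first), the `MAX_TS_MS` bound, and all names are assumptions for illustration, not a known Google Photos schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_TS_MS = 10**13  # assumed upper bound on millisecond timestamps, used to reverse sort order


@dataclass(frozen=True)
class PhotoMetadata:
    user_id: str
    photo_id: str
    taken_at_ms: int                          # capture time, milliseconds since epoch
    location: Optional[Tuple[float, float]]   # (lat, lng), if known
    original_ref: str                         # object-storage key for the original
    thumbnail_ref: Optional[str] = None       # filled in by the thumbnail worker

    def row_key(self) -> str:
        """Hypothetical row key: user first (partitioning), newest photos sort first."""
        reversed_ts = MAX_TS_MS - self.taken_at_ms
        return f"{self.user_id}#{reversed_ts:013d}#{self.photo_id}"


older = PhotoMetadata("u42", "p1", 1_700_000_000_000, None, "orig/u42/p1")
newer = PhotoMetadata("u42", "p2", 1_700_000_500_000, (37.42, -122.08), "orig/u42/p2")
print(sorted([older.row_key(), newer.row_key()]))
```

Leading with user_id keeps one user's library contiguous, so "show my latest photos" becomes a single prefix scan. The tradeoff is that a very large library concentrates load on one key range.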

Deep dives (15 minutes)

Interviewer steers to: "Tell me about the search index. How is it sharded?"

You go deep:

Per-user inverted index because users only search their own library - no cross-user ranking, so no global index and no cross-shard fan-out are required.
Index entries: (user_id, term, photo_id, score) where term is a face cluster ID, an object label, or a tokenized OCR span.
Sharded by user_id - all of one user's index entries land on one shard. Hot users (1M+ photos) get a dedicated shard.
Indexing is async via the ML pipeline queue. SLA: 5 min from upload to searchable. Acceptable because users are not actively searching their just-uploaded photo.
Storage: the index per user is small (~10MB for 10K photos with full ML annotations). Total: 1B × 10MB = 10 PB. Fits comfortably on wide-column storage.
Read path: query is parsed into terms, fanned to the user's shard, results merged in-memory, ranked by recency + score, top 100 returned.
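The sharding and read path described above can be sketched in memory. This is an illustration under stated assumptions, not a production design: the shard count, the hash-based routing, OR semantics across terms, and "recency first, score as tie-breaker" ranking are all choices made here for the example, and every name is hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 16  # assumed shard count, for illustration only


def shard_for(user_id: str) -> int:
    """Stable hash so every entry for one user lands on the same shard."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class IndexShard:
    def __init__(self):
        # (user_id, term) -> {photo_id: (score, uploaded_at)}
        self.postings = defaultdict(dict)

    def add(self, user_id, term, photo_id, score, uploaded_at):
        self.postings[(user_id, term)][photo_id] = (score, uploaded_at)

    def search(self, user_id, terms, limit=100):
        """Union the postings for each term, sum scores, rank by recency then score."""
        merged = {}
        for term in terms:
            for photo_id, (score, uploaded_at) in self.postings.get((user_id, term), {}).items():
                total, _ = merged.get(photo_id, (0.0, uploaded_at))
                merged[photo_id] = (total + score, uploaded_at)
        ranked = sorted(merged.items(), key=lambda kv: (kv[1][1], kv[1][0]), reverse=True)
        return [photo_id for photo_id, _ in ranked[:limit]]


shards = [IndexShard() for _ in range(NUM_SHARDS)]


def index_photo(user_id, photo_id, uploaded_at, annotations):
    shard = shards[shard_for(user_id)]
    for term, score in annotations.items():
        shard.add(user_id, term, photo_id, score, uploaded_at)


def search(user_id, terms):
    # Single-shard read: no cross-shard fan-out because the index is per-user.
    return shards[shard_for(user_id)].search(user_id, terms)


index_photo("alice", "p1", 100, {"label:dog": 0.9, "label:beach": 0.7})
index_photo("alice", "p2", 200, {"label:dog": 0.8})
index_photo("bob", "p3", 300, {"label:dog": 0.95})
print(search("alice", ["label:dog", "label:beach"]))
```

Because the user_id is part of every posting key, bob's photos never appear in alice's results even if both users hash to the same shard. A dedicated shard for hot users would replace the hash with a lookup table for those specific IDs.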

Then: "What happens if the ML pipeline falls behind?"

Queue depth metric drives autoscaling on the worker fleet.
If queue grows beyond 1 hr of work, page on-call.
Soft degradation: photos are still uploadable + viewable; only search is degraded (recent uploads will not appear in results).
If a single photo's ML inference fails 3x (transient hardware error, model OOM on a giant image), it goes to a DLQ and a metric fires. Photo remains viewable, just unsearchable.
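The retry-then-DLQ behavior in the last item is simple enough to sketch. The version below is a single-process stand-in, assuming an in-memory queue and a hypothetical `run_inference` callable; a real pipeline would add backoff between attempts and emit metrics to a monitoring system rather than a dict.

```python
from collections import deque

MAX_ATTEMPTS = 3


def drain(queue, dead_letters, run_inference, metrics):
    """Process jobs until the queue is empty; park a job in the DLQ after MAX_ATTEMPTS failures."""
    while queue:
        photo_id, attempts = queue.popleft()
        try:
            run_inference(photo_id)
            metrics["indexed"] = metrics.get("indexed", 0) + 1
        except Exception:
            attempts += 1
            if attempts >= MAX_ATTEMPTS:
                dead_letters.append(photo_id)
                metrics["dlq"] = metrics.get("dlq", 0) + 1
            else:
                queue.append((photo_id, attempts))  # retry later, behind other work


def flaky_inference(photo_id):
    # Stand-in for a real model call: one photo always fails.
    if photo_id == "huge-image":
        raise MemoryError("model OOM")


queue = deque([("p1", 0), ("huge-image", 0), ("p2", 0)])
dead_letters, metrics = [], {}
drain(queue, dead_letters, flaky_inference, metrics)
print(dead_letters, metrics)
```

Re-enqueueing at the back means one poisoned photo cannot block the photos behind it, which is what keeps the degradation limited to search for that single item.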

Tradeoffs (5 minutes)

"I picked async ML over sync because sync would push P99 upload latency from ~500ms to ~3-5s, which kills mobile UX. The cost is a 5-minute lag before search reflects new uploads, which I am OK with."
"I picked per-user index sharding over a global inverted index because the query pattern is single-tenant. A global index would require cross-shard fanout for every search and would not give us anything in return - users do not search other users' photos."
"If we added sharing, the per-user index breaks - I would need either a denormalized copy in the recipient's index, or a join at query time. I would lean denormalized at this scale and accept the storage overhead."

That is one round. The interviewer's eval is mostly about: did you ask the right clarifying questions, did your numbers make sense, did your deep dives have real substance, and could you articulate why you made each choice?

Common Pitfalls That Downlevel Strong Candidates

Drawing before clarifying. The single most common L5 → L4 downlevel reason. You started designing before you understood the problem. Recover: stop, walk back to requirements, redo.

Memorized template syndrome. You have a Twitter design memorized and you try to fit every prompt into it. Interviewers can tell within 60 seconds. The defense: practice on prompts you have never seen, with a timer, alone, until you can navigate them from first principles.

Hand-waving on storage. "We use a database" is not a system design answer. Pick a database family (relational, wide-column, document, key-value, graph) and justify it. Then talk about the actual schema.

Skipping numbers. If you do not do capacity estimation, you cannot justify any of your choices. If you do not have a number for QPS, you cannot say whether you need a cache. The numbers drive the design; without them, the design is vibes.

Talking instead of drawing. The whiteboard / virtual whiteboard exists for a reason. Strong candidates have 30+ boxes, lines, and annotations on the board by minute 30. Weak candidates have five boxes and a lot of words.

Refusing to commit. "It depends" is occasionally a good answer; as a default it is a terrible one. Pick a path, commit, and articulate the tradeoff. Interviewers want to see you make decisions under ambiguity, not perform analysis paralysis.

Recursing into other systems. "We use Kafka here, which uses ZooKeeper for coordination, which uses Zab for consensus, which is similar to Paxos..." You went down a hole. Come back up. The interviewer wants to see your design, not your knowledge of Kafka internals.

Forgetting failure modes. Strong systems are defined as much by what they do when they break as by what they do when they work. If you have not talked about replica failure, network partition, or cache stampede by minute 35, the interviewer will steer you there - and the steering itself costs signal.

How to Prepare in 30 Days

If you are starting from a reasonable baseline (you have shipped distributed systems and can talk through one):

Week 1: Read Designing Data-Intensive Applications, chapters 1-7. The vocabulary baseline - replication, partitioning, transactions, consistency models.
Week 2: Practice 5-7 classic prompts (Twitter, Uber, WhatsApp, YouTube, distributed cache, rate limiter, URL shortener). Time yourself at 45 minutes. Solo, with a whiteboard.
Week 3: Mock with someone who has actually done Google interviews. Two mocks minimum, ideally three. The feedback is what unlocks progress.
Week 4: Targeted weak-area drilling based on the mock feedback. If your capacity estimation is slow, do 20 estimation reps. If your tradeoff articulation is mushy, write out tradeoffs for every component of every prompt you have done.

Do not try to memorize designs. Memorization is what fails at Google. Practice the process - clarifying, estimating, deciding, justifying - until the process is reflexive and you can apply it to a prompt you have never seen.
