Design a Web Crawler
Politeness, deduplication, freshness, and the URL frontier. The classic crawl-the-internet question that surfaces deep distributed systems judgment.
The problem
Design a web crawler: a distributed system that fetches web pages, extracts links, follows them, stores content for downstream indexing, and returns to refresh pages over time. The output is a continuously updated corpus of the public web.
This is a foundational distributed systems question. Strong candidates demonstrate fluency in the URL frontier (the queue of work), politeness (don't hammer one host), deduplication (URL canonicalization, content hashing), and the freshness / coverage trade-off.
Clarifying questions
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
- What's the scope - the entire public web, a specific domain set, or a vertical (news, ecommerce)?
- What's the target corpus size - 1B pages, 100B pages, or a small set?
- How fresh does content need to be - news refreshed in minutes, evergreen content in weeks?
- What's downstream - search indexing, ML training data, archive, or alerting?
- Do we need to render JavaScript (full browser) or only static HTML?
- Are we crawling adversarial sites - cloaking, anti-bot defenses, login walls?
- What's the bandwidth budget - both our total ingress and how hard we're allowed to hit any one site per minute?
- Are we respecting robots.txt strictly, or are there business reasons to override?
Requirements
Functional requirements
- Fetch HTML pages from a URL frontier
- Extract outbound links and add them to the frontier
- Deduplicate URLs (canonicalization) and content (fingerprinting)
- Respect robots.txt and crawl-delay directives
- Politeness: rate-limit requests per host
- Schedule re-crawls based on observed change frequency
- Store fetched content for downstream indexing
Non-functional requirements
- Scale: 100B pages in the corpus. 1B fetches/day (re-crawl + new). 100M unique hosts. 200KB average page size. 5-year retention with tiered storage.
- Latency: per-fetch latency depends on the target server (we don't control it). Internal pipeline latency from frontier → fetch → store should add < 100ms.
- Availability: 99.9% on the crawl pipeline. A 1-hour outage means missing ~40M pages' worth of refresh - recoverable but visible. Storage availability: 99.99%.
- Consistency: eventually consistent. The crawler is fundamentally an eventually-consistent view of the web; strong consistency would mean stop-the-world snapshots, which doesn't fit the workload.
Capacity estimation
Storage
- 100B pages × 200KB raw HTML = 20 PB raw. Compressed (gzip): ~5 PB.
- Plus extracted text, metadata, links: ~30% overhead = 6.5 PB compressed total.
- Re-crawls overwrite, so storage is bounded by corpus size, not crawl rate.
Bandwidth
- 1B fetches/day × 200KB = 200 TB/day = 2.3 GB/s sustained ingress.
- At 10 Gbps per fetcher node = 1.25 GB/s per node, need ~2-3 fetcher nodes per region for the bandwidth alone, more for latency parallelism.
Compute
- HTML parsing + link extraction: ~10ms per page × 1B pages/day = ~2,800 core-hours/day = ~120 cores at 100% utilization. Realistically a few hundred cores for headroom.
- JavaScript rendering (if required): ~1-3 seconds per page × subset of pages. Adds 100-300x more compute - keep this off the hot path.
Frontier
- 100B URLs in the frontier. ~100B × 300 bytes (URL + metadata) = 30 TB. Doesn't fit in memory; needs tiered storage.
DNS
- 100M unique hosts × 1 DNS lookup per crawl day = 100M DNS lookups/day. Cache aggressively (long TTL); push to local DNS resolvers.
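A quick sanity check of these estimates (a minimal sketch; the constants mirror the assumptions above):

```python
# Back-of-envelope check of the capacity numbers (all values approximate).
PAGES = 100e9            # corpus size
FETCHES_PER_DAY = 1e9
PAGE_BYTES = 200e3       # 200 KB average page

raw_pb = PAGES * PAGE_BYTES / 1e15                           # ~20 PB raw HTML
ingress_gbs = FETCHES_PER_DAY * PAGE_BYTES / 86_400 / 1e9    # ~2.3 GB/s sustained
parse_core_hours = FETCHES_PER_DAY * 0.010 / 3600            # ~2,800 core-hours/day at 10ms/page
frontier_tb = PAGES * 300 / 1e12                             # ~30 TB of URL + metadata

print(f"raw storage: {raw_pb:.0f} PB, ingress: {ingress_gbs:.1f} GB/s")
print(f"parsing: {parse_core_hours:,.0f} core-hours/day, frontier: {frontier_tb:.0f} TB")
```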
High-level architecture
The crawler is a pipeline: URL frontier → DNS → fetcher → parser → dedupe → storage → link extractor (back to frontier). Each stage is horizontally scalable and decoupled by queues.
The hardest design decision is the URL frontier - it's the queue and the priority engine and the politeness enforcer rolled into one.
URL frontier
Multi-tier priority queue. Holds URLs to crawl. Implements politeness (per-host rate limit), priority (high-importance pages first), and freshness scheduling (when to re-crawl).
DNS resolver
Resolves hostnames before fetch. Heavily cached. A miss adds 50-200ms to a fetch; aggressive caching is essential.
Fetcher pool
Stateless workers. Pull URLs from frontier, fetch HTTP/HTTPS, return raw response. Async I/O for high concurrency per node.
Parser
Strips boilerplate, extracts text content, extracts outbound links. Stateless.
Dedupe service
URL canonicalization (lowercase, strip fragments, normalize query params) and content fingerprinting (SimHash, MinHash). Filters out duplicate content before storage.
Content store
Tiered. Recent crawls in fast storage (Cassandra, BigTable). Older versions in cold (S3, HDFS). Keyed by URL fingerprint.
Robots.txt cache
Per-host rules cache. Refreshed on a separate cadence (daily). Crawlers consult before each fetch.
Scheduler
Decides when to re-crawl. Tracks per-URL change frequency, prioritizes pages that change often.
Deep dives
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
1. URL frontier: the heart of the crawler
The frontier is the most subtle component. It's a priority queue with two competing constraints.
Priority
High-PageRank or high-business-value pages crawl first. Low-priority pages crawl last (or get dropped at high traffic).
Politeness
No more than N concurrent requests to a single host. Many hosts have weak servers; hammering them violates norms and gets you blocked.
Mercator-style two-tier design (the standard)
- Front queues: one queue per priority level. Higher-priority queues are drawn from more often.
- Back queues: per-host queue. URLs from front queues are routed to back queues by host hash. Each back queue feeds at most one fetcher worker at a time, enforcing per-host concurrency = 1.
Worker loop:
- Pick a back queue using a priority weighting (higher-priority hosts get picked more often).
- Pop the next URL.
- Fetch.
- Wait for the per-host crawl-delay before this back queue is eligible again.
This pattern naturally enforces politeness at the queue layer instead of in worker code.
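A minimal single-process sketch of the back-queue mechanics, assuming one FIFO per host and a heap keyed by next-eligible time (a production frontier persists the queues and parks workers instead of sleeping):

```python
import heapq
import time
from collections import defaultdict, deque

class BackQueues:
    """Mercator-style back queues: one FIFO per host, plus a heap of
    (next_eligible_time, host) that enforces per-host concurrency = 1."""

    def __init__(self, default_delay: float = 1.0):
        self.queues = defaultdict(deque)    # host -> FIFO of URLs for that host
        self.heap = []                      # (next_eligible_time, host), one entry per active host
        self.default_delay = default_delay  # seconds between fetches to the same host

    def add(self, host: str, url: str) -> None:
        if not self.queues[host]:           # host was idle: make it eligible now
            heapq.heappush(self.heap, (time.monotonic(), host))
        self.queues[host].append(url)

    def next_url(self):
        """Return (host, url) for the next eligible host. Assumes a non-empty
        frontier; a real frontier parks the worker instead of sleeping."""
        while True:
            eligible_at, host = heapq.heappop(self.heap)
            now = time.monotonic()
            if eligible_at > now:
                heapq.heappush(self.heap, (eligible_at, host))
                time.sleep(eligible_at - now)
                continue
            url = self.queues[host].popleft()
            if self.queues[host]:           # host becomes eligible again after the crawl-delay
                heapq.heappush(self.heap, (now + self.default_delay, host))
            return host, url
```

Because a host appears in the heap at most once, no two workers can fetch from the same host concurrently - the politeness invariant lives entirely in the queue structure.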
Frontier persistence
30 TB of URLs doesn't fit in memory. Use Kafka or similar log-structured storage for the persistence layer; in-memory caches hold only the active working set.
Recovery
On crash, the frontier must be durable. URLs in flight (popped but not yet fetched) need an at-least-once delivery model with idempotent fetches.
2. Politeness and robots.txt
Politeness is both ethical and self-protective. Sites that detect aggressive crawling block your IPs.
robots.txt
Per-host rules at /robots.txt. Specifies which paths are disallowed and the requested crawl-delay. Cache per-host with a TTL of ~24 hours; refresh proactively.
```
User-agent: *
Disallow: /admin/
Crawl-delay: 10
```
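Python's standard library covers the consult step. A sketch using urllib.robotparser (the shared TTL cache and error handling are left out):

```python
from urllib.robotparser import RobotFileParser

# Fetch a host's robots.txt once, then consult it before each fetch. In
# production the parsed rules live in a shared cache with a ~24h TTL, and
# read() is wrapped with timeouts and error handling.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

agent = "gitGoodBot/1.0"
allowed = rp.can_fetch(agent, "https://example.com/admin/settings")
delay = rp.crawl_delay(agent) or 1.0  # default to 1s when robots.txt is silent
print(allowed, delay)
```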
Crawl-delay enforcement
Default to 1 second per host even if robots.txt is silent. Production crawlers like Googlebot use server response time to dynamically tune politeness - if a server's response slows down, back off automatically.
Sitemap.xml
Many hosts publish a sitemap.xml listing important URLs and last-modified timestamps. Use it as a hint to prioritize new/changed content; don't trust it as ground truth (sites lie or omit).
User-Agent
Identify yourself. User-Agent: gitGoodBot/1.0 (+https://gitgood.dev/bot). Sites can contact you to complain or block. An anonymous crawler is treated as malicious.
IP rotation and rate limits
Too many requests from a single IP get rate-limited even if you're polite per-host. Distribute fetchers across multiple IPs and ASNs. For sensitive crawls, residential proxy pools.
The legal layer
Robots.txt is convention, not law. Some jurisdictions treat scraping of public data as protected (hiQ v. LinkedIn); others treat it as a CFAA violation. Real crawlers maintain a blocklist of explicit cease-and-desists and respect them. Legal review is part of the design.
3. Deduplication: URLs and content
Deduplication happens at two layers: URL (don't re-fetch the same page) and content (don't re-store identical content from different URLs).
URL canonicalization
Different URLs can point to the same page:
- http://Example.com/page?utm_source=twitter
- https://example.com/page
- https://example.com/page#section
- https://example.com/page/
Canonicalize: lowercase host, strip fragments, normalize query params (drop tracking params, sort remaining), strip trailing slash variations, normalize protocol. Hash the result; that's the URL key.
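A sketch of these rules in Python (the tracking-param list and the http→https rewrite are illustrative policy choices; production canonicalizers apply many more rules):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid", "sid"}

def canonicalize(url: str) -> str:
    """Normalize a URL so equivalent variants map to the same key."""
    scheme, netloc, path, query, _fragment = urlsplit(url)   # drop the fragment
    netloc = netloc.lower()
    if scheme == "http":
        scheme = "https"                                     # protocol normalization (policy choice)
    params = sorted((k, v) for k, v in parse_qsl(query)
                    if k.lower() not in TRACKING_PARAMS)     # drop tracking params, sort the rest
    path = path.rstrip("/") or "/"                           # unify trailing-slash variants
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))

# canonicalize("http://Example.com/page?utm_source=twitter") == "https://example.com/page"
```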
Seen URLs storage
100B URLs at 16 bytes each (hash) = 1.6 TB. Too large to hold in memory on a single instance.
- Bloom filter for the "definitely not seen" check. Per-region, sized for a ~1% false-positive rate. A false positive makes a genuinely new URL look already-seen, so confirm against the persistent store before dropping it - otherwise coverage is silently lost (see the sketch below).
- Persistent seen-set (Cassandra, RocksDB) for confirmed dedup.
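A sketch of the two-step check; `bloom` and `seen_store` are stand-ins for the per-region Bloom filter and the persistent seen-set:

```python
def is_new_url(url_hash: bytes, bloom, seen_store) -> bool:
    """Two-step dedup: cheap probabilistic filter first, durable store second."""
    if url_hash not in bloom:       # "definitely not seen": safe to schedule
        bloom.add(url_hash)
        seen_store.put(url_hash)
        return True
    # "Maybe seen": could be a false positive, so confirm against the
    # persistent store before dropping the URL.
    if seen_store.contains(url_hash):
        return False                # confirmed duplicate
    bloom.add(url_hash)             # genuinely new URL that collided in the filter
    seen_store.put(url_hash)
    return True
```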
Content deduplication: SimHash
Two URLs with the same content hash → same content. But minor variations (ad swaps, timestamp changes) defeat exact hashing.
SimHash: a locality-sensitive hash. Pages with high content similarity have low Hamming distance between their SimHashes. Threshold (e.g., < 3 bit differences) = duplicate.
Use case: detecting near-duplicate pages across mirrors and aggregators. Saves indexing cost.
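A compact SimHash sketch (64-bit, unweighted tokens; real implementations weight shingles by frequency):

```python
import hashlib

def simhash(tokens, bits=64) -> int:
    """SimHash: each token's hash casts a +/-1 vote per bit position; the
    sign of each bit's running total becomes that bit of the fingerprint."""
    votes = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.blake2b(tok.encode(), digest_size=8).digest(), "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Near-duplicate test: hamming(simhash(tokens_a), simhash(tokens_b)) < 3
```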
Content addressability
If content is identical, store it once and reference it by content hash from multiple URLs. Reduces storage by 30-50% on the real-world web (lots of duplicate content).
4. Freshness: when to re-crawl
Pages change at radically different rates. CNN's homepage changes every minute; a 2008 academic paper hasn't changed in 15 years. Crawling everything at the same rate wastes capacity and misses fresh content.
Adaptive scheduling
Track per-URL change rate (Poisson process). If a page changed N times in the last 30 days, expected change interval = 30/N days. Schedule re-crawl at the expected change interval.
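A sketch of that heuristic, with illustrative clamping bounds:

```python
from datetime import datetime, timedelta

def next_crawl_time(changes_last_30d: int, last_crawl: datetime) -> datetime:
    """If a page changed N times in 30 days, schedule the next crawl one
    expected change interval (30/N days) out, clamped to sane bounds."""
    n = max(changes_last_30d, 1)                  # static pages: treat as 1 change/30d
    interval_days = min(max(30 / n, 1 / 96), 90)  # clamp between 15 minutes and 90 days
    return last_crawl + timedelta(days=interval_days)
```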
Priority re-crawl
Override adaptive scheduling for high-importance pages. News sites get crawled every few minutes regardless of change rate. Spam sites get crawled less often.
Change detection
Cheap version: HEAD request, compare Last-Modified or ETag. Many sites don't honor these.
Expensive version: full GET, parse, compare content hash to stored version.
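The cheap check can also be phrased as a conditional GET: the server answers 304 Not Modified with no body when the validators still match, avoiding a second round trip when the page did change. A sketch using the requests library:

```python
import requests  # any HTTP client works; requests shown for brevity

def changed_since_last_fetch(url: str, etag: str = None, last_modified: str = None):
    """Conditional GET: 304 means unchanged; 200 carries the fresh content."""
    headers = {"User-Agent": "gitGoodBot/1.0 (+https://gitgood.dev/bot)"}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=10)
    return resp.status_code != 304, resp
```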
Cost vs freshness budget
At 1B fetches/day against a 100B-page corpus, a full refresh takes 100 days - you can't refresh everything even monthly, so allocation is the real decision. Decide where to spend:
- 50% of budget: top 1M URLs (re-crawled every ~3 minutes).
- 30% of budget: next 50M URLs (re-crawled ~6 times per day).
- 20% of budget: discovery (new URLs from the frontier).
The interview signal
Candidates who acknowledge that the crawler can never be "fresh" - just "fresh enough for the use case" - score better than those who promise hourly refresh of the whole web.
5. Crawl traps and adversarial sites
The web is full of pages designed to break crawlers.
Calendar trap
A calendar widget with infinite "next month" links. The crawler follows them forever.
Mitigation: detect path patterns with steadily incrementing date components; cap depth.
Session ID trap
Every URL has a unique session ID (?sid=abc123). Each new URL looks fresh. Storage explodes.
Mitigation: drop common session-ID parameters during canonicalization.
Spider trap
Pages that link to themselves with slight URL variations. Bottomless recursion.
Mitigation: per-host crawl quota. If a host generates more than N URLs without returning ones already seen elsewhere, demote priority.
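A sketch of the per-host quota idea; the thresholds are illustrative:

```python
from collections import defaultdict

class HostQuota:
    """Demote hosts that emit endless novel URLs (spider/calendar traps).
    Thresholds are illustrative; tune per corpus."""

    def __init__(self, max_new_urls_per_day: int = 10_000, max_path_depth: int = 15):
        self.new_urls = defaultdict(int)  # host -> novel URLs admitted today
        self.max_new_urls = max_new_urls_per_day
        self.max_path_depth = max_path_depth

    def admit(self, host: str, path: str) -> bool:
        if path.count("/") > self.max_path_depth:   # caps infinitely deepening paths
            return False
        self.new_urls[host] += 1
        return self.new_urls[host] <= self.max_new_urls  # over quota: drop or demote
```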
Cloaking
Server detects User-Agent: Googlebot and serves different content than to humans. Sometimes for legitimate reasons (mobile/desktop), often for SEO spam.
Mitigation: spot-check with multiple User-Agent strings; flag mismatches.
Anti-bot defenses
Cloudflare bot detection, JavaScript challenges, CAPTCHAs. Production crawlers respect "we don't want to be crawled" signals; aggressive crawlers use headless browsers and CAPTCHA-solving services.
Login walls and paywalls
Crawlable preview vs full content behind login. Decide policy: archive only the preview, or partner with the site for an authenticated feed.
Dynamic content and SPAs
JavaScript-heavy sites need a headless browser to render. 100x more expensive than static fetch. Use a separate "render-required" pipeline for sites that justify the cost.
6. Storage layout and downstream consumers
Different consumers need different views of the corpus.
Raw HTML store
Original bytes, gzipped. Keyed by (URL, fetch_timestamp). Append-only.
- Hot tier (last 30 days): Cassandra/BigTable. Sub-second random access.
- Cold tier (>30 days): S3/HDFS in larger blobs. Batch-only access.
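A sketch of the tiered read path under these assumptions; `hot_store` and `cold_store` are stand-ins for Cassandra/BigTable and S3/HDFS, and the blob partitioning scheme is hypothetical:

```python
from datetime import datetime, timezone

HOT_WINDOW_DAYS = 30  # matches the tiering above

def read_snapshot(url_hash: str, fetch_ts: datetime, hot_store, cold_store):
    """Tiered read path: point reads in the hot store, blob scans in the cold."""
    age_days = (datetime.now(timezone.utc) - fetch_ts).days
    if age_days <= HOT_WINDOW_DAYS:
        return hot_store.get((url_hash, fetch_ts))       # sub-second point read
    # Cold tier packs many snapshots into large immutable blobs; locate the
    # blob by a (hash-prefix, day) partition and extract within it.
    blob_key = (url_hash[:4], fetch_ts.date().isoformat())
    return cold_store.read_blob(blob_key, url_hash, fetch_ts)
```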
Parsed content store
Extracted text, metadata, links. Keyed by URL. One row per URL (latest version).
Inverted index (for search consumers)
Built downstream from parsed content. Term → list of URLs. Sharded by term hash.
Link graph
URL → outbound URLs. Used for PageRank computation, structural analysis. Graph storage (custom) or wide-row Cassandra.
Snapshot vs incremental
Search engines run periodic snapshot indexes (e.g., daily) plus an incremental index for very fresh content (last hour). Crawler outputs to both.
Versioning
For pages that change, do you keep history?
- Internet Archive: yes, full versioned corpus.
- Search engines: usually no, just latest version + a short rolling history for change detection.
- Web Vault / archive use cases: yes, multi-version.
The interview signal
Candidates who think about consumers of the corpus rather than just storage scale show product judgment. The crawler is an input to many systems; design with that in mind.
Trade-offs
Coverage vs freshness
Crawl every page once a year (good coverage) vs crawl the top 1% every minute (good freshness). At fixed budget you choose. Real systems do tiered: high coverage + targeted freshness for high-value pages.
Politeness vs throughput
Aggressive crawling hits more pages but burns relationships, gets you blocked, and stresses target servers. Default to over-polite (1 req/sec/host) and only ramp up where explicitly allowed.
Static fetch vs JS rendering
Static fetch is 100x cheaper. JS rendering catches the increasing share of SPA-rendered content. Modern crawlers operate two pipelines and route based on content type / user-agent / explicit hints.
Centralized frontier vs distributed
Centralized frontier is simpler but a bottleneck. Distributed (per-region or per-shard) is faster but creates duplicate URL discovery. Real systems shard the frontier by URL hash with a small dedup layer.
Storage cost vs version retention
Keeping every version of every page means PB-scale growth per year. Most systems retain a small rolling window (last N versions) plus a snapshot per epoch.
Real-time crawling vs batch
Batch (process URL frontier in waves) is throughput-efficient. Real-time (stream URLs as discovered) is freshness-efficient. The trade-off shows up in pipeline complexity and queue design.
Common follow-up questions
Be ready for at least three of these. The first one is almost always asked.
- How do you detect when a previously-good site has been compromised (malware injection, defacement)?
- How do you handle a site that returns 200 OK with an empty body (silent failure)?
- What changes if you need to crawl 1B pages per hour instead of per day?
- How would you rebuild the corpus from scratch if it was corrupted?
- What's your strategy for detecting and demoting spam / link-farm sites?
- How do you handle GDPR / right-to-be-forgotten requests retroactively?
- How would you design the system to support archived multi-version retrieval (Internet Archive style)?
- What happens during a network partition between the frontier and the fetchers?
Related system design topics
- Distributed Cache (Hard): consistent hashing, eviction, replication, and what really happens when a single hot key takes down the cluster.
- Rate Limiter (Medium): five algorithms, three sharding strategies, one fail-open vs fail-closed decision. The bounded design that surfaces in every backend interview loop.
Practice in interview format
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.