We use cookies for site analytics. Accept to help us understand how the site is used. See our Privacy Policy for details.
Politeness, deduplication, freshness, and the URL frontier. The classic crawl-the-internet question that surfaces deep distributed systems judgment.
Design a web crawler: a distributed system that fetches web pages, extracts links, follows them, stores content for downstream indexing, and returns to refresh pages over time. The output is a continuously updated corpus of the public web.
This is a foundational distributed systems question. Strong candidates demonstrate fluency in the URL frontier (the queue of work), politeness (don't hammer one host), deduplication (URL canonicalization, content hashing), and the freshness / coverage trade-off.
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
Storage
Bandwidth
Compute
Frontier
DNS
The crawler is a pipeline: URL frontier → DNS → fetcher → parser → dedupe → storage → link extractor (back to frontier). Each stage is horizontally scalable and decoupled by queues.
The hardest design decision is the URL frontier - it's the queue and the priority engine and the politeness enforcer rolled into one.
Multi-tier priority queue. Holds URLs to crawl. Implements politeness (per-host rate limit), priority (high-importance pages first), and freshness scheduling (when to re-crawl).
Resolves hostnames before fetch. Heavily cached. A miss adds 50-200ms to a fetch; aggressive caching is essential.
Stateless workers. Pull URLs from frontier, fetch HTTP/HTTPS, return raw response. Async I/O for high concurrency per node.
Strips boilerplate, extracts text content, extracts outbound links. Stateless.
URL canonicalization (lowercase, strip fragments, normalize query params) and content fingerprinting (SimHash, MinHash). Filters out duplicate content before storage.
Tiered. Recent crawls in fast storage (Cassandra, BigTable). Older versions in cold (S3, HDFS). Keyed by URL fingerprint.
Per-host rules cache. Refreshed on a separate cadence (daily). Crawlers consult before each fetch.
Decides when to re-crawl. Tracks per-URL change frequency, prioritizes pages that change often.
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
The frontier is the most subtle component. It's a priority queue with two competing constraints.
Priority
High-PageRank or high-business-value pages crawl first. Low-priority pages crawl last (or get dropped at high traffic).
Politeness
No more than N concurrent requests to a single host. Many hosts have weak servers; hammering them violates norms and gets you blocked.
Mercator-style two-tier design (the standard)
Worker loop:
This pattern naturally enforces politeness at the queue layer instead of in worker code.
Frontier persistence
30 TB of URLs doesn't fit in memory. Use Kafka or similar log-structured storage for the persistence layer; in-memory caches hold only the active working set.
Recovery
On crash, the frontier must be durable. URLs in flight (popped but not yet fetched) need an at-least-once delivery model with idempotent fetches.
Politeness is both ethical and self-protective. Sites that detect aggressive crawling block your IPs.
robots.txt
Per-host rules at /robots.txt. Specifies which paths are disallowed and the requested crawl-delay. Cache per-host with a TTL of ~24 hours; refresh proactively.
User-agent: *
Disallow: /admin/
Crawl-delay: 10
Crawl-delay enforcement
Default to 1 second per host even if robots.txt is silent. Production crawlers like Googlebot use server response time to dynamically tune politeness - if a server's response slows down, back off automatically.
Sitemap.xml
Many hosts publish a sitemap.xml listing important URLs and last-modified timestamps. Use it as a hint to prioritize new/changed content; don't trust it as ground truth (sites lie or omit).
User-Agent
Identify yourself. User-Agent: gitGoodBot/1.0 (+https://gitgood.dev/bot). Sites can contact you to complain or block. An anonymous crawler is treated as malicious.
IP rotation and rate limits
Too many requests from a single IP get rate-limited even if you're polite per-host. Distribute fetchers across multiple IPs and ASNs. For sensitive crawls, residential proxy pools.
The legal layer
Robots.txt is convention, not law. Some jurisdictions treat scraping as protected (HiQ vs LinkedIn). Others treat it as CFAA violation. Real crawlers maintain a blocklist of explicit cease-and-desists and respect them. Legal review is part of the design.
Deduplication happens at two layers: URL (don't re-fetch the same page) and content (don't re-store identical content from different URLs).
URL canonicalization
Different URLs can point to the same page:
http://Example.com/page?utm_source=twitterhttps://example.com/pagehttps://example.com/page#sectionhttps://example.com/page/Canonicalize: lowercase host, strip fragments, normalize query params (drop tracking params, sort remaining), strip trailing slash variations, normalize protocol. Hash the result; that's the URL key.
Seen URLs storage
100B URLs at 16 bytes each (hash) = 1.6 TB. Doesn't fit in any single instance.
Content deduplication: SimHash
Two URLs with the same content hash → same content. But minor variations (ad swaps, timestamp changes) defeat exact hashing.
SimHash: a locality-sensitive hash. Pages with high content similarity have low Hamming distance between their SimHashes. Threshold (e.g., < 3 bit differences) = duplicate.
Use case: detecting near-duplicate pages across mirrors and aggregators. Saves indexing cost.
Content addressability
If content is identical, store once and reference by content hash from multiple URLs. Reduces storage by 30-50% on real-world web (lots of duplicate content).
Pages change at radically different rates. CNN's homepage changes every minute; a 2008 academic paper hasn't changed in 15 years. Crawling everything at the same rate wastes capacity and misses fresh content.
Adaptive scheduling
Track per-URL change rate (Poisson process). If a page changed N times in the last 30 days, expected change interval = 30/N days. Schedule re-crawl at the expected change interval.
Priority re-crawl
Override adaptive scheduling for high-importance pages. News sites get crawled every few minutes regardless of change rate. Spam sites get crawled less often.
Change detection
Cheap version: HEAD request, compare Last-Modified or ETag. Many sites don't honor these.
Expensive version: full GET, parse, compare content hash to stored version.
Cost vs freshness budget
At 1B fetches/day, you can crawl 100M unique pages once per day (with 10x re-crawl on hot pages averaging out). Decide where to spend:
The interview signal
Candidates who acknowledge that the crawler can never be "fresh" - just "fresh enough for the use case" - score better than those who promise hourly refresh of the whole web.
The web is full of pages designed to break crawlers.
Calendar trap
A calendar widget with infinite "next month" links. The crawler follows them forever.
Mitigation: detect path patterns with steadily incrementing date components; cap depth.
Session ID trap
Every URL has a unique session ID (?sid=abc123). Each new URL looks fresh. Storage explodes.
Mitigation: drop common session-ID parameters during canonicalization.
Spider trap
Pages that link to themselves with slight URL variations. Bottomless recursion.
Mitigation: per-host crawl quota. If a host generates more than N URLs without returning ones already seen elsewhere, demote priority.
Cloaking
Server detects User-Agent: Googlebot and serves different content than to humans. Sometimes for legitimate reasons (mobile/desktop), often for SEO spam.
Mitigation: spot-check with multiple User-Agent strings; flag mismatches.
Anti-bot defenses
Cloudflare bot detection, JavaScript challenges, CAPTCHAs. Production crawlers respect "we don't want to be crawled" signals; aggressive crawlers use headless browsers and CAPTCHA-solving services.
Login walls and paywalls
Crawlable preview vs full content behind login. Decide policy: archive only the preview, or partner with the site for an authenticated feed.
Dynamic content and SPAs
JavaScript-heavy sites need a headless browser to render. 100x more expensive than static fetch. Use a separate "render-required" pipeline for sites that justify the cost.
Different consumers need different views of the corpus.
Raw HTML store
Original bytes, gzipped. Keyed by (URL, fetch_timestamp). Append-only.
Parsed content store
Extracted text, metadata, links. Keyed by URL. One row per URL (latest version).
Inverted index (for search consumers)
Built downstream from parsed content. Term → list of URLs. Sharded by term hash.
Link graph
URL → outbound URLs. Used for PageRank computation, structural analysis. Graph storage (custom) or wide-row Cassandra.
Snapshot vs incremental
Search engines run periodic snapshot indexes (e.g., daily) plus an incremental index for very fresh content (last hour). Crawler outputs to both.
Versioning
For pages that change, do you keep history?
The interview signal
Candidates who think about consumers of the corpus rather than just storage scale show product judgment. The crawler is an input to many systems; design with that in mind.
Coverage vs freshness
Crawl every page once a year (good coverage) vs crawl the top 1% every minute (good freshness). At fixed budget you choose. Real systems do tiered: high coverage + targeted freshness for high-value pages.
Politeness vs throughput
Aggressive crawling hits more pages but burns relationships, gets you blocked, and stresses target servers. Default to over-polite (1 req/sec/host) and only ramp up where explicitly allowed.
Static fetch vs JS rendering
Static fetch is 100x cheaper. JS rendering catches the increasing share of SPA-rendered content. Modern crawlers operate two pipelines and route based on content type / user-agent / explicit hints.
Centralized frontier vs distributed
Centralized frontier is simpler but a bottleneck. Distributed (per-region or per-shard) is faster but creates duplicate URL discovery. Real systems shard the frontier by URL hash with a small dedup layer.
Storage cost vs version retention
Keeping every version of every page means PB-scale growth per year. Most systems retain a small rolling window (last N versions) plus a snapshot per epoch.
Real-time crawling vs batch
Batch (process URL frontier in waves) is throughput-efficient. Real-time (stream URLs as discovered) is freshness-efficient. The trade-off shows up in pipeline complexity and queue design.
Be ready for at least three of these. The first one is almost always asked.
Consistent hashing, eviction, replication, and what really happens when a single hot key takes down the cluster.
Five algorithms, three sharding strategies, one fail-open vs fail-closed decision. The bounded design that surfaces in every backend interview loop.
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
Start an AI mock interview →