Design Search and Autocomplete (Elasticsearch-style)
Inverted indexes, BM25 ranking, prefix tries, and the p99 < 100ms latency budget that drives every architectural choice.
The problem
Design a search system that serves two related queries: (1) full-text search over a document corpus (products, articles, jobs, code), returning ranked results in under 100ms p99; (2) autocomplete suggestions as the user types, returning the top 10 candidates in under 50ms p99. Both must handle continuous index updates as the underlying corpus changes.
This problem is graded on three things: the choice of inverted index structure and how it shards, the ranking pipeline (lexical → learned), and the latency engineering that makes p99 < 100ms achievable. Strong candidates open with the latency budget and design backwards from it.
Clarifying questions
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
- What's the corpus - documents (text-heavy), products (structured), code (tokens), or mixed?
- What's the corpus size - millions, billions, or trillions of documents?
- Index update frequency - real-time (search-as-you-publish), near-real-time (minutes), or batch (hours)?
- What's the query volume - peak QPS, geographic distribution?
- Ranking requirements - pure lexical (BM25), learned-to-rank, personalized, or combination?
- Are there filters and facets (price range, category, date)? How dynamic?
- Spell correction, synonyms, multi-language support?
- Strict latency budget - typical product search is p99 < 100ms; is this fixed?
Requirements
Functional requirements
- GET /search?q=<query>&filters=...&page=N - returns ranked results with pagination
- GET /autocomplete?prefix=<chars> - returns top 10 query completions
- Index updates: insert, update, delete documents
- Filtering and faceted navigation (price ranges, categories, brand)
- Spell correction ("did you mean...")
- Personalization signals (user history, location)
- Multi-language tokenization and stemming
Non-functional requirements
- Scale: 1B documents in the corpus, 100K queries/sec at peak, 10M document updates/day. Multi-region read replicas. ~5 TB raw corpus, ~25 TB indexed (with replicas).
- Latency: search p99 < 100ms (the search bar is on the user-facing path; every additional 100ms drops conversion measurably). Autocomplete p99 < 50ms (must keep up with typing). Index lag p95 < 60 seconds.
- Availability: 99.95% for queries. Brief query outages degrade UX but don't lose data. Index updates can tolerate minutes of lag; complete outages of indexing are recoverable from the source of truth.
- Consistency: eventually consistent. A document just published can take seconds to appear in search results. Users tolerate this; the alternative (synchronous indexing on the write path) ruins write throughput.
Capacity estimation
Storage
- Corpus: 1B docs × 5KB avg = 5 TB raw.
- Inverted index: posting lists for each unique term. ~100M unique terms × ~10K avg docs per term × 8 bytes/posting = 8 TB. With positional info (for phrase queries): 2-3x = 20-25 TB.
- Term dictionary: 100M terms × 50 bytes (term + offset into postings) = 5 GB. Fits in memory per shard.
- Replication: 3x for availability. Total: ~75 TB across the fleet.
Throughput
- 100K queries/sec at peak. Each query touches ~5-10 shards (parallel sub-queries). Net per-shard query rate: ~10-50K QPS depending on shard count.
- Indexing: 10M updates/day = ~120 updates/sec average, peak ~1K/sec. Per-shard indexing throughput is rarely the bottleneck.
Memory
- Term dictionary in RAM (5 GB / shard count). Trivial.
- Hot posting lists (for common terms) in RAM. Maybe 10-20% of postings = 2-5 TB cached across the fleet.
- Filter caches (for frequent filter combinations): a few GB per shard.
Network
- Query: ~1KB request fan-out to N shards × ~10KB response per shard = 100KB internal traffic per query. At 100K QPS = 10 GB/s internal traffic. Significant - inter-AZ bandwidth costs add up.
Autocomplete
- Trie (or FST) of common queries: ~10M unique queries × 30 bytes avg = 300 MB. Fully in-memory per autocomplete node. Replicated everywhere.
High-level architecture
The system splits cleanly into three pipelines:
- Indexing pipeline (write path): document source → tokenization/analysis → index builders → shard distribution → searchable. Async; can lag the source by seconds.
- Query pipeline (read path): query parsing → fan-out to shards → per-shard scoring → merge + re-rank → response. Synchronous; on the user-facing latency budget.
- Autocomplete pipeline: separate, much smaller, prefix-trie based. Updated nightly from query logs.
The indexing/query split mirrors the read-write asymmetry seen in many systems but with one crucial difference: search reads aren't simple lookups, they're complex multi-shard scoring operations. The latency budget drives every architectural choice on the read path.
Document ingester
Consumes document changes from the source (Kafka topic, CDC from primary DB). Validates, normalizes, and forwards to the indexing pipeline.
Analyzer / tokenizer
Per-language tokenization (whitespace + Unicode for English; specialized for CJK, Arabic, etc.). Stemming (running → run). Stopword removal. Lowercasing. Synonym expansion.
Index builder
Builds inverted index segments. Each segment is an immutable Lucene-style file with term dictionary, posting lists, doc values (for filtering/sorting), and stored fields.
Shard / segment manager
Distributes new segments to shards. Manages segment merging (small segments → larger ones for query efficiency). Coordinates replica synchronization.
Query router (coordinator)
Receives query, parses, plans execution, fans out to relevant shards, gathers and merges results, applies global re-ranking, returns to client.
Shard (query node)
Holds a partition of the inverted index. Scores documents against the query (BM25), filters, returns top-K with scores to coordinator.
Re-ranking service
Optional second pass. Top 100-500 candidates from shards re-scored with a heavier ML model that uses features (user signals, freshness, business rules) too expensive to compute at lexical scoring time.
Autocomplete service
Independent fleet. Holds a trie / FST of common queries. Returns top-10 completions for a prefix in under 10ms.
Spell-correction service
Levenshtein-distance lookup against a dictionary. Triggered on low-confidence queries ("did you mean X?"). Often integrated with autocomplete.
Deep dives
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
1. Inverted index: the data structure that makes search fast
Naive search scans every document for the query term. At 1B documents, that's impossible. The inverted index flips the relationship: term → list of documents containing the term.
Posting list
For each term, a sorted list of (doc_id, term_frequency, [positions]) tuples. Sorted by doc_id for fast intersection.
For "design system" query:
- term "design" → [doc_5, doc_12, doc_47, doc_103, ...]
- term "system" → [doc_12, doc_47, doc_88, doc_103, ...]
- intersection (AND): [doc_12, doc_47, doc_103, ...]
Intersection of sorted lists is O(N + M) using a two-pointer merge. With skip lists (every Kth entry indexed), it becomes O(N/K + M/K) for queries that prune aggressively.
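A minimal sketch of the two-pointer merge, using the example lists above (the function and variable names are illustrative, not from any real engine):

```python
def intersect(postings_a, postings_b):
    """Two-pointer merge of doc-id-sorted posting lists: O(N + M)."""
    i, j, result = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        a, b = postings_a[i], postings_b[j]
        if a == b:
            result.append(a)   # doc contains both terms
            i, j = i + 1, j + 1
        elif a < b:
            i += 1             # advance whichever pointer lags
        else:
            j += 1
    return result

design = [5, 12, 47, 103]      # doc ids only; real postings also carry TF and positions
system = [12, 47, 88, 103]
print(intersect(design, system))   # [12, 47, 103]
```

A skip-list variant replaces the single-step advances with block jumps when the other list's current doc id is far ahead.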
Term dictionary
Maps term → file offset of the posting list. Implementations:
- Hash table: O(1) lookup, no prefix queries.
- B-tree: O(log N) lookup, supports prefix iteration (needed for wildcard, autocomplete).
- FST (finite state transducer): O(prefix length) lookup; shared prefixes and suffixes save memory. Lucene's choice.
Compression
Posting lists are huge but compressible. Doc IDs are sorted - delta-encode them (store [5, 7, 35, 36] as [5, +2, +28, +1]). Then variable-byte encoding or PFor encoding compresses small deltas. 5-10x compression ratios are typical.
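A sketch of delta plus variable-byte encoding, assuming the textbook convention that the final byte of each value has its high bit set (real engines use block-oriented schemes like PForDelta, but the idea is the same):

```python
def delta_encode(doc_ids):
    """Sorted doc ids -> gaps: [5, 7, 35, 36] -> [5, 2, 28, 1]."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def varbyte_encode(numbers):
    """7 data bits per byte; the high bit marks the final byte of each value."""
    out = bytearray()
    for n in numbers:
        chunks = []
        while True:
            chunks.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunks[0] |= 0x80                 # flag the lowest-order (final) byte
        out.extend(reversed(chunks))      # most-significant byte first
    return bytes(out)

gaps = delta_encode([5, 7, 35, 36])
print(gaps)                               # [5, 2, 28, 1]
print(len(varbyte_encode(gaps)))          # 4 bytes, vs 4 x 8-byte raw postings
```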
Segment-based indexing (Lucene model)
Indexes are immutable. New documents go into a new segment (a small in-memory buffer flushed to disk as a complete unit). Periodic background merges combine small segments into larger ones to keep query-time fan-out manageable.
Pros: writes never block reads (new segment is independent). Easy to replicate (segments are files; ship to replicas). Easy to search across segments (each segment is a mini-index).
Cons: deletes are tombstones until the next merge. A document updated 100 times has 100 tombstones until merge.
Why immutability
Mutating posting lists in place under concurrent reads is a synchronization nightmare. Immutable segments plus background merges trade write amplification for query-path simplicity. Lucene has been doing it this way for 20 years; it works.
2. BM25, TF-IDF, and learned-to-rank
Scoring is what separates "find documents containing the terms" (trivial) from "rank them by relevance" (the actual product).
TF-IDF (the historical baseline)
score(doc, query) = sum over terms of: TF(term, doc) × IDF(term)
- TF: how often the term appears in the doc (more = more relevant).
- IDF: log(total docs / docs containing term). Rare terms are more discriminating.
Problems: doesn't account for document length (long docs get high TF artificially). No saturation (a doc with the term 1000 times isn't 1000x as relevant).
BM25 (the modern lexical default)
Refines TF-IDF with two parameters:
- k1: TF saturation (typical 1.2-2.0). After ~5 occurrences, more occurrences add diminishing relevance.
- b: length normalization (typical 0.75). Longer docs get penalized to balance against short docs.
BM25 is the default in Elasticsearch, Lucene, Solr. For pure lexical queries on text, it's hard to beat without ML.
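A sketch of the per-term BM25 contribution using the Lucene-style IDF (the log(1 + ...) form keeps it non-negative); a document's score is the sum of this over the query terms:

```python
import math

def bm25_term(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document's score."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# Saturation in action: the 2nd occurrence is worth far more than the 100th.
for tf in (1, 2, 5, 100):
    print(tf, round(bm25_term(tf, doc_len=120, avg_doc_len=100,
                              n_docs=1_000_000, doc_freq=1_000), 3))
```

Running it shows the two knobs at work: k1 caps the benefit of repeated occurrences, and b discounts the term frequency in documents longer than average.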
Learned-to-rank (the modern winner for product search)
Train a model (gradient boosted trees - XGBoost, LightGBM - or neural ranker) on (query, document, relevance) tuples. Features include:
- BM25 score (input feature, not the only signal)
- Click-through rate (historical CTR for this query-doc pair)
- Document features (price, rating, popularity)
- User features (purchase history, location, device)
- Recency (how new is the document)
- Business signals (sponsored, in-stock, margin)
Training data comes from clickstream logs (clicked docs are positives, skipped ones are negatives). Pairwise loss (RankNet, LambdaMART) learns relative ordering, not absolute scores.
Two-stage ranking (production architecture)
Lexical score (BM25) is fast - run on every shard, return top-K (e.g., top 1000). Re-ranking model is expensive (per-doc inference) - run only on the top-K from coordinator after merge.
Stage 1 (per-shard, cheap): BM25 + filters. Returns top 1000. ~10ms per shard.
Stage 2 (coordinator, expensive): ML re-rank top 100-500. ~30-50ms total.
This balances comprehensiveness (BM25 catches the right candidates) with quality (ML picks the right order).
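A sketch of the coordinator side of this flow. The `Hit` type and the `rerank_model.predict(...)` interface are assumptions standing in for the real per-shard response format and the learned ranker, not any specific library:

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Hit:
    doc_id: int
    bm25_score: float
    features: dict = field(default_factory=dict)   # CTR, price, recency, user signals...

def two_stage_rank(shard_results, rerank_model, k_rerank=500, k_final=50):
    """shard_results: per-shard top-K Hit lists (stage 1 already ran on the shards).
    rerank_model: anything exposing predict(feature_dicts) -> scores (placeholder)."""
    # Merge stage-1 candidates; keep only the best few hundred by lexical score.
    merged = [hit for shard in shard_results for hit in shard]
    shortlist = heapq.nlargest(k_rerank, merged, key=lambda h: h.bm25_score)

    # Stage 2: one batched inference call over the shortlist, then final ordering.
    scores = rerank_model.predict([h.features for h in shortlist])
    reranked = sorted(zip(shortlist, scores), key=lambda pair: pair[1], reverse=True)
    return [hit for hit, _ in reranked[:k_final]]
```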
Personalization
User-specific features change ranking per user. Cache user features in Redis; lookup is sub-millisecond. Avoid putting per-user data into the index itself - it explodes the index size.
3. Sharding strategies and the fan-out cost
1B documents do not fit on one node. Sharding is required. The choice of shard key determines query patterns and operational complexity.
Document-based sharding (the standard)
Each document hashed to a shard by doc_id. All shards store full inverted indexes for their documents. Queries fan out to ALL shards (each shard might have matching docs); coordinator merges.
Pros: even data distribution, simple writes (one shard per doc), easy to add capacity (scale shards).
Cons: every query fans out N times. Latency = max(per-shard latency) + merge cost. Network amplification = N.
Term-based sharding
Each term hashed to a shard. A shard owns the full posting list for terms it owns. Queries hit only the shards containing query terms (potentially 1 shard for a single-term query).
Pros: less fan-out for common queries.
Cons: hot shards on common terms ("a", "the" - though stopwords mitigate). Multi-term queries require shipping posting lists between shards. Re-sharding is brutal.
In practice, document-based wins. The fan-out cost is real but manageable; the alternative is much worse operationally.
Replica strategy
Each shard has 2-3 replicas for availability and read throughput. Queries can be served by any replica - load-balance across them. Indexing writes go to the primary, replicate async.
Routing for tenanted systems
If your search is per-tenant (each customer has their own corpus), shard by tenant_id. Then queries hit only one shard per tenant. Massive savings - no fan-out at all. Only works if no cross-tenant queries are needed.
Adaptive sharding
Hot tenants get their own dedicated shards; small tenants share. Re-balance periodically based on traffic.
Coordinator latency engineering
Coordinator's job: parse query, fan out, gather, merge, return. The merge step is bounded - merging top-1000 from each of 50 shards = 50K candidates → top-100 → ~1ms. The dominant latency is the slowest shard's response (max latency, not average).
To control max latency: hedged (speculative) requests - send the query to a second replica if the first hasn't answered within a small delay, and take whichever responds first. Costs some duplicate work but cuts p99 dramatically. Elasticsearch attacks the same tail-latency problem more cheaply with adaptive replica selection, which routes each request to the replica that has recently been responding fastest rather than duplicating it.
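A sketch of the hedged-request idea using asyncio; `replica.search(query)` is a placeholder for the real async shard RPC. Delaying the duplicate (rather than always sending two) keeps the extra load small:

```python
import asyncio

async def hedged_search(replicas, query, hedge_after=0.02):
    """Query the primary replica; if no answer within `hedge_after` seconds,
    fire a duplicate at a second replica and return whichever finishes first."""
    first = asyncio.create_task(replicas[0].search(query))
    done, _ = await asyncio.wait({first}, timeout=hedge_after)
    if done:
        return first.result()                      # fast path: no hedge needed

    backup = asyncio.create_task(replicas[1].search(query))
    done, pending = await asyncio.wait({first, backup},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                              # don't waste work on the loser
    return done.pop().result()
```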
4. Autocomplete: the trie, the FST, and the popularity layer
Autocomplete is a different beast from full-text search. The query is short (1-20 chars), the response is small (top 10), and the latency budget is tight (p99 < 50ms).
Data structure: prefix trie
Each node is a character; paths from root to leaf spell out queries. Lookup of "siri" walks 4 nodes; the subtree under "siri" contains all completions.
For 10M unique queries × 30 chars avg = 300M characters, i.e. up to ~300M nodes before prefix sharing. At ~10-20 bytes/node that's 3-6 GB raw. Compressible.
FST (finite state transducer) - the production winner
A trie shares prefixes (good); an FST shares both prefixes AND suffixes. "test", "rest", "best", "fest" all share the "est" suffix. FSTs compress to a small fraction of trie size - tens of MB for 10M queries.
Used by Lucene's suggester, Algolia, Elasticsearch's completion suggester.
Ranking completions
Just listing all prefix matches doesn't work - you need the BEST 10. Each query in the trie has a popularity score (search count). At each node, store the top-K completions in the subtree, pre-sorted by score.
On lookup: walk to the prefix node, read pre-computed top-K. O(prefix length + K). Sub-millisecond.
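A toy version of this layout: each node keeps its subtree's top-K pre-sorted, so lookups never walk the subtree. A real system would use an FST and far more compact storage; the example queries and scores are made up:

```python
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.top = []                  # pre-sorted (score, query) pairs for this subtree

def build(queries, k=10):
    root = TrieNode()
    for query, score in queries:
        node = root
        for ch in query:
            node = node.children[ch]
            node.top.append((score, query))
            node.top.sort(reverse=True)        # fine offline; use a heap for big builds
            del node.top[k:]                   # keep only the subtree's top-K
    return root

def complete(root, prefix):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [q for _, q in node.top]            # O(len(prefix) + K), no subtree walk

root = build([("siri shortcuts", 90), ("siri vs alexa", 70), ("sirloin recipe", 40)])
print(complete(root, "siri"))                  # ['siri shortcuts', 'siri vs alexa']
```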
Updating popularity
Daily batch job ingests query logs from the previous N days. Recomputes popularity scores. Rebuilds the FST. Hot-swaps it on autocomplete nodes.
Real-time popularity (a query suddenly trending) is a separate signal layer - small in-memory hot-trends index updated every few minutes, merged with the static popularity at lookup.
Personalization
Per-user query history lives in a separate index (Redis). On autocomplete request: lookup top static suggestions + top user-historical matches, merge. Users see their own past queries weighted up.
Spell correction integration
If the prefix has no good completions (low scores), try edit-distance lookups. "siiri" → suggest "siri" via Levenshtein-1. Triggered on low-confidence prefixes, not always.
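A sketch of the brute-force generate-and-test approach for edit distance 1 (essentially Norvig's spell corrector; production systems typically use a Levenshtein automaton or precomputed deletes instead). The example dictionary and counts are invented:

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """Every string within edit distance 1 of `word` (delete, transpose, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = {l + r[1:]               for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces   = {l + c + r[1:]           for l, r in splits if r for c in alphabet}
    inserts    = {l + c + r               for l, r in splits for c in alphabet}
    return deletes | transposes | replaces | inserts

def suggest(prefix, known_queries):
    """Return the most popular known query within edit distance 1, if any."""
    candidates = edits1(prefix) & known_queries.keys()
    return max(candidates, key=known_queries.get, default=None)

print(suggest("siiri", {"siri": 1_000_000, "sri": 10_000}))   # siri
```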
Multi-language
One FST per language. Detect user's language from request headers / profile. Some apps run cross-language autocomplete (transliteration: typing "shukriya" suggests "thank you" in English).
Latency engineering
The autocomplete fleet is geographically distributed - serve from the user's region. FSTs are small enough to replicate everywhere. No fan-out, no coordinator, no merge - just trie traversal. Sub-10ms is achievable.
5. Indexing pipeline: real-time vs near-real-time vs batch
Updates to the corpus must reach the search index. The freshness target drives the pipeline architecture.
Batch (hours)
Source DB → nightly ETL → rebuild full index → swap. Simple, robust, terrible freshness. Use only if your corpus genuinely changes slowly (encyclopedia, reference data).
Near-real-time (minutes)
CDC (change data capture) from source DB → Kafka → indexing service → write to in-memory segment → flush to disk every N seconds → searchable.
Lucene/Elasticsearch's NRT model: writes go into an in-memory segment. Every 1-30 seconds, the segment is "refreshed" - flushed to a searchable file. Documents are queryable within the refresh interval.
Trade-off: shorter refresh = better freshness, more segments, more merge work, more memory churn. 1-second refresh is the sweet spot for most apps; default is often 1-5 seconds.
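A toy model of the refresh cycle: writes land in a mutable buffer and become visible only when the buffer is frozen into a new immutable segment. Durability (the translog/fsync side) is deliberately omitted, and the class and method names are illustrative:

```python
import threading, time

class NRTIndex:
    """Writes go to a mutable buffer; a background refresh freezes the buffer into
    an immutable segment every `refresh_interval` seconds, making those docs searchable."""
    def __init__(self, refresh_interval=1.0):
        self.buffer = []                    # not yet searchable
        self.segments = []                  # immutable, searchable
        self.lock = threading.Lock()
        threading.Thread(target=self._refresh_loop, args=(refresh_interval,),
                         daemon=True).start()

    def index(self, doc):
        with self.lock:
            self.buffer.append(doc)         # cheap append; visible after the next refresh

    def _refresh_loop(self, interval):
        while True:
            time.sleep(interval)
            with self.lock:
                if self.buffer:
                    self.segments.append(tuple(self.buffer))   # freeze as a new segment
                    self.buffer = []

    def search(self, predicate):
        # Only frozen segments are visible; docs newer than the last refresh are not.
        return [doc for seg in self.segments for doc in seg if predicate(doc)]
```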
Real-time (sub-second)
The indexing service writes directly to the in-memory segment AND immediately to a per-shard "live" buffer that's queryable. Costs more memory and complicates the read path.
True real-time is rarely needed. Most "real-time" search use cases tolerate 1-2 seconds (social posts, e-commerce listings).
Update vs insert vs delete
Insert: append to the in-memory segment. Easy.
Delete: write a tombstone marker (doc_id is deleted). Filter at query time. Removed from disk on next merge.
Update: delete + insert - the two operations above combined. Until merge, both old and new versions exist on disk; the tombstone hides the old one.
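A toy model of tombstone semantics - deletes and updates never touch existing segments; a tombstone just hides older copies until a merge rewrites them away. Segments are simplified to dicts and the names are illustrative:

```python
class Shard:
    """Toy model: each segment is an immutable {doc_id: doc} dict. A tombstone hides a
    doc_id in every segment that already existed when the delete was issued."""
    def __init__(self):
        self.segments = []          # oldest first; never modified after append
        self.tombstones = {}        # doc_id -> number of segments the tombstone covers

    def insert(self, doc_id, doc):
        self.segments.append({doc_id: doc})    # real engines batch many docs per segment

    def delete(self, doc_id):
        self.tombstones[doc_id] = len(self.segments)

    def update(self, doc_id, new_doc):
        self.delete(doc_id)            # old copy hidden, but still on disk until merge
        self.insert(doc_id, new_doc)   # new copy lands in a fresh segment

    def get(self, doc_id):
        for i in range(len(self.segments) - 1, -1, -1):        # newest segment wins
            if doc_id in self.segments[i]:
                hidden = self.tombstones.get(doc_id, 0) > i    # tombstone covers segment i?
                return None if hidden else self.segments[i][doc_id]
        return None

    def merge(self):
        """Rewrite everything into one segment, physically dropping hidden versions."""
        live = {d: self.get(d) for seg in self.segments for d in seg
                if self.get(d) is not None}
        self.segments, self.tombstones = [live], {}
```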
Bulk indexing
Initial corpus load or full reindex (after schema change). Bulk API: batch 1000s of docs per request. Bypass the normal write path's per-doc overhead. Indexing rate: 10K-100K docs/sec/shard achievable.
Reindexing without downtime
Build the new index in parallel (different alias). Backfill from source. Switch traffic via alias swap. Drop old index. Standard "blue-green" pattern.
Failure recovery
Indexing service crashes mid-batch. Solutions:
- Source is durable (Kafka commits only after successful indexing).
- Per-document idempotency: deduplicate by doc_id during indexing.
- Replay from CDC log on recovery.
Backpressure
If the source publishes faster than indexing can absorb, the queue grows. Monitoring: lag in seconds (events behind real-time). Alerting at >5 minutes lag. Mitigation: shard expansion, faster hardware, or temporarily disable expensive analysis (synonym expansion, ML enrichment).
6. Filters, facets, and the relevance vs precision dance
Real product search isn't pure text matching. Filters narrow the set; facets show the user what filters are available; both must be fast.
Filters (boolean restrictions)
"Price < $50", "color = red", "in stock = true". Implemented via doc values (column-oriented per-document arrays) stored alongside the inverted index.
Query flow: posting list intersection (text terms) AND filter bitmap. Filter bitmaps are pre-computed and cached. Common filters (in-stock=true) are basically free at query time.
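A sketch of that combination step, with Python sets standing in for the cached filter bitmaps (real engines use compressed bitmaps such as roaring bitmaps); the doc ids are made up:

```python
# Filter bitmaps, pre-computed from doc values and cached per shard across queries.
in_stock  = {5, 12, 47, 88, 103, 200}
color_red = {12, 47, 301}

# Doc ids surviving the posting-list intersection for the text part of the query.
text_matches = [5, 12, 47, 103]

# Candidate set = text matches AND each filter bitmap; only survivors get BM25-scored.
candidates = [d for d in text_matches if d in in_stock and d in color_red]
print(candidates)   # [12, 47]
```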
Facets (aggregations over the result set)
"Show me how many results in each color: red (12), blue (5), green (3)". Computed from doc values by scanning the matched doc IDs and aggregating per-field values.
Per-shard: each shard returns its local counts. Coordinator merges (sums per facet). Hot facets are cached.
For high-cardinality facets (1000s of brands), top-N approximation is faster than exact (return the top 50; mark "+ 200 more").
Filter ordering
Apply the most-selective filter first. If "color = red" matches 1% of docs and "in_stock" matches 99%, applying color first prunes the candidate set early. Query planner picks the order using cardinality stats.
Relevance vs precision
Filters are precise (no false positives). Text relevance (BM25) is fuzzy (high score doesn't mean perfect match). Combining them: filter first (precise), score remaining set (fuzzy), return top-K.
Query: "running shoes under $100, size 10".
- Filter: in_stock=true AND price<100 AND size=10. (precise, fast)
- Score: BM25 against "running shoes" on filtered set. (fuzzy, slower)
- Return top 50.
If the filter is too restrictive (no results), the system can suggest relaxing filters ("expand search to include price ≤ $150").
Faceted navigation UX patterns
- Show counts BEFORE the user clicks a filter. Computed live.
- Pre-select common filters (recommended size based on history).
- "Smart" facets: only show facets that meaningfully reduce the result set (don't show "color" if 99% of results are black).
Performance edge case
A query with no text and 5 broad filters can match millions of docs. The cost isn't scoring (no text to score), it's the deep aggregation across all matched docs. Mitigation: cap the result set (return top N matches by some default sort), aggressively cache aggregations.
Trade-offs
BM25 vs learned ranking
BM25 is fast, deterministic, and 80% as good as learned ranking for most queries. Learned ranking buys the last 20% (often the top-1 result) at significant infrastructure cost (training pipeline, feature store, online inference). Worth it for high-stakes search (ads, e-commerce); overkill for internal docs.
Document-based vs term-based sharding
Document-based wins in practice. Term-based has a niche where queries are usually single-term and shard count is high, but the operational complexity (re-sharding, hot terms) sinks it.
Real-time vs near-real-time indexing
Near-real-time (1-second refresh) is the sweet spot for almost all apps. Sub-second freshness costs disproportionate complexity for marginal user-perceived benefit.
Centralized coordinator vs distributed
A single coordinator per query simplifies merge logic but is a SPOF and a latency choke point. Modern systems use any-replica-can-coordinate models (Elasticsearch's coordinating-only nodes are optional). For ultra-low latency, push more logic to clients (smart-client pattern).
Index size vs query speed
More fields indexed = more flexibility, slower queries, larger storage. Less = the opposite. Rule of thumb: index only what you'll filter or score on; store (don't index) everything else for retrieval.
Synonyms and stemming
Aggressive analysis (stem "running" → "run", expand "shoes" → "footwear") increases recall but reduces precision. Tune per-corpus. Bad analysis ruins ranking; good analysis is invisible.
Personalization vs cacheability
Per-user ranking destroys query result caching. Two solutions: (a) cache only the lexical layer and re-rank per user (lighter, but still pays a per-user latency cost); (b) bucket users into ~100 segments and cache per segment (a compromise).
Ranking model freshness
Re-training the ranking model nightly captures evolving relevance signals. More frequent retraining (hourly) catches trending queries faster but risks overfitting to noise. Most production systems retrain daily; trending signals come from a separate online layer.
Common follow-up questions
Be ready for at least three of these. The first one is almost always asked.
- How do you handle a query for a brand-new product that has zero CTR signal?
- What's your strategy when a popular query suddenly returns no results (broken index, deleted product)?
- How would you support semantic search (vector embeddings) on top of this lexical baseline?
- What changes if the corpus becomes 100B documents (100x scale)?
- How do you A/B test ranking model changes safely in production?
- How would you implement search across structured (database) and unstructured (documents) data jointly?
- What's your strategy for query understanding (intent classification, named entity recognition)?
- How do you handle search for languages with no whitespace word boundaries (Chinese, Japanese)?
Practice in interview format
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.