We use cookies for site analytics. Accept to help us understand how the site is used. See our Privacy Policy for details.
Inverted indexes, BM25 ranking, prefix tries, and the p99 < 100ms latency budget that drives every architectural choice.
Design a search system that serves two related queries: (1) full-text search over a document corpus (products, articles, jobs, code), returning ranked results in under 100ms p99; (2) autocomplete suggestions as the user types, returning the top 10 candidates in under 50ms p99. Both must handle continuous index updates as the underlying corpus changes.
This problem is graded on three things: the choice of inverted index structure and how it shards, the ranking pipeline (lexical → learned), and the latency engineering that makes p99 < 100ms achievable. Strong candidates open with the latency budget and design backwards from it.
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
Storage
Throughput
Memory
Network
Autocomplete
The system splits cleanly into three pipelines:
The indexing/query split mirrors the read-write asymmetry seen in many systems but with one crucial difference: search reads aren't simple lookups, they're complex multi-shard scoring operations. The latency budget drives every architectural choice on the read path.
Consumes document changes from the source (Kafka topic, CDC from primary DB). Validates, normalizes, and forwards to the indexing pipeline.
Per-language tokenization (whitespace + Unicode for English; specialized for CJK, Arabic, etc.). Stemming (running → run). Stopword removal. Lowercasing. Synonym expansion.
Builds inverted index segments. Each segment is an immutable Lucene-style file with term dictionary, posting lists, doc values (for filtering/sorting), and stored fields.
Distributes new segments to shards. Manages segment merging (small segments → larger ones for query efficiency). Coordinates replica synchronization.
Receives query, parses, plans execution, fans out to relevant shards, gathers and merges results, applies global re-ranking, returns to client.
Holds a partition of the inverted index. Scores documents against the query (BM25), filters, returns top-K with scores to coordinator.
Optional second pass. Top 100-500 candidates from shards re-scored with a heavier ML model that uses features (user signals, freshness, business rules) too expensive to compute at lexical scoring time.
Independent fleet. Holds a trie / FST of common queries. Returns top-10 completions for a prefix in under 10ms.
Levenshtein-distance lookup against a dictionary. Triggered on low-confidence queries ("did you mean X?"). Often integrated with autocomplete.
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
Naive search scans every document for the query term. At 1B documents, that's impossible. The inverted index flips the relationship: term → list of documents containing the term.
Posting list
For each term, a sorted list of (doc_id, term_frequency, [positions]) tuples. Sorted by doc_id for fast intersection.
For "design system" query:
Intersection of sorted lists is O(N + M) using a two-pointer merge. With skip lists (every Kth entry indexed), it becomes O(N/K + M/K) for queries that prune aggressively.
Term dictionary
Maps term → file offset of the posting list. Implementations:
Compression
Posting lists are huge but compressible. Doc IDs are sorted - delta-encode them (store [5, 7, 35, 36] as [5, +2, +28, +1]). Then variable-byte encoding or PFor encoding compresses small deltas. 5-10x compression ratios are typical.
Segment-based indexing (Lucene model)
Indexes are immutable. New documents go into a new segment (a small in-memory buffer flushed to disk as a complete unit). Periodic background merges combine small segments into larger ones to keep query-time fan-out manageable.
Pros: writes never block reads (new segment is independent). Easy to replicate (segments are files; ship to replicas). Easy to search across segments (each segment is a mini-index).
Cons: deletes are tombstones until the next merge. A document updated 100 times has 100 tombstones until merge.
Why immutability
Mutating posting lists in place under concurrent reads is a synchronization nightmare. Immutable segments + merges traded write amplification for query simplicity. Lucene's been doing it for 20 years; it works.
Scoring is what separates "find documents containing the terms" (trivial) from "rank them by relevance" (the actual product).
TF-IDF (the historical baseline)
score(doc, query) = sum over terms of: TF(term, doc) × IDF(term)
Problems: doesn't account for document length (long docs get high TF artificially). No saturation (a doc with the term 1000 times isn't 1000x as relevant).
BM25 (the modern lexical default)
Refines TF-IDF with two parameters:
BM25 is the default in Elasticsearch, Lucene, Solr. For pure lexical queries on text, it's hard to beat without ML.
Learned-to-rank (the modern winner for product search)
Train a model (gradient boosted trees - XGBoost, LightGBM - or neural ranker) on (query, document, relevance) tuples. Features include:
Training data comes from clickstream logs (clicked docs are positives, skipped ones are negatives). Pairwise loss (RankNet, LambdaMART) learns relative ordering, not absolute scores.
Two-stage ranking (production architecture)
Lexical score (BM25) is fast - run on every shard, return top-K (e.g., top 1000). Re-ranking model is expensive (per-doc inference) - run only on the top-K from coordinator after merge.
Stage 1 (per-shard, cheap): BM25 + filters. Returns top 1000. ~10ms per shard.
Stage 2 (coordinator, expensive): ML re-rank top 100-500. ~30-50ms total.
This balances comprehensiveness (BM25 catches the right candidates) with quality (ML picks the right order).
Personalization
User-specific features change ranking per user. Cache user features in Redis; lookup is sub-millisecond. Avoid putting per-user data into the index itself - it explodes the index size.
1B documents do not fit on one node. Sharding is required. The choice of shard key determines query patterns and operational complexity.
Document-based sharding (the standard)
Each document hashed to a shard by doc_id. All shards store full inverted indexes for their documents. Queries fan out to ALL shards (each shard might have matching docs); coordinator merges.
Pros: even data distribution, simple writes (one shard per doc), easy to add capacity (scale shards).
Cons: every query fans out N times. Latency = max(per-shard latency) + merge cost. Network amplification = N.
Term-based sharding
Each term hashed to a shard. A shard owns the full posting list for terms it owns. Queries hit only the shards containing query terms (potentially 1 shard for a single-term query).
Pros: less fan-out for common queries.
Cons: hot shards on common terms ("a", "the" - though stopwords mitigate). Multi-term queries require shipping posting lists between shards. Re-sharding is brutal.
In practice, document-based wins. The fan-out cost is real but manageable; the alternative is much worse operationally.
Replica strategy
Each shard has 2-3 replicas for availability and read throughput. Queries can be served by any replica - load-balance across them. Indexing writes go to the primary, replicate async.
Routing for tenanted systems
If your search is per-tenant (each customer has their own corpus), shard by tenant_id. Then queries hit only one shard per tenant. Massive savings - no fan-out at all. Only works if no cross-tenant queries are needed.
Adaptive sharding
Hot tenants get their own dedicated shards; small tenants share. Re-balance periodically based on traffic.
Coordinator latency engineering
Coordinator's job: parse query, fan out, gather, merge, return. The merge step is bounded - merging top-1000 from each of 50 shards = 50K candidates → top-100 → ~1ms. The dominant latency is the slowest shard's response (max latency, not average).
To control max latency: speculative replicas (issue the same query to two replicas, take whichever responds first). Costs 2x query load but cuts p99 dramatically. Used by Elasticsearch's "adaptive replica selection".
Autocomplete is a different beast from full-text search. The query is short (1-20 chars), the response is small (top 10), and the latency budget is tight (p99 < 50ms).
Data structure: prefix trie
Each node is a character; paths from root to leaf spell out queries. Lookup of "siri" walks 4 nodes; the subtree under "siri" contains all completions.
For 10M unique queries × 30 chars avg = 300M nodes. ~10-20 bytes/node = 3-6 GB raw. Compressible.
FST (finite state transducer) - the production winner
A trie shares prefixes (good); an FST shares both prefixes AND suffixes. "test", "rest", "best", "fest" all share the "est" suffix. FSTs compress to a small fraction of trie size - tens of MB for 10M queries.
Used by Lucene's suggester, Algolia, Elasticsearch's completion suggester.
Ranking completions
Just listing all prefix matches doesn't work - you need the BEST 10. Each query in the trie has a popularity score (search count). At each node, store the top-K completions in the subtree, pre-sorted by score.
On lookup: walk to the prefix node, read pre-computed top-K. O(prefix length + K). Sub-millisecond.
Updating popularity
Daily batch job ingests query logs from the previous N days. Recomputes popularity scores. Rebuilds the FST. Hot-swaps it on autocomplete nodes.
Real-time popularity (a query suddenly trending) is a separate signal layer - small in-memory hot-trends index updated every few minutes, merged with the static popularity at lookup.
Personalization
Per-user query history lives in a separate index (Redis). On autocomplete request: lookup top static suggestions + top user-historical matches, merge. Users see their own past queries weighted up.
Spell correction integration
If the prefix has no good completions (low scores), try edit-distance lookups. "siiri" → suggest "siri" via Levenshtein-1. Triggered on low-confidence prefixes, not always.
Multi-language
One FST per language. Detect user's language from request headers / profile. Some apps run cross-language autocomplete (transliteration: typing "shukriya" suggests "thank you" in English).
Latency engineering
The autocomplete fleet is geographically distributed - serve from the user's region. FSTs are small enough to replicate everywhere. No fan-out, no coordinator, no merge - just trie traversal. Sub-10ms is achievable.
Updates to the corpus must reach the search index. The freshness target drives the pipeline architecture.
Batch (hours)
Source DB → nightly ETL → rebuild full index → swap. Simple, robust, terrible freshness. Use only if your corpus genuinely changes slowly (encyclopedia, reference data).
Near-real-time (minutes)
CDC (change data capture) from source DB → Kafka → indexing service → write to in-memory segment → flush to disk every N seconds → searchable.
Lucene/Elasticsearch's NRT model: writes go into an in-memory segment. Every 1-30 seconds, the segment is "refreshed" - flushed to a searchable file. Documents are queryable within the refresh interval.
Trade-off: shorter refresh = better freshness, more segments, more merge work, more memory churn. 1-second refresh is the sweet spot for most apps; default is often 1-5 seconds.
Real-time (sub-second)
The indexing service writes directly to the in-memory segment AND immediately to a per-shard "live" buffer that's queryable. Costs more memory and complicates the read path.
True real-time is rarely needed. Most "real-time" search use cases tolerate 1-2 seconds (social posts, e-commerce listings).
Update vs insert vs delete
Insert: append to the in-memory segment. Easy.
Delete: write a tombstone marker (doc_id is deleted). Filter at query time. Removed from disk on next merge.
Update: delete + insert. Same as the two above. Until merge, both old and new versions exist; tombstone hides the old.
Bulk indexing
Initial corpus load or full reindex (after schema change). Bulk API: batch 1000s of docs per request. Bypass the normal write path's per-doc overhead. Indexing rate: 10K-100K docs/sec/shard achievable.
Reindexing without downtime
Build the new index in parallel (different alias). Backfill from source. Switch traffic via alias swap. Drop old index. Standard "blue-green" pattern.
Failure recovery
Indexing service crashes mid-batch. Solutions:
Backpressure
If the source publishes faster than indexing can absorb, the queue grows. Monitoring: lag in seconds (events behind real-time). Alerting at >5 minutes lag. Mitigation: shard expansion, faster hardware, or temporarily disable expensive analysis (synonym expansion, ML enrichment).
Real product search isn't pure text matching. Filters narrow the set; facets show the user what filters are available; both must be fast.
Filters (boolean restrictions)
"Price < $50", "color = red", "in stock = true". Implemented via doc values (column-oriented per-document arrays) stored alongside the inverted index.
Query flow: posting list intersection (text terms) AND filter bitmap. Filter bitmaps are pre-computed and cached. Common filters (in-stock=true) are basically free at query time.
Facets (aggregations over the result set)
"Show me how many results in each color: red (12), blue (5), green (3)". Computed from doc values by scanning the matched doc IDs and aggregating per-field values.
Per-shard: each shard returns its local counts. Coordinator merges (sums per facet). Hot facets are cached.
For high-cardinality facets (1000s of brands), top-N approximation is faster than exact (return the top 50; mark "+ 200 more").
Filter ordering
Apply the most-selective filter first. If "color = red" matches 1% of docs and "in_stock" matches 99%, applying color first prunes the candidate set early. Query planner picks the order using cardinality stats.
Relevance vs precision
Filters are precise (no false positives). Text relevance (BM25) is fuzzy (high score doesn't mean perfect match). Combining them: filter first (precise), score remaining set (fuzzy), return top-K.
Query: "running shoes under $100, size 10".
If the filter is too restrictive (no results), the system can suggest relaxing filters ("expand search to include price ≤ $150").
Faceted navigation UX patterns
Performance edge case
A query with no text and 5 broad filters can match millions of docs. The cost isn't scoring (no text to score), it's the deep aggregation across all matched docs. Mitigation: cap the result set (return top N matches by some default sort), aggressively cache aggregations.
BM25 vs learned ranking
BM25 is fast, deterministic, and 80% as good as learned ranking for most queries. Learned ranking buys the last 20% (often the top-1 result) at significant infrastructure cost (training pipeline, feature store, online inference). Worth it for high-stakes search (ads, e-commerce); overkill for internal docs.
Document-based vs term-based sharding
Document-based wins in practice. Term-based has a niche where queries are usually single-term and shard count is high, but the operational complexity (re-sharding, hot terms) sinks it.
Real-time vs near-real-time indexing
Near-real-time (1-second refresh) is the sweet spot for almost all apps. Sub-second freshness costs disproportionate complexity for marginal user-perceived benefit.
Centralized coordinator vs distributed
A single coordinator per query simplifies merge logic but is a SPOF and a latency choke point. Modern systems use any-replica-can-coordinate models (Elasticsearch's coordinating-only nodes are optional). For ultra-low latency, push more logic to clients (smart-client pattern).
Index size vs query speed
More fields indexed = more flexibility, slower queries, larger storage. Less = the opposite. Rule of thumb: index only what you'll filter or score on; store (don't index) everything else for retrieval.
Synonyms and stemming
Aggressive analysis (stem "running" → "run", expand "shoes" → "footwear") increases recall but reduces precision. Tune per-corpus. Bad analysis ruins ranking; good analysis is invisible.
Personalization vs cacheability
Per-user ranking destroys query result caching. Two solutions: (a) cache only the lexical layer; re-rank per user (lighter but still per-user latency cost), (b) bucket users into ~100 segments; cache per-segment (compromise).
Ranking model freshness
Re-training the ranking model nightly captures evolving relevance signals. More frequent retraining (hourly) catches trending queries faster but risks overfitting to noise. Most production systems retrain daily; trending signals come from a separate online layer.
Be ready for at least three of these. The first one is almost always asked.
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
Start an AI mock interview →