gitGood.dev

The Complete Guide to RAG Interview Questions (2026)

Patrick Wilson
35 min read

RAG - Retrieval-Augmented Generation - has gone from "cool research technique" to "thing every ML team is building" in about 18 months. If you're interviewing for any role that touches AI/ML, LLMs, or even backend engineering at an AI company, you're going to get RAG questions. Not theoretical ones - practical, "have you actually built this" questions.

This guide covers the 33 questions you're most likely to face, organized from fundamentals through production systems. Each answer focuses on what interviewers are actually looking for, not textbook definitions. If you want to practice these in a mock interview setting, check out our AI mock interviews to get real-time feedback.


Section 1: RAG Fundamentals

Q1: What is RAG and why does it exist?

RAG is a pattern where you retrieve relevant documents from an external knowledge base and include them in the prompt to an LLM. Instead of relying solely on what the model memorized during training, you give it fresh, specific context at inference time.

It exists because LLMs have three fundamental problems:

  • Knowledge cutoff - They don't know about anything after their training date
  • Hallucination - They confidently make things up when they don't know
  • No private data - They can't answer questions about your company's internal docs

RAG solves all three by grounding the model's responses in actual retrieved documents.

The real interview answer: Don't just define RAG. Explain the problem it solves. Interviewers want to hear that you understand why this pattern exists, not just what it is. Mention the tradeoff - RAG adds latency and complexity, but it gives you control over what the model knows.

Q2: How does a basic RAG pipeline work end-to-end?

A basic RAG pipeline has two phases:

Indexing (offline):

  1. Load documents from your data source
  2. Split them into chunks
  3. Generate embeddings for each chunk
  4. Store chunks + embeddings in a vector database

Query (online):

  1. User asks a question
  2. Generate an embedding for the question
  3. Search the vector database for similar chunks
  4. Stuff the top-k chunks into the LLM prompt as context
  5. LLM generates an answer grounded in those chunks
# Minimal RAG pipeline with LangChain
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Indexing
loader = TextLoader("docs.txt")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Querying
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
answer = qa_chain.invoke({"query": "What is our refund policy?"})

Q3: RAG vs. fine-tuning - when do you use each?

This is maybe the most common RAG interview question. Here's the decision framework:

| Factor | RAG | Fine-tuning |
| --- | --- | --- |
| Data changes frequently | Yes | No |
| Need citations/sources | Yes | Hard |
| Small dataset (<100 docs) | Yes | Not enough data |
| Need specific tone/style | Possible | Better |
| Latency-critical | Slower | Faster at inference |
| Cost to update | Low (re-index) | High (retrain) |
| Hallucination control | Better (grounded) | Still hallucinates |

The real interview answer: The best answer is "it depends, and often you use both." Fine-tuning teaches the model how to behave (tone, format, reasoning style). RAG teaches it what to know (facts, data, documents). A production system might fine-tune for domain-specific language and use RAG for up-to-date knowledge.

Q4: What are the failure modes of RAG?

This question tests whether you've actually built RAG systems. Common failures:

  1. Retrieval failure - The right document exists but isn't retrieved (bad chunking, bad embeddings, or the query doesn't match the document's language)
  2. Context window overflow - Too many chunks stuffed into the prompt, causing the model to lose focus
  3. Lost in the middle - LLMs pay more attention to the beginning and end of context, ignoring relevant info in the middle
  4. Stale index - Documents were updated but the vector store wasn't re-indexed
  5. Chunk boundary problems - The answer spans two chunks, and only one was retrieved
  6. Wrong abstraction level - User asks a high-level question, retrieval returns low-level implementation details

Q5: What is the "naive RAG" pattern and what are its limitations?

Naive RAG is the simplest implementation: embed query, find top-k similar chunks, stuff them into the prompt. It's what most tutorials teach.

Limitations:

  • No query understanding - Treats every query the same way regardless of intent
  • Single retrieval step - One shot to find the right context
  • No answer validation - No check on whether the retrieved docs actually support the answer
  • Fixed top-k - Always retrieves the same number of chunks regardless of query complexity
  • No source diversity - Might return k chunks from the same document section

This is why "advanced RAG" patterns exist - they add query rewriting, reranking, iterative retrieval, and answer validation on top of the basic pipeline.


Section 2: Chunking Strategies

Q6: Why does chunking matter and what happens if you get it wrong?

Chunking is how you break documents into pieces for embedding and retrieval. It matters because:

  • Too large: Chunks contain too much irrelevant info, diluting the embedding and wasting context window space
  • Too small: Chunks lose context - a sentence alone might be meaningless without its surrounding paragraph
  • Bad boundaries: Cutting mid-sentence or mid-thought creates chunks that are hard to embed meaningfully

The embedding model tries to capture the "meaning" of each chunk in a single vector. If your chunk is a grab-bag of unrelated ideas, that vector will be mediocre at representing any of them.

Q7: Explain the main chunking strategies and their tradeoffs.

Fixed-size chunking:

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator="\n"
)
  • Pros: Simple, predictable chunk sizes, fast
  • Cons: Ignores document structure, cuts mid-thought
  • Use when: Quick prototyping, uniform text (logs, transcripts)

Recursive character splitting:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
  • Pros: Tries to respect natural boundaries (paragraphs, sentences)
  • Cons: Still size-based, not truly semantic
  • Use when: General-purpose text, good default choice

Semantic chunking:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
  • Pros: Groups semantically related content together
  • Cons: Slower (requires embedding computation), variable chunk sizes
  • Use when: Documents with mixed topics, Q&A content

Document-aware chunking:

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
  • Pros: Respects document structure, preserves hierarchy
  • Cons: Requires structured input, chunks can be very uneven in size
  • Use when: Markdown docs, HTML pages, code files

The real interview answer: Start with recursive character splitting (it's the best default). Move to semantic or document-aware chunking only when you have evidence that chunk quality is hurting retrieval. Mention that chunk size is a hyperparameter you tune based on your embedding model and use case.

Q8: What is chunk overlap and why is it important?

Chunk overlap means consecutive chunks share some text at their boundaries. If chunk 1 is tokens 0-500 and chunk 2 is tokens 450-950, you have 50 tokens of overlap.

Why it matters:

  • Prevents information loss at chunk boundaries
  • If an answer spans the boundary between two chunks, the overlap ensures at least one chunk contains the full context
  • Typical overlap is 10-20% of chunk size

The tradeoff: more overlap means more chunks (more storage, more embedding cost) and potential duplicate retrieval. Too little overlap means boundary information falls through the cracks.
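The boundary arithmetic can be sketched with a toy fixed-size chunker (plain Python, with integer token IDs standing in for real tokens):

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Fixed-size chunks where consecutive chunks share `overlap` tokens."""
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + chunk_size])
        if i + chunk_size >= len(tokens):  # last chunk already reaches the end
            break
    return chunks

tokens = list(range(950))  # token IDs standing in for real tokens
chunks = chunk_with_overlap(tokens, chunk_size=500, overlap=50)
# chunk 1 covers tokens 0-499, chunk 2 covers tokens 450-949
print(len(set(chunks[0]) & set(chunks[1])))  # 50 shared boundary tokens
```

A sentence that straddles token 500 appears intact in chunk 2, which is exactly the failure the overlap buys you out of.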

Q9: How do you choose the right chunk size?

There's no universal answer, but here's the framework:

  1. Match your embedding model - Most embedding models have a sweet spot. For models like text-embedding-3-small, chunks of 256-512 tokens work well. Larger models like text-embedding-3-large can handle up to 1024 tokens effectively.

  2. Match your content type:

    • Short factual content (FAQs): 100-300 tokens
    • Technical documentation: 300-500 tokens
    • Long-form narrative: 500-1000 tokens
    • Code: Function-level or class-level splitting
  3. Match your query style:

    • Specific factual questions: Smaller chunks
    • Complex analytical questions: Larger chunks
  4. Empirical testing - The real answer is always "test it." Create a benchmark set of questions with known answers, try different chunk sizes, measure retrieval recall.

# Quick benchmark for chunk size selection
chunk_sizes = [256, 512, 1024]
results = {}

for size in chunk_sizes:
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 10)
    chunks = splitter.split_documents(docs)
    vectorstore = FAISS.from_documents(chunks, embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    
    hits = 0
    for q, expected_doc in test_set:
        retrieved = retriever.invoke(q)
        if any(expected_doc in doc.page_content for doc in retrieved):
            hits += 1
    results[size] = hits / len(test_set)

print(results)  # {256: 0.72, 512: 0.85, 1024: 0.78}

Q10: How would you chunk code differently from prose?

Code has structure that plain text splitters destroy. Good approaches:

  1. AST-based splitting - Parse the code into an abstract syntax tree, split on function/class boundaries
  2. Language-aware splitting - Use tree-sitter or similar to understand code structure
  3. Docstring-enriched chunks - Include the function signature + docstring even if the chunk is the function body
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)

# This understands Python constructs - splits on class/function boundaries
chunks = python_splitter.split_documents(code_docs)

Key insight for interviews: Always preserve the function signature and any imports with the chunk. A chunk containing just a function body with no name or parameters is nearly useless for retrieval.


Section 3: Embeddings

Q11: What are embeddings and how do they enable RAG?

Embeddings are dense vector representations of text that capture semantic meaning. Similar texts produce similar vectors, which lets you find relevant documents even when they don't share exact keywords.

In RAG, embeddings serve as the bridge between natural language queries and stored documents. You embed both your chunks and your query into the same vector space, then find the nearest neighbors.

Key properties:

  • Dimensionality typically ranges from 384 to 3072
  • Higher dimensions can capture more nuance but cost more to store and search
  • The embedding model determines the quality of your entire RAG system - garbage embeddings mean garbage retrieval

Q12: Compare the major embedding models.

| Model | Dimensions | Context | Best For |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | 8191 tokens | General purpose, cost-effective |
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | High accuracy needs |
| Cohere embed-v3 | 1024 | 512 tokens | Multilingual, search |
| BGE-large-en-v1.5 | 1024 | 512 tokens | Open-source, self-hosted |
| Voyage-3 | 1024 | 32K tokens | Long context, code |
| Nomic embed-text-v1.5 | 768 | 8192 tokens | Open-source, Matryoshka |
| GTE-Qwen2 | 1536 | 32K tokens | Open-source, multilingual |

The real interview answer: Don't just list models. Talk about the tradeoffs: API-based vs self-hosted (latency, cost, privacy), dimension size vs accuracy, and the importance of choosing a model that was trained on data similar to your domain. Mention that you'd benchmark on your actual data using MTEB or a custom eval set.

Q13: What similarity metrics are used and when?

Cosine similarity - Measures the angle between vectors. Ranges from -1 to 1. Most common choice for normalized embeddings.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Dot product (inner product) - Faster than cosine if vectors are already normalized (which most embedding models output). If normalized, dot product equals cosine similarity.

Euclidean distance (L2) - Measures straight-line distance. Lower is more similar. Less common for text embeddings but used in some FAISS configurations.

When to use what:

  • Cosine similarity: Default choice, works with any embeddings
  • Dot product: When using normalized embeddings (faster)
  • Euclidean: When magnitude matters (rare for text RAG, more common in recommendation systems)
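A quick NumPy check of the normalization claim - on unit-length vectors, the cheaper dot product and cosine similarity coincide:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Normalize to unit length (what most embedding APIs already return)
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

# On unit vectors, dot product equals cosine similarity
print(np.isclose(np.dot(a_hat, b_hat), cosine_similarity(a, b)))  # True
```

This is why vector databases often default to inner-product search when they know the embeddings are normalized.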

Q14: What is Matryoshka representation learning and why does it matter for RAG?

Matryoshka embeddings (named after Russian nesting dolls) are trained so that the first N dimensions of the full embedding are themselves a valid, useful embedding. You can truncate a 1536-dim vector to 512 or even 256 dimensions and still get reasonable similarity search results.

Why it matters for RAG:

  • Cost optimization - Store smaller vectors, use less memory, faster search
  • Tiered retrieval - Use low-dim embeddings for initial broad search, full-dim for reranking
  • Flexibility - One model, multiple accuracy/cost tradeoffs
from openai import OpenAI

client = OpenAI()

# Generate full embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is our refund policy?",
    dimensions=512  # Truncate to 512 dims (from 1536)
)
# Still works well for similarity search, but 3x cheaper to store

Q15: How do you handle embedding drift and model updates?

When you upgrade your embedding model, old and new embeddings live in incompatible vector spaces. You can't mix them.

Strategies:

  1. Full re-indexing - Re-embed all documents with the new model. Simple but expensive for large corpora.
  2. Shadow indexing - Build a new index alongside the old one, switch over when ready.
  3. Versioned indices - Keep multiple indices tagged with the model version, query the right one.
  4. Alignment layers - Train a small transformation to map old embeddings to the new space (research-grade, not common in production).

The real interview answer: Full re-indexing is almost always the right answer. The cost of re-embedding is small compared to the cost of running a degraded retrieval system. Mention that you'd automate this as part of your model update pipeline.
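Shadow and versioned indexing (strategies 2 and 3) can be combined in a tiny router. This is an illustrative sketch with stub index objects; the class and method names are made up:

```python
class VersionedIndexRouter:
    """Keep one index per embedding-model version so vectors from
    incompatible spaces are never mixed."""

    def __init__(self):
        self.indices = {}
        self.active_version = None

    def register(self, version, index):
        self.indices[version] = index
        if self.active_version is None:
            self.active_version = version  # first registered index serves traffic

    def search(self, query_embedding, version=None):
        return self.indices[version or self.active_version].search(query_embedding)

class StubIndex:
    def __init__(self, name):
        self.name = name
    def search(self, query_embedding):
        return f"results from {self.name}"

router = VersionedIndexRouter()
router.register("text-embedding-3-small", StubIndex("v1 index"))
router.register("text-embedding-3-large", StubIndex("v2 index"))  # shadow index

print(router.search([0.1, 0.2]))                  # still served by v1
router.active_version = "text-embedding-3-large"  # cut over once validated
print(router.search([0.1, 0.2]))                  # now served by v2
```

The key invariant: a query embedded with model X is only ever compared against an index built with model X.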


Section 4: Vector Databases

Q16: Compare the major vector database options.

Pinecone (managed):

  • Fully managed, serverless option available
  • Great for teams that don't want to manage infrastructure
  • Supports metadata filtering, namespaces
  • Pricing can be expensive at scale

Weaviate (self-hosted or cloud):

  • Supports hybrid search (vector + keyword) natively
  • Built-in module ecosystem (vectorizers, rerankers)
  • GraphQL API
  • Good for complex filtering requirements

pgvector (PostgreSQL extension):

  • Adds vector search to your existing Postgres
  • No new infrastructure - lives where your data already is
  • HNSW and IVFFlat index types
  • Best for teams already on Postgres with moderate scale

FAISS (library, not a database):

  • Facebook's similarity search library
  • Extremely fast, runs in-memory
  • No persistence, no metadata filtering out of the box
  • Best for prototyping or embedded in a larger system

Qdrant (self-hosted or cloud):

  • Rust-based, very performant
  • Rich filtering with payload indexes
  • Good hybrid search support
  • Growing rapidly in the ecosystem

Chroma (embedded):

  • SQLite-based, runs in-process
  • Perfect for prototyping and small projects
  • Not designed for production scale
  • Very easy to get started
# pgvector - using your existing Postgres
import psycopg2

conn = psycopg2.connect("postgresql://localhost/mydb")
cur = conn.cursor()

# Enable the extension
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")

# Create table with vector column
cur.execute("""
    CREATE TABLE documents (
        id SERIAL PRIMARY KEY,
        content TEXT,
        embedding vector(1536),
        metadata JSONB
    )
""")

# Create HNSW index for fast search
cur.execute("""
    CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")

# Query
cur.execute("""
    SELECT content, 1 - (embedding <=> %s::vector) AS similarity
    FROM documents
    ORDER BY embedding <=> %s::vector
    LIMIT 5
""", (query_embedding, query_embedding))

The real interview answer: The choice depends on your constraints. If you're already on Postgres and have <10M vectors, pgvector is the no-brainer. If you want zero ops, Pinecone. If you need hybrid search and complex filtering, Weaviate or Qdrant. FAISS is for prototypes. Name your constraints first, then pick the tool.

Q17: What indexing algorithms do vector databases use?

HNSW (Hierarchical Navigable Small World):

  • Most popular algorithm in production
  • Builds a multi-layer graph of vectors
  • O(log n) search time, high recall
  • Tradeoff: High memory usage (stores the graph structure)
  • Key params: M (connections per node), ef_construction (build quality), ef_search (search quality)

IVF (Inverted File Index):

  • Partitions vectors into clusters using k-means
  • At query time, only searches nearby clusters
  • Lower memory than HNSW, but lower recall
  • Key params: nlist (number of clusters), nprobe (clusters to search)

Product Quantization (PQ):

  • Compresses vectors by splitting into sub-vectors and quantizing each
  • Dramatic memory reduction (often 10-50x)
  • Some loss in accuracy
  • Often combined with IVF (IVF-PQ)

Flat (brute force):

  • Exact nearest neighbor search
  • O(n) - checks every vector
  • Perfect recall, but doesn't scale
  • Use for small datasets or as a baseline
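The IVF probe mechanics can be sketched in a few lines of NumPy. This is a toy, not a real index: randomly chosen vectors stand in for k-means-trained centroids, and everything else is brute force within the probed clusters:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64)).astype(np.float32)

# "Training": pick nlist vectors as centroids (real IVF runs k-means)
nlist, nprobe = 16, 4
centroids = vectors[rng.choice(len(vectors), nlist, replace=False)]
# Assign each vector to its nearest centroid's cluster
assignments = np.argmin(
    np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1
)

def ivf_search(query, k=5):
    # Probe only the nprobe closest clusters instead of all 1000 vectors
    probed = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.where(np.isin(assignments, probed))[0]
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

query = vectors[42]
print(ivf_search(query)[0])  # 42 - the query's own vector is found
```

Raising `nprobe` trades speed for recall: probe every cluster and you are back to exact flat search.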

Q18: How does metadata filtering work in vector databases and why is it important?

Metadata filtering lets you constrain vector search results based on structured attributes. Instead of just "find the 5 most similar vectors," you can say "find the 5 most similar vectors where department='engineering' and date > '2025-01-01'."

Two approaches:

  1. Pre-filtering - Filter first, then search. Fast but can miss good results if the filter is too restrictive.
  2. Post-filtering - Search first, then filter. Better recall but wasteful - you might retrieve 100 results only to throw away 95.
# Pinecone metadata filtering
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "department": {"$eq": "engineering"},
        "date": {"$gte": "2025-01-01"},
        "doc_type": {"$in": ["policy", "handbook"]}
    }
)

Why it matters: In production RAG, you almost always need filtering. Multi-tenant systems need tenant isolation. Document access controls need permission filtering. Time-sensitive data needs date filtering. A vector database without good metadata filtering is a toy.

Q19: How do you handle vector database scaling?

Key scaling dimensions:

  1. Sharding - Distribute vectors across multiple nodes. Most managed services handle this automatically.
  2. Replicas - Multiple copies for read throughput and availability.
  3. Tiered storage - Hot vectors in memory, warm vectors on SSD, cold vectors archived.
  4. Dimensionality reduction - Use Matryoshka embeddings or PCA to shrink vectors.
  5. Quantization - Compress vectors (scalar quantization, product quantization) to reduce memory footprint.

Practical numbers to know:

  • 1M vectors at 1536 dimensions (float32) = ~6 GB raw
  • HNSW index overhead typically adds 1.5-2x
  • pgvector is comfortable up to ~5-10M vectors on a single node
  • Pinecone serverless can handle billions
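The sizing numbers above follow from simple arithmetic (float32 means 4 bytes per dimension):

```python
def raw_vector_bytes(n_vectors, dims, bytes_per_float=4):
    """Raw storage for float32 vectors, before index overhead."""
    return n_vectors * dims * bytes_per_float

raw = raw_vector_bytes(1_000_000, 1536)
print(f"raw: {raw / 1e9:.2f} GB")  # raw: 6.14 GB
print(f"with HNSW overhead (~1.5-2x): {raw * 1.5 / 1e9:.2f}-{raw * 2 / 1e9:.2f} GB")
# Matryoshka truncation to 512 dims cuts storage 3x
print(f"at 512 dims: {raw_vector_bytes(1_000_000, 512) / 1e9:.2f} GB")
```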

Section 5: Retrieval Strategies

Q20: What is the difference between dense and sparse retrieval?

Dense retrieval - Uses embedding vectors. Captures semantic meaning. Can find relevant documents even without keyword overlap. This is what most people think of when they hear "RAG."

Sparse retrieval - Uses traditional keyword-based methods like BM25 or TF-IDF. Represents documents as sparse vectors where each dimension corresponds to a word in the vocabulary. Excellent at exact keyword matching.

# Dense retrieval
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
dense_results = vectorstore.similarity_search("Python async patterns", k=5)

# Sparse retrieval with BM25
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 5
sparse_results = bm25_retriever.invoke("Python async patterns")

The key difference: Dense retrieval understands that "automobile" and "car" are the same concept. Sparse retrieval only matches if the exact word appears. But sparse retrieval is better when exact terms matter - product IDs, error codes, proper nouns.

Q21: What is hybrid search and why is it the industry standard?

Hybrid search combines dense (semantic) and sparse (keyword) retrieval, then merges the results. It's the industry standard because neither approach alone is sufficient.

The most common merging strategy is Reciprocal Rank Fusion (RRF):

def reciprocal_rank_fusion(results_lists, k=60):
    """
    Merge multiple ranked result lists using RRF.
    k is a constant that controls how much weight is given to lower-ranked results.
    """
    scores = {}
    for results in results_lists:
        for rank, doc in enumerate(results):
            doc_id = doc.metadata["id"]
            if doc_id not in scores:
                scores[doc_id] = 0
            scores[doc_id] += 1 / (rank + k)
    
    sorted_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_docs

# Combine dense and sparse results
dense_results = vectorstore.similarity_search(query, k=20)
sparse_results = bm25_retriever.invoke(query)
merged = reciprocal_rank_fusion([dense_results, sparse_results])

Some vector databases support hybrid search natively:

# Weaviate hybrid search
results = client.query.get("Document", ["content"]) \
    .with_hybrid(query="Python async patterns", alpha=0.7) \
    .with_limit(5) \
    .do()
# alpha=1.0 is pure vector, alpha=0.0 is pure keyword

The real interview answer: Always mention hybrid search. It's a red flag if a candidate only talks about vector similarity. Real production systems almost always use hybrid because there are always queries where keywords matter more than semantics.

Q22: What is reranking and how does it improve retrieval?

Reranking is a two-stage retrieval pattern:

  1. First stage - Retrieve a larger candidate set (e.g., top 20-50) using fast but less precise methods
  2. Second stage - Use a more powerful model to re-score and reorder those candidates

The reranker is typically a cross-encoder - it takes the query and each document as a pair and produces a relevance score. This is much more accurate than embedding similarity but too slow to run on your entire corpus.

# Using Cohere reranker
import cohere

co = cohere.Client("your-api-key")

# First stage: retrieve 20 candidates
candidates = vectorstore.similarity_search(query, k=20)

# Second stage: rerank to get top 5
rerank_results = co.rerank(
    query=query,
    documents=[doc.page_content for doc in candidates],
    top_n=5,
    model="rerank-english-v3.0"
)

# Use the reranked results
final_docs = [candidates[r.index] for r in rerank_results.results]

Why it works: Bi-encoders (embedding models) encode query and document independently - they can't model the interaction between them. Cross-encoders see both together, so they can understand nuanced relevance. The tradeoff is speed: cross-encoders are 100-1000x slower per document.

Q23: What is query transformation and why is it useful?

Query transformation modifies the user's query before retrieval to improve results. Types:

Query rewriting - Rephrase the query to better match how information is stored:

from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

rewrite_prompt = ChatPromptTemplate.from_template(
    "Rewrite this question to be more specific and search-friendly. "
    "Original: {question}\nRewritten:"
)
rewriter = rewrite_prompt | ChatOpenAI(model="gpt-4o-mini")
better_query = rewriter.invoke({"question": "how do I fix the thing"}).content

HyDE (Hypothetical Document Embeddings) - Generate a hypothetical answer, then search for documents similar to that answer:

hyde_prompt = ChatPromptTemplate.from_template(
    "Write a short passage that would answer this question: {question}"
)
hypothetical_doc = (hyde_prompt | ChatOpenAI()).invoke({"question": query})
# Embed the hypothetical doc instead of the query
results = vectorstore.similarity_search(hypothetical_doc.content, k=5)

Multi-query - Generate multiple versions of the query and retrieve for each:

from langchain.retrievers.multi_query import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=ChatOpenAI(model="gpt-4o-mini"),
)
# Generates 3 query variants, retrieves for each, deduplicates
results = retriever.invoke("What causes high latency in our API?")

Step-back prompting - Ask a more general question first to get broader context:

  • Original: "Why did our API latency spike on March 3rd?"
  • Step-back: "What are common causes of API latency spikes?"
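A minimal sketch of the step-back pattern: retrieve for both the general and the original question, then merge with the broad context first. The `llm` and `retriever` callables here are stubs standing in for real components:

```python
STEP_BACK_PROMPT = (
    "Given the question below, write a more general step-back question "
    "about the underlying concepts.\nQuestion: {question}\nStep-back question:"
)

def step_back_retrieve(question, llm, retriever, k=4):
    """Retrieve for the step-back question and the original question,
    then merge with broad context first and specifics after."""
    step_back_q = llm(STEP_BACK_PROMPT.format(question=question))
    merged, seen = [], set()
    for doc in retriever(step_back_q, k) + retriever(question, k):
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged

# Stub llm/retriever to show the flow; swap in real ones
llm = lambda prompt: "What are common causes of API latency spikes?"
retriever = lambda q, k: [f"doc about: {q}"]
docs = step_back_retrieve("Why did our API latency spike on March 3rd?", llm, retriever)
print(docs[0])  # the broad, step-back context comes first
```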

Q24: What is contextual retrieval and how does it improve chunk relevance?

Contextual retrieval (popularized by Anthropic) addresses a core RAG problem: chunks lose context when separated from their document. A chunk saying "The quarterly revenue was $4.2B" is useless without knowing which company and which quarter.

The technique: Before embedding, prepend each chunk with a short context blurb generated by an LLM that explains where this chunk fits in the larger document.

from anthropic import Anthropic

client = Anthropic()

def add_context(chunk_text, full_document):
    """Generate contextual prefix for a chunk."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Here is a document:
<document>
{full_document}
</document>

Here is a chunk from that document:
<chunk>
{chunk_text}
</chunk>

Give a short (2-3 sentence) context that situates this chunk within 
the overall document. Focus on what a reader would need to know to 
understand this chunk in isolation."""
        }]
    )
    context = response.content[0].text
    return f"{context}\n\n{chunk_text}"

# Now embed the contextualized chunk
contextualized_chunk = add_context(chunk, full_doc)
embedding = embed_model.encode(contextualized_chunk)

Anthropic reported that contextual embeddings reduce top-20 retrieval failure rates by roughly 35% (49% when combined with BM25 hybrid search), at the cost of additional LLM calls during indexing - which runs offline, so the extra cost is usually acceptable.


Section 6: Advanced RAG Patterns

Q25: What is multi-hop RAG and when do you need it?

Multi-hop RAG handles questions that require information from multiple documents that need to be connected through reasoning. A single retrieval step can't answer these.

Example: "Which team lead manages the engineer who wrote the authentication service?"

  • Hop 1: Find who wrote the authentication service (answer: Alice)
  • Hop 2: Find Alice's team lead (answer: Bob)
def multi_hop_rag(question, vectorstore, llm, max_hops=3):
    context = []
    current_query = question

    for hop in range(max_hops):
        # Retrieve relevant docs for the current sub-query
        docs = vectorstore.similarity_search(current_query, k=3)
        context.extend(docs)
        context_text = "\n".join(d.page_content for d in context)

        # Ask LLM if we have enough info to answer
        check = llm.invoke(
            f"Given this context: {context_text}\n\n"
            f"Can you answer: {question}\n\n"
            f"If yes, provide the answer. If no, what specific information "
            f"do you still need? Respond with either ANSWER: <answer> or "
            f"NEED: <what you need to search for>"
        )

        if check.content.startswith("ANSWER:"):
            return check.content
        # Extract the next sub-query from the NEED: response
        current_query = check.content.replace("NEED:", "").strip()

    # Fall back to best-effort answer with collected context
    return llm.invoke(f"Context: {context_text}\nQuestion: {question}")

Q26: What is self-RAG and how does it improve answer quality?

Self-RAG (Self-Reflective RAG) adds a critique step where the model evaluates its own retrieval and generation. It decides:

  1. Do I need retrieval? - Some questions don't need external context
  2. Are the retrieved docs relevant? - Filter out irrelevant results before generation
  3. Is my answer supported? - Check if the generated answer is grounded in the retrieved docs
  4. Is my answer useful? - Rate the overall quality
def self_rag(question, vectorstore, llm):
    # Step 1: Does this need retrieval?
    need_retrieval = llm.invoke(
        f"Does this question require looking up external information, "
        f"or can it be answered from general knowledge?\n"
        f"Question: {question}\n"
        f"Answer RETRIEVE or GENERATE_ONLY"
    )
    
    if "GENERATE_ONLY" in need_retrieval.content:
        return llm.invoke(question)
    
    # Step 2: Retrieve and filter
    docs = vectorstore.similarity_search(question, k=10)
    
    relevant_docs = []
    for doc in docs:
        relevance = llm.invoke(
            f"Is this document relevant to the question?\n"
            f"Question: {question}\n"
            f"Document: {doc.page_content}\n"
            f"Answer RELEVANT or IRRELEVANT"
        )
        if "RELEVANT" in relevance.content:
            relevant_docs.append(doc)
    
    # Step 3: Generate with relevant docs
    context = "\n".join([d.page_content for d in relevant_docs[:5]])
    answer = llm.invoke(
        f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
    )
    
    # Step 4: Check if answer is supported
    support_check = llm.invoke(
        f"Is this answer fully supported by the provided context?\n"
        f"Context: {context}\n"
        f"Answer: {answer.content}\n"
        f"Respond: FULLY_SUPPORTED, PARTIALLY_SUPPORTED, or NOT_SUPPORTED"
    )
    
    if "NOT_SUPPORTED" in support_check.content:
        return llm.invoke(
            f"Context: {context}\n\n"
            f"Answer this question using ONLY the provided context. "
            f"If the context doesn't contain the answer, say so.\n"
            f"Question: {question}"
        )
    
    return answer

The real interview answer: Self-RAG is about making the system aware of its own quality. The key insight is that not every query needs retrieval, not every retrieved doc is relevant, and not every generated answer is faithful. Adding these checkpoints dramatically reduces hallucination.

Q27: What is Corrective RAG (CRAG)?

Corrective RAG adds a verification and correction step after initial retrieval. If the retrieved documents aren't confident enough, it falls back to web search or other knowledge sources.

The flow:

  1. Retrieve documents from vector store
  2. Grade each document for relevance (Correct / Ambiguous / Incorrect)
  3. If all correct - proceed to generation
  4. If ambiguous - refine the query and re-retrieve
  5. If incorrect - fall back to web search or alternative knowledge base

This is particularly useful when your knowledge base is incomplete. Instead of hallucinating or saying "I don't know," the system actively seeks better sources.
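The flow above can be sketched in a few lines, reusing the llm/vectorstore interfaces from the earlier examples. The grading prompt and the web_search fallback function are illustrative assumptions, not a fixed API:

```python
def corrective_rag(question, vectorstore, llm, web_search):
    """Sketch of the CRAG grade-then-correct loop (prompts are illustrative)."""
    docs = vectorstore.similarity_search(question, k=5)

    # Grade each document: CORRECT / AMBIGUOUS / INCORRECT
    grades = []
    for doc in docs:
        grade = llm.invoke(
            f"Grade this document's relevance to the question as "
            f"CORRECT, AMBIGUOUS, or INCORRECT.\n"
            f"Question: {question}\nDocument: {doc.page_content}"
        )
        grades.append(grade.content)

    # Check INCORRECT first - the string "INCORRECT" contains "CORRECT"
    correct = [d for d, g in zip(docs, grades)
               if "INCORRECT" not in g and "CORRECT" in g]

    if correct:
        context = "\n".join(d.page_content for d in correct)
    elif any("AMBIGUOUS" in g for g in grades):
        # Refine the query and re-retrieve
        refined = llm.invoke(f"Rewrite this query to be more specific: {question}")
        docs = vectorstore.similarity_search(refined.content, k=5)
        context = "\n".join(d.page_content for d in docs)
    else:
        # Fall back to web search when the knowledge base has nothing useful
        context = web_search(question)

    return llm.invoke(f"Context: {context}\n\nQuestion: {question}\n\nAnswer:")
```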

Q28: Explain Graph RAG and when it outperforms vector RAG.

Graph RAG uses a knowledge graph instead of (or alongside) a vector store. It models entities and their relationships explicitly.

When Graph RAG wins:

  • Multi-hop questions - "What products does the CEO's former company make?" requires traversing relationships
  • Aggregation queries - "How many engineers report to VP of Engineering?" requires graph traversal
  • Reasoning about relationships - "Are these two concepts related?" is natural for graphs
# Simplified Graph RAG with Neo4j
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def graph_rag_query(question, llm):
    # Step 1: Extract entities from the question
    entities = llm.invoke(
        f"Extract the key entities from this question: {question}\n"
        f"Return as a comma-separated list."
    )
    
    # Step 2: Query the knowledge graph
    with driver.session() as session:
        # Find relevant subgraph around the entities
        result = session.run("""
            MATCH (n)-[r]->(m)
            WHERE n.name IN $entities OR m.name IN $entities
            RETURN n.name, type(r), m.name
            LIMIT 50
        """, entities=entities.content.split(", "))
        
        triples = [(r["n.name"], r["type(r)"], r["m.name"]) for r in result]
    
    # Step 3: Generate answer using graph context
    context = "\n".join([f"{s} -[{p}]-> {o}" for s, p, o in triples])
    return llm.invoke(
        f"Using these relationships:\n{context}\n\nAnswer: {question}"
    )

The hybrid approach - using both vector search for unstructured text and graph queries for structured relationships - is the current state of the art for complex enterprise RAG.

Q29: What is agentic RAG and how does it differ from standard RAG?

Agentic RAG gives an AI agent control over the retrieval process. Instead of a fixed pipeline (retrieve then generate), the agent decides what to retrieve, when, and how.

Key differences from standard RAG:

  • Dynamic tool selection - The agent can choose between vector search, SQL queries, API calls, web search
  • Iterative retrieval - The agent can do multiple retrieval rounds based on what it learns
  • Query planning - The agent breaks complex questions into sub-queries
  • Self-correction - The agent recognizes when results are poor and tries different strategies
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain.tools import Tool
from langchain_openai import ChatOpenAI

# Define retrieval tools (vectorstore, run_sql, and web_search are
# assumed to be defined elsewhere in your application)
tools = [
    Tool(
        name="vector_search",
        description="Search technical documentation using semantic similarity",
        func=lambda q: vectorstore.similarity_search(q, k=5)
    ),
    Tool(
        name="sql_query",
        description="Query structured data (metrics, user counts, etc.)",
        func=lambda q: run_sql(q)
    ),
    Tool(
        name="web_search",
        description="Search the web for recent information not in our docs",
        func=lambda q: web_search(q)
    ),
]

# agent_prompt: a ChatPromptTemplate with an "agent_scratchpad"
# placeholder for intermediate tool-call steps (defined elsewhere)
agent = create_tool_calling_agent(
    llm=ChatOpenAI(model="gpt-4o"),
    tools=tools,
    prompt=agent_prompt,
)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
result = executor.invoke({"input": "Compare our Q1 revenue to the industry average"})
# Agent might: SQL query for Q1 revenue, web search for industry data, then combine

The real interview answer: Agentic RAG is about moving from a pipeline to a loop. The agent reasons about what information it needs, retrieves it, evaluates if it's sufficient, and iterates. This handles complex, multi-faceted questions that a single retrieval step can't address. The tradeoff is latency and cost - more LLM calls, more retrieval operations.


Section 7: Evaluation

Q30: How do you evaluate a RAG system?

RAG evaluation has two distinct parts: retrieval evaluation and generation evaluation. You need both.

Retrieval metrics:

  • Recall@k - Of the relevant documents, what fraction did you retrieve in the top k?
  • Precision@k - Of the documents you retrieved, what fraction are relevant?
  • MRR (Mean Reciprocal Rank) - How high does the first relevant document rank?
  • NDCG - Normalized Discounted Cumulative Gain - accounts for the position of relevant documents
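MRR, for example, takes only a few lines to compute. A sketch assuming the same (query, relevant_doc_ids) test-set shape and doc.metadata["id"] convention used elsewhere in this guide:

```python
def mean_reciprocal_rank(test_set, retriever, k=10):
    """test_set: list of (query, [relevant_doc_ids])."""
    reciprocal_ranks = []
    for query, relevant_ids in test_set:
        retrieved = retriever.invoke(query)
        retrieved_ids = [doc.metadata["id"] for doc in retrieved[:k]]

        # Reciprocal rank of the first relevant document; 0 if none found
        score = 0.0
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                score = 1.0 / rank
                break
        reciprocal_ranks.append(score)

    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```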

Generation metrics:

  • Faithfulness - Is the answer supported by the retrieved context? (No hallucination)
  • Answer relevance - Does the answer actually address the question?
  • Completeness - Does the answer cover all aspects of the question?
  • Correctness - Is the answer factually correct? (Requires ground truth)
# Simple retrieval evaluation
def evaluate_retrieval(test_set, retriever, k=5):
    """
    test_set: list of (query, [relevant_doc_ids])
    """
    recalls = []
    precisions = []
    
    for query, relevant_ids in test_set:
        retrieved = retriever.invoke(query)
        retrieved_ids = [doc.metadata["id"] for doc in retrieved[:k]]
        
        relevant_retrieved = set(retrieved_ids) & set(relevant_ids)
        
        recall = len(relevant_retrieved) / len(relevant_ids) if relevant_ids else 0
        precision = len(relevant_retrieved) / k
        
        recalls.append(recall)
        precisions.append(precision)
    
    return {
        "mean_recall": sum(recalls) / len(recalls),
        "mean_precision": sum(precisions) / len(precisions),
    }

Q31: What is the RAGAS framework?

RAGAS (Retrieval Augmented Generation Assessment) is the most widely used framework for evaluating RAG systems. It provides automated metrics that don't require ground truth answers for most evaluations.

Core RAGAS metrics:

  1. Faithfulness - What fraction of claims in the answer are supported by the context?
  2. Answer Relevancy - How relevant is the answer to the question? (Measured by generating questions from the answer and comparing to the original)
  3. Context Precision - Are the relevant chunks ranked higher in the retrieved results?
  4. Context Recall - Are all the pieces of information needed to answer the question present in the context?
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["What is our return policy?", "How do I reset my password?"],
    "answer": ["You can return items within 30 days...", "Go to settings..."],
    "contexts": [
        ["Our return policy allows 30-day returns..."],
        ["Password reset is available in user settings..."]
    ],
    "ground_truth": [
        "Items can be returned within 30 days of purchase.",
        "Navigate to Settings > Security > Reset Password."
    ]
}

dataset = Dataset.from_dict(eval_data)

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, 
#  'context_precision': 0.85, 'context_recall': 0.90}

The real interview answer: RAGAS is great for automated evaluation, but don't rely on it alone. You also need human evaluation for subjective quality, domain-specific correctness checks, and adversarial testing. Mention that you'd set up a continuous evaluation pipeline - not just a one-time test.

Q32: How do you build a RAG evaluation dataset?

This is a surprisingly practical interview question. Most teams struggle here.

Approach 1: Manual curation

  • Have domain experts write question-answer pairs
  • Most accurate but expensive and slow
  • Aim for 50-100 high-quality pairs covering edge cases

Approach 2: Synthetic generation

  • Use an LLM to generate questions from your documents
  • Fast and cheap, but needs human review
def generate_eval_set(documents, llm, n_questions=5):
    eval_pairs = []
    
    for doc in documents:
        response = llm.invoke(
            f"Generate {n_questions} diverse questions that can be answered "
            f"using ONLY the following text. For each question, provide the "
            f"answer and the specific sentence(s) that support it.\n\n"
            f"Text: {doc.page_content}\n\n"
            f"Format each as:\n"
            f"Q: <question>\n"
            f"A: <answer>\n"
            f"Evidence: <supporting text>"
        )
        eval_pairs.append({
            "document": doc,
            # parse_qa_pairs: a small helper (not shown) that splits the
            # Q:/A:/Evidence: format back into structured pairs
            "qa_pairs": parse_qa_pairs(response.content)
        })
    
    return eval_pairs

Approach 3: Production logging

  • Log real user queries and human-rated answers
  • Most representative of actual usage
  • Requires a running system and feedback mechanism

The golden rule: Your eval set should include:

  • Simple factual questions (sanity check)
  • Questions requiring synthesis across chunks
  • Questions where the answer doesn't exist in the corpus (should say "I don't know")
  • Adversarial questions (trying to trick the system)
  • Questions with multiple valid answers

Section 8: Production RAG

Q33: How do you optimize RAG for production latency?

Production RAG latency breaks down into:

  • Embedding the query: 50-200ms
  • Vector search: 10-100ms
  • Reranking (if used): 200-500ms
  • LLM generation: 500-3000ms

Optimization strategies:

1. Caching

import hashlib
import redis

r = redis.Redis()

def cached_rag(query, vectorstore, llm, ttl=3600):
    # Cache key based on query
    cache_key = f"rag:{hashlib.md5(query.encode()).hexdigest()}"
    
    # Check cache
    cached = r.get(cache_key)
    if cached:
        return cached.decode()
    
    # Full RAG pipeline
    docs = vectorstore.similarity_search(query, k=5)
    context = "\n".join([d.page_content for d in docs])
    answer = llm.invoke(f"Context: {context}\n\nQuestion: {query}")
    
    # Cache the result
    r.setex(cache_key, ttl, answer.content)
    return answer.content

2. Semantic caching - Cache based on query similarity, not exact match:

def semantic_cache_lookup(query, cache_vectorstore, threshold=0.95):
    """Check if a semantically similar query was already answered."""
    results = cache_vectorstore.similarity_search_with_score(query, k=1)
    # Note: some stores return cosine similarity (higher = closer) and
    # others return distance (lower = closer) - check your store's
    # convention before comparing against the threshold
    if results and results[0][1] > threshold:
        return results[0][0].metadata["answer"]
    return None

3. Streaming - Stream the LLM response so users see tokens immediately instead of waiting for the full response.
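A minimal streaming sketch, assuming an LLM client with a stream() method that yields chunks (as LangChain chat models provide):

```python
def streaming_rag(query, vectorstore, llm):
    """Yield answer tokens as they arrive instead of blocking on the full response."""
    docs = vectorstore.similarity_search(query, k=5)
    context = "\n".join(d.page_content for d in docs)

    # stream() yields chunks incrementally; retrieval still blocks up front,
    # but the user sees the first token as soon as generation starts
    for chunk in llm.stream(f"Context: {context}\n\nQuestion: {query}"):
        yield chunk.content
```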

4. Async retrieval - Run multiple retrieval strategies in parallel:

import asyncio

async def parallel_retrieval(query):
    dense_task = asyncio.create_task(dense_search(query))
    sparse_task = asyncio.create_task(sparse_search(query))
    
    dense_results, sparse_results = await asyncio.gather(dense_task, sparse_task)
    return merge_results(dense_results, sparse_results)

5. Precomputed answers - For common questions, precompute answers during off-peak hours.

Q34: How do you monitor a RAG system in production?

Key metrics to track:

Retrieval health:

  • Average similarity score of top-k results (dropping scores = stale index or distribution shift)
  • Percentage of queries with no results above threshold
  • Retrieval latency (p50, p95, p99)

Generation quality:

  • Faithfulness scores (automated, sampled)
  • User feedback (thumbs up/down, corrections)
  • Hallucination rate (automated detection)
  • Token usage per query

System health:

  • Vector database query latency and error rates
  • Embedding API availability and latency
  • LLM API availability, latency, and rate limits
  • Index freshness (time since last update)
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class RAGMetrics:
    query: str
    retrieval_latency_ms: float
    generation_latency_ms: float
    num_docs_retrieved: int
    avg_similarity_score: float
    faithfulness_score: Optional[float]
    user_feedback: Optional[str]

def monitored_rag(query, vectorstore, llm):
    # Track retrieval
    start = time.time()
    docs_with_scores = vectorstore.similarity_search_with_score(query, k=5)
    retrieval_time = (time.time() - start) * 1000
    
    docs = [d for d, _ in docs_with_scores]
    scores = [s for _, s in docs_with_scores]
    
    # Track generation
    start = time.time()
    context = "\n".join([d.page_content for d in docs])
    answer = llm.invoke(f"Context: {context}\n\nQuestion: {query}")
    generation_time = (time.time() - start) * 1000
    
    # Log metrics
    metrics = RAGMetrics(
        query=query,
        retrieval_latency_ms=retrieval_time,
        generation_latency_ms=generation_time,
        num_docs_retrieved=len(docs),
        avg_similarity_score=sum(scores) / len(scores) if scores else 0.0,
        faithfulness_score=None,  # Computed async
        user_feedback=None,  # Collected later
    )
    log_metrics(metrics)  # Send to your monitoring system
    
    return answer

The real interview answer: Monitoring RAG is harder than monitoring a normal API because quality is subjective and latent. You won't know about a bad answer until a user complains or you run offline evals. Emphasize the importance of sampling production queries for automated quality scoring, building feedback loops, and setting up alerts on retrieval quality metrics (not just uptime).

Q35: How do you handle document updates and index freshness?

This is a real production challenge that most tutorials skip.

Strategies:

1. Incremental indexing - Only process changed documents:

def incremental_index_update(vectorstore, doc_store, splitter):
    """Update the index with only the documents that changed."""
    # Get documents modified since the last index update
    last_update = get_last_index_timestamp()
    changed_docs = doc_store.get_modified_since(last_update)
    
    for doc in changed_docs:
        doc_id = doc.metadata["id"]
        
        # Delete old chunks for this document (delete-by-filter
        # syntax varies by vector store)
        vectorstore.delete(filter={"doc_id": doc_id})
        
        # Re-chunk and re-embed
        chunks = splitter.split_documents([doc])
        for chunk in chunks:
            chunk.metadata["doc_id"] = doc_id
        
        vectorstore.add_documents(chunks)
    
    set_last_index_timestamp(time.time())

2. Versioned documents - Keep multiple versions, tag with timestamps:

# Query with freshness preference (metadata-filter syntax varies by store)
results = vectorstore.similarity_search(
    query,
    k=5,
    filter={"updated_at": {"$gte": "2025-01-01"}}
)

3. TTL-based expiry - Automatically remove stale chunks after a set period.

4. Change data capture - Stream database changes to trigger re-indexing in near real-time.

The most common production approach: A scheduled job (every 15 minutes to every few hours depending on your freshness requirements) that does incremental indexing, plus a manual full re-index capability for when you change embedding models or chunking strategies.

Q36: How do you optimize RAG costs in production?

RAG costs come from four places:

  1. Embedding API calls - Every query needs embedding, every document needs embedding at index time
  2. Vector database - Storage, compute, and query costs
  3. Reranker API calls - Cross-encoder calls for reranking
  4. LLM API calls - The generation step (usually the largest cost)

Optimization strategies:

Reduce embedding costs:

  • Cache query embeddings for repeated/similar queries
  • Use smaller embedding models for initial retrieval
  • Batch embedding calls during indexing

Reduce vector DB costs:

  • Use quantized vectors (reduce storage by 4-8x)
  • Use Matryoshka embeddings at lower dimensions
  • Archive old documents to cold storage
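To see where the 4-8x figure comes from, here is a sketch of scalar int8 quantization with NumPy. Most vector databases implement this internally, so the code is purely illustrative:

```python
import numpy as np

def quantize_int8(vectors: np.ndarray):
    """Scalar-quantize float32 vectors to int8: 4x smaller on disk."""
    # Per-vector scale maps the largest absolute value to 127
    scale = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero
    quantized = np.round(vectors / scale).astype(np.int8)
    return quantized, scale  # keep scale to dequantize at search time

def dequantize(quantized: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return quantized.astype(np.float32) * scale

vectors = np.random.rand(1000, 768).astype(np.float32)
quantized, scale = quantize_int8(vectors)
print(vectors.nbytes / quantized.nbytes)  # 4.0
```

The tradeoff is a small rounding error per dimension, which is usually negligible for retrieval ranking; binary quantization pushes the ratio further at a larger accuracy cost.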

Reduce LLM costs:

  • Use smaller models for simple queries, larger models for complex ones
  • Compress retrieved context before sending to the LLM
  • Cache common query-answer pairs
  • Use prompt compression techniques
def cost_optimized_rag(query, vectorstore, complexity_classifier):
    # Route based on query complexity
    complexity = complexity_classifier.predict(query)
    
    if complexity == "simple":
        # Check semantic cache first
        cached = semantic_cache_lookup(query)
        if cached:
            return cached
        
        # Use smaller model, fewer docs
        docs = vectorstore.similarity_search(query, k=3)
        model = "gpt-4o-mini"
    elif complexity == "medium":
        docs = vectorstore.similarity_search(query, k=5)
        model = "gpt-4o-mini"
    else:  # complex
        # Full pipeline with reranking
        candidates = vectorstore.similarity_search(query, k=20)
        docs = rerank(query, candidates, top_n=5)
        model = "gpt-4o"
    
    context = "\n".join([d.page_content for d in docs])
    return generate(query, context, model=model)

The real interview answer: Cost optimization in RAG is about routing. Not every query needs the full pipeline. A query complexity classifier that routes simple queries through a fast/cheap path and only activates expensive reranking + large models for complex queries can reduce costs by 60-70% with minimal quality impact.


Wrapping Up

RAG interviews in 2026 aren't about definitions - they're about demonstrating that you've built these systems, hit the failure modes, and iterated. The candidates who stand out are the ones who talk about tradeoffs, mention what went wrong, and explain why they chose approach A over approach B.

Key themes interviewers are looking for:

  • You understand the full pipeline - Not just "embed and search" but chunking, reranking, evaluation, monitoring
  • You think about tradeoffs - Every design choice has a cost (latency, accuracy, money, complexity)
  • You've dealt with production reality - Stale indexes, bad chunks, hallucination, cost optimization
  • You evaluate rigorously - You don't just build and hope; you measure retrieval quality and generation faithfulness

If you want to practice these questions in a realistic interview setting, try our AI mock interviews - they'll probe for depth on exactly these topics. And if you're studying for a broader ML engineering interview, our machine learning interview prep covers the fundamentals you'll need alongside RAG knowledge.

Good luck. Go build something.