gitGood.dev

The Complete Guide to RAG Interview Questions (2026)

Patrick Wilson
35 min read

RAG - Retrieval-Augmented Generation - has gone from "cool research technique" to "thing every ML team is building" in about 18 months. If you're interviewing for any role that touches AI/ML, LLMs, or even backend engineering at an AI company, you're going to get RAG questions. Not theoretical ones - practical, "have you actually built this" questions.

This guide covers the 33 questions you're most likely to face, organized from fundamentals through production systems. Each answer focuses on what interviewers are actually looking for, not textbook definitions. If you want to practice these in a mock interview setting, check out our AI mock interviews to get real-time feedback.


Section 1: RAG Fundamentals

Q1: What is RAG and why does it exist?

RAG is a pattern where you retrieve relevant documents from an external knowledge base and include them in the prompt to an LLM. Instead of relying solely on what the model memorized during training, you give it fresh, specific context at inference time.

It exists because LLMs have three fundamental problems:

  • Knowledge cutoff - They don't know about anything after their training date
  • Hallucination - They confidently make things up when they don't know
  • No private data - They can't answer questions about your company's internal docs

RAG solves all three by grounding the model's responses in actual retrieved documents.

The real interview answer: Don't just define RAG. Explain the problem it solves. Interviewers want to hear that you understand why this pattern exists, not just what it is. Mention the tradeoff - RAG adds latency and complexity, but it gives you control over what the model knows.

Q2: How does a basic RAG pipeline work end-to-end?

A basic RAG pipeline has two phases:

Indexing (offline):

  1. Load documents from your data source
  2. Split them into chunks
  3. Generate embeddings for each chunk
  4. Store chunks + embeddings in a vector database

Query (online):

  1. User asks a question
  2. Generate an embedding for the question
  3. Search the vector database for similar chunks
  4. Stuff the top-k chunks into the LLM prompt as context
  5. LLM generates an answer grounded in those chunks
# Minimal RAG pipeline with LangChain
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Indexing
loader = TextLoader("docs.txt")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Querying
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
answer = qa_chain.invoke({"query": "What is our refund policy?"})

Q3: RAG vs. fine-tuning - when do you use each?

This is maybe the most common RAG interview question. Here's the decision framework:

| Factor | RAG | Fine-tuning |
| --- | --- | --- |
| Data changes frequently | Yes | No |
| Need citations/sources | Yes | Hard |
| Small dataset (<100 docs) | Yes | Not enough data |
| Need specific tone/style | Possible | Better |
| Latency-critical | Slower | Faster at inference |
| Cost to update | Low (re-index) | High (retrain) |
| Hallucination control | Better (grounded) | Still hallucinates |

The real interview answer: The best answer is "it depends, and often you use both." Fine-tuning teaches the model how to behave (tone, format, reasoning style). RAG teaches it what to know (facts, data, documents). A production system might fine-tune for domain-specific language and use RAG for up-to-date knowledge.

Q4: What are the failure modes of RAG?

This question tests whether you've actually built RAG systems. Common failures:

  1. Retrieval failure - The right document exists but isn't retrieved (bad chunking, bad embeddings, or the query doesn't match the document's language)
  2. Context window overflow - Too many chunks stuffed into the prompt, causing the model to lose focus
  3. Lost in the middle - LLMs pay more attention to the beginning and end of context, ignoring relevant info in the middle
  4. Stale index - Documents were updated but the vector store wasn't re-indexed
  5. Chunk boundary problems - The answer spans two chunks, and only one was retrieved
  6. Wrong abstraction level - User asks a high-level question, retrieval returns low-level implementation details

Q5: What is the "naive RAG" pattern and what are its limitations?

Naive RAG is the simplest implementation: embed query, find top-k similar chunks, stuff them into the prompt. It's what most tutorials teach.

Limitations:

  • No query understanding - Treats every query the same way regardless of intent
  • Single retrieval step - One shot to find the right context
  • No answer validation - No check on whether the retrieved docs actually support the answer
  • Fixed top-k - Always retrieves the same number of chunks regardless of query complexity
  • No source diversity - Might return k chunks from the same document section

This is why "advanced RAG" patterns exist - they add query rewriting, reranking, iterative retrieval, and answer validation on top of the basic pipeline.


Section 2: Chunking Strategies

Q6: Why does chunking matter and what happens if you get it wrong?

Chunking is how you break documents into pieces for embedding and retrieval. It matters because:

  • Too large: Chunks contain too much irrelevant info, diluting the embedding and wasting context window space
  • Too small: Chunks lose context - a sentence alone might be meaningless without its surrounding paragraph
  • Bad boundaries: Cutting mid-sentence or mid-thought creates chunks that are hard to embed meaningfully

The embedding model tries to capture the "meaning" of each chunk in a single vector. If your chunk is a grab-bag of unrelated ideas, that vector will be mediocre at representing any of them.

Q7: Explain the main chunking strategies and their tradeoffs.

Fixed-size chunking:

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator="\n"
)
  • Pros: Simple, predictable chunk sizes, fast
  • Cons: Ignores document structure, cuts mid-thought
  • Use when: Quick prototyping, uniform text (logs, transcripts)

Recursive character splitting:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
  • Pros: Tries to respect natural boundaries (paragraphs, sentences)
  • Cons: Still size-based, not truly semantic
  • Use when: General-purpose text, good default choice

Semantic chunking:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
  • Pros: Groups semantically related content together
  • Cons: Slower (requires embedding computation), variable chunk sizes
  • Use when: Documents with mixed topics, Q&A content

Document-aware chunking:

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
  • Pros: Respects document structure, preserves hierarchy
  • Cons: Requires structured input, chunks can be very uneven in size
  • Use when: Markdown docs, HTML pages, code files

The real interview answer: Start with recursive character splitting (it's the best default). Move to semantic or document-aware chunking only when you have evidence that chunk quality is hurting retrieval. Mention that chunk size is a hyperparameter you tune based on your embedding model and use case.

Q8: What is chunk overlap and why is it important?

Chunk overlap means consecutive chunks share some text at their boundaries. If chunk 1 is tokens 0-500 and chunk 2 is tokens 450-950, you have 50 tokens of overlap.

Why it matters:

  • Prevents information loss at chunk boundaries
  • If an answer spans the boundary between two chunks, the overlap ensures at least one chunk contains the full context
  • Typical overlap is 10-20% of chunk size

The tradeoff: more overlap means more chunks (more storage, more embedding cost) and potential duplicate retrieval. Too little overlap means boundary information falls through the cracks.
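The boundary arithmetic can be sketched with a toy fixed-size chunker (plain Python, with integer token IDs standing in for real tokens):

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Fixed-size chunks where consecutive chunks share `overlap` tokens."""
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + chunk_size])
        if i + chunk_size >= len(tokens):  # last chunk already reaches the end
            break
    return chunks

tokens = list(range(950))  # token IDs standing in for real tokens
chunks = chunk_with_overlap(tokens, chunk_size=500, overlap=50)
# chunk 1 covers tokens 0-499, chunk 2 covers tokens 450-949
print(len(set(chunks[0]) & set(chunks[1])))  # 50 shared boundary tokens
```

A sentence that straddles token 500 appears intact in chunk 2, which is exactly the failure the overlap buys you out of.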

Q9: How do you choose the right chunk size?

There's no universal answer, but here's the framework:

  1. Match your embedding model - Most embedding models have a sweet spot. For models like text-embedding-3-small, chunks of 256-512 tokens work well. Larger models like text-embedding-3-large can handle up to 1024 tokens effectively.

  2. Match your content type:

    • Short factual content (FAQs): 100-300 tokens
    • Technical documentation: 300-500 tokens
    • Long-form narrative: 500-1000 tokens
    • Code: Function-level or class-level splitting
  3. Match your query style:

    • Specific factual questions: Smaller chunks
    • Complex analytical questions: Larger chunks
  4. Empirical testing - The real answer is always "test it." Create a benchmark set of questions with known answers, try different chunk sizes, measure retrieval recall.

# Quick benchmark for chunk size selection
chunk_sizes = [256, 512, 1024]
results = {}

for size in chunk_sizes:
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 10)
    chunks = splitter.split_documents(docs)
    vectorstore = FAISS.from_documents(chunks, embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    
    hits = 0
    for q, expected_doc in test_set:
        retrieved = retriever.invoke(q)
        if any(expected_doc in doc.page_content for doc in retrieved):
            hits += 1
    results[size] = hits / len(test_set)

print(results)  # {256: 0.72, 512: 0.85, 1024: 0.78}

Q10: How would you chunk code differently from prose?

Code has structure that plain text splitters destroy. Good approaches:

  1. AST-based splitting - Parse the code into an abstract syntax tree, split on function/class boundaries
  2. Language-aware splitting - Use tree-sitter or similar to understand code structure
  3. Docstring-enriched chunks - Include the function signature + docstring even if the chunk is the function body
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)

# This understands Python constructs - splits on class/function boundaries
chunks = python_splitter.split_documents(code_docs)

Key insight for interviews: Always preserve the function signature and any imports with the chunk. A chunk containing just a function body with no name or parameters is nearly useless for retrieval.


Section 3: Embeddings

Q11: What are embeddings and how do they enable RAG?

Embeddings are dense vector representations of text that capture semantic meaning. Similar texts produce similar vectors, which lets you find relevant documents even when they don't share exact keywords.

In RAG, embeddings serve as the bridge between natural language queries and stored documents. You embed both your chunks and your query into the same vector space, then find the nearest neighbors.

Key properties:

  • Dimensionality typically ranges from 384 to 3072
  • Higher dimensions can capture more nuance but cost more to store and search
  • The embedding model determines the quality of your entire RAG system - garbage embeddings mean garbage retrieval

Q12: Compare the major embedding models.

| Model | Dimensions | Context | Best For |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | 8191 tokens | General purpose, cost-effective |
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | High accuracy needs |
| Cohere embed-v3 | 1024 | 512 tokens | Multilingual, search |
| BGE-large-en-v1.5 | 1024 | 512 tokens | Open-source, self-hosted |
| Voyage-3 | 1024 | 32K tokens | Long context, code |
| Nomic embed-text-v1.5 | 768 | 8192 tokens | Open-source, Matryoshka |
| GTE-Qwen2 | 1536 | 32K tokens | Open-source, multilingual |

The real interview answer: Don't just list models. Talk about the tradeoffs: API-based vs self-hosted (latency, cost, privacy), dimension size vs accuracy, and the importance of choosing a model that was trained on data similar to your domain. Mention that you'd benchmark on your actual data using MTEB or a custom eval set.

Q13: What similarity metrics are used and when?

Cosine similarity - Measures the angle between vectors. Ranges from -1 to 1. Most common choice for normalized embeddings.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Dot product (inner product) - Faster than cosine if vectors are already normalized (which most embedding models output). If normalized, dot product equals cosine similarity.

Euclidean distance (L2) - Measures straight-line distance. Lower is more similar. Less common for text embeddings but used in some FAISS configurations.

When to use what:

  • Cosine similarity: Default choice, works with any embeddings
  • Dot product: When using normalized embeddings (faster)
  • Euclidean: When magnitude matters (rare for text RAG, more common in recommendation systems)
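A quick NumPy check of the normalization claim - on unit-length vectors, the cheaper dot product and cosine similarity coincide:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Normalize to unit length (what most embedding APIs already return)
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

# On unit vectors, dot product equals cosine similarity
print(np.isclose(np.dot(a_hat, b_hat), cosine_similarity(a, b)))  # True
```

This is why vector databases often default to inner-product search when they know the embeddings are normalized.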

Q14: What is Matryoshka representation learning and why does it matter for RAG?

Matryoshka embeddings (named after Russian nesting dolls) are trained so that the first N dimensions of the full embedding are themselves a valid, useful embedding. You can truncate a 1536-dim vector to 512 or even 256 dimensions and still get reasonable similarity search results.

Why it matters for RAG:

  • Cost optimization - Store smaller vectors, use less memory, faster search
  • Tiered retrieval - Use low-dim embeddings for initial broad search, full-dim for reranking
  • Flexibility - One model, multiple accuracy/cost tradeoffs
from openai import OpenAI

client = OpenAI()

# Generate full embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is our refund policy?",
    dimensions=512  # Truncate to 512 dims (from 1536)
)
# Still works well for similarity search, but 3x cheaper to store

Q15: How do you handle embedding drift and model updates?

When you upgrade your embedding model, old and new embeddings live in incompatible vector spaces. You can't mix them.

Strategies:

  1. Full re-indexing - Re-embed all documents with the new model. Simple but expensive for large corpora.
  2. Shadow indexing - Build a new index alongside the old one, switch over when ready.
  3. Versioned indices - Keep multiple indices tagged with the model version, query the right one.
  4. Alignment layers - Train a small transformation to map old embeddings to the new space (research-grade, not common in production).

The real interview answer: Full re-indexing is almost always the right answer. The cost of re-embedding is small compared to the cost of running a degraded retrieval system. Mention that you'd automate this as part of your model update pipeline.
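Shadow and versioned indexing (strategies 2 and 3) can be combined in a tiny router. This is an illustrative sketch with stub index objects; the class and method names are made up:

```python
class VersionedIndexRouter:
    """Keep one index per embedding-model version so vectors from
    incompatible spaces are never mixed."""

    def __init__(self):
        self.indices = {}
        self.active_version = None

    def register(self, version, index):
        self.indices[version] = index
        if self.active_version is None:
            self.active_version = version  # first registered index serves traffic

    def search(self, query_embedding, version=None):
        return self.indices[version or self.active_version].search(query_embedding)

class StubIndex:
    def __init__(self, name):
        self.name = name
    def search(self, query_embedding):
        return f"results from {self.name}"

router = VersionedIndexRouter()
router.register("text-embedding-3-small", StubIndex("v1 index"))
router.register("text-embedding-3-large", StubIndex("v2 index"))  # shadow index

print(router.search([0.1, 0.2]))                  # still served by v1
router.active_version = "text-embedding-3-large"  # cut over once validated
print(router.search([0.1, 0.2]))                  # now served by v2
```

The key invariant: a query embedded with model X is only ever compared against an index built with model X.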


Section 4: Vector Databases

Q16: Compare the major vector database options.

Pinecone (managed):

  • Fully managed, serverless option available
  • Great for teams that don't want to manage infrastructure
  • Supports metadata filtering, namespaces
  • Pricing can be expensive at scale

Weaviate (self-hosted or cloud):

  • Supports hybrid search (vector + keyword) natively
  • Built-in module ecosystem (vectorizers, rerankers)
  • GraphQL API
  • Good for complex filtering requirements

pgvector (PostgreSQL extension):

  • Adds vector search to your existing Postgres
  • No new infrastructure - lives where your data already is
  • HNSW and IVFFlat index types
  • Best for teams already on Postgres with moderate scale

FAISS (library, not a database):

  • Facebook's similarity search library
  • Extremely fast, runs in-memory
  • No persistence, no metadata filtering out of the box
  • Best for prototyping or embedded in a larger system

Qdrant (self-hosted or cloud):

  • Rust-based, very performant
  • Rich filtering with payload indexes
  • Good hybrid search support
  • Growing rapidly in the ecosystem

Chroma (embedded):

  • SQLite-based, runs in-process
  • Perfect for prototyping and small projects
  • Not designed for production scale
  • Very easy to get started
# pgvector - using your existing Postgres
import psycopg2

conn = psycopg2.connect("postgresql://localhost/mydb")
cur = conn.cursor()

# Enable the extension
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")

# Create table with vector column
cur.execute("""
    CREATE TABLE documents (
        id SERIAL PRIMARY KEY,
        content TEXT,
        embedding vector(1536),
        metadata JSONB
    )
""")

# Create HNSW index for fast search
cur.execute("""
    CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")

# Query
cur.execute("""
    SELECT content, 1 - (embedding <=> %s::vector) AS similarity
    FROM documents
    ORDER BY embedding <=> %s::vector
    LIMIT 5
""", (query_embedding, query_embedding))

The real interview answer: The choice depends on your constraints. If you're already on Postgres and have <10M vectors, pgvector is the no-brainer. If you want zero ops, Pinecone. If you need hybrid search and complex filtering, Weaviate or Qdrant. FAISS is for prototypes. Name your constraints first, then pick the tool.

Q17: What indexing algorithms do vector databases use?

HNSW (Hierarchical Navigable Small World):

  • Most popular algorithm in production
  • Builds a multi-layer graph of vectors
  • O(log n) search time, high recall
  • Tradeoff: High memory usage (stores the graph structure)
  • Key params: M (connections per node), ef_construction (build quality), ef_search (search quality)

IVF (Inverted File Index):

  • Partitions vectors into clusters using k-means
  • At query time, only searches nearby clusters
  • Lower memory than HNSW, but lower recall
  • Key params: nlist (number of clusters), nprobe (clusters to search)

Product Quantization (PQ):

  • Compresses vectors by splitting into sub-vectors and quantizing each
  • Dramatic memory reduction (often 10-50x)
  • Some loss in accuracy
  • Often combined with IVF (IVF-PQ)

Flat (brute force):

  • Exact nearest neighbor search
  • O(n) - checks every vector
  • Perfect recall, but doesn't scale
  • Use for small datasets or as a baseline
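The IVF probe mechanics can be sketched in a few lines of NumPy. This is a toy, not a real index: randomly chosen vectors stand in for k-means-trained centroids, and everything else is brute force within the probed clusters:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64)).astype(np.float32)

# "Training": pick nlist vectors as centroids (real IVF runs k-means)
nlist, nprobe = 16, 4
centroids = vectors[rng.choice(len(vectors), nlist, replace=False)]
# Assign each vector to its nearest centroid's cluster
assignments = np.argmin(
    np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1
)

def ivf_search(query, k=5):
    # Probe only the nprobe closest clusters instead of all 1000 vectors
    probed = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.where(np.isin(assignments, probed))[0]
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

query = vectors[42]
print(ivf_search(query)[0])  # 42 - the query's own vector is found
```

Raising `nprobe` trades speed for recall: probe every cluster and you are back to exact flat search.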

Q18: How does metadata filtering work in vector databases and why is it important?

Metadata filtering lets you constrain vector search results based on structured attributes. Instead of just "find the 5 most similar vectors," you can say "find the 5 most similar vectors where department='engineering' and date > '2025-01-01'."

Two approaches:

  1. Pre-filtering - Filter first, then search. Fast but can miss good results if the filter is too restrictive.
  2. Post-filtering - Search first, then filter. Better recall but wasteful - you might retrieve 100 results only to throw away 95.
# Pinecone metadata filtering
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "department": {"$eq": "engineering"},
        "date": {"$gte": "2025-01-01"},
        "doc_type": {"$in": ["policy", "handbook"]}
    }
)

Why it matters: In production RAG, you almost always need filtering. Multi-tenant systems need tenant isolation. Document access controls need permission filtering. Time-sensitive data needs date filtering. A vector database without good metadata filtering is a toy.

Q19: How do you handle vector database scaling?

Key scaling dimensions:

  1. Sharding - Distribute vectors across multiple nodes. Most managed services handle this automatically.
  2. Replicas - Multiple copies for read throughput and availability.
  3. Tiered storage - Hot vectors in memory, warm vectors on SSD, cold vectors archived.
  4. Dimensionality reduction - Use Matryoshka embeddings or PCA to shrink vectors.
  5. Quantization - Compress vectors (scalar quantization, product quantization) to reduce memory footprint.

Practical numbers to know:

  • 1M vectors at 1536 dimensions (float32) = ~6 GB raw
  • HNSW index overhead typically adds 1.5-2x
  • pgvector is comfortable up to ~5-10M vectors on a single node
  • Pinecone serverless can handle billions
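The sizing numbers above follow from simple arithmetic (float32 means 4 bytes per dimension):

```python
def raw_vector_bytes(n_vectors, dims, bytes_per_float=4):
    """Raw storage for float32 vectors, before index overhead."""
    return n_vectors * dims * bytes_per_float

raw = raw_vector_bytes(1_000_000, 1536)
print(f"raw: {raw / 1e9:.2f} GB")  # raw: 6.14 GB
print(f"with HNSW overhead (~1.5-2x): {raw * 1.5 / 1e9:.2f}-{raw * 2 / 1e9:.2f} GB")
# Matryoshka truncation to 512 dims cuts storage 3x
print(f"at 512 dims: {raw_vector_bytes(1_000_000, 512) / 1e9:.2f} GB")
```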

Section 5: Retrieval Strategies

Q20: What is the difference between dense and sparse retrieval?

Dense retrieval - Uses embedding vectors. Captures semantic meaning. Can find relevant documents even without keyword overlap. This is what most people think of when they hear "RAG."

Sparse retrieval - Uses traditional keyword-based methods like BM25 or TF-IDF. Represents documents as sparse vectors where each dimension corresponds to a word in the vocabulary. Excellent at exact keyword matching.

# Dense retrieval
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
dense_results = vectorstore.similarity_search("Python async patterns", k=5)

# Sparse retrieval with BM25
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 5
sparse_results = bm25_retriever.invoke("Python async patterns")

The key difference: Dense retrieval understands that "automobile" and "car" are the same concept. Sparse retrieval only matches if the exact word appears. But sparse retrieval is better when exact terms matter - product IDs, error codes, proper nouns.

Q21: What is hybrid search and why is it the industry standard?

Hybrid search combines dense (semantic) and sparse (keyword) retrieval, then merges the results. It's the industry standard because neither approach alone is sufficient.

The most common merging strategy is Reciprocal Rank Fusion (RRF):

def reciprocal_rank_fusion(results_lists, k=60):
    """
    Merge multiple ranked result lists using RRF.
    k is a constant that controls how much weight is given to lower-ranked results.
    """
    scores = {}
    for results in results_lists:
        for rank, doc in enumerate(results):
            doc_id = doc.metadata["id"]
            if doc_id not in scores:
                scores[doc_id] = 0
            scores[doc_id] += 1 / (rank + k)
    
    sorted_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_docs

# Combine dense and sparse results
dense_results = vectorstore.similarity_search(query, k=20)
sparse_results = bm25_retriever.invoke(query)
merged = reciprocal_rank_fusion([dense_results, sparse_results])

Some vector databases support hybrid search natively:

# Weaviate hybrid search
results = client.query.get("Document", ["content"]) \
    .with_hybrid(query="Python async patterns", alpha=0.7) \
    .with_limit(5) \
    .do()
# alpha=1.0 is pure vector, alpha=0.0 is pure keyword

The real interview answer: Always mention hybrid search. It's a red flag if a candidate only talks about vector similarity. Real production systems almost always use hybrid because there are always queries where keywords matter more than semantics.

Q22: What is reranking and how does it improve retrieval?

Reranking is a two-stage retrieval pattern:

  1. First stage - Retrieve a larger candidate set (e.g., top 20-50) using fast but less precise methods
  2. Second stage - Use a more powerful model to re-score and reorder those candidates

The reranker is typically a cross-encoder - it takes the query and each document as a pair and produces a relevance score. This is much more accurate than embedding similarity but too slow to run on your entire corpus.

# Using Cohere reranker
import cohere

co = cohere.Client("your-api-key")

# First stage: retrieve 20 candidates
candidates = vectorstore.similarity_search(query, k=20)

# Second stage: rerank to get top 5
rerank_results = co.rerank(
    query=query,
    documents=[doc.page_content for doc in candidates],
    top_n=5,
    model="rerank-english-v3.0"
)

# Use the reranked results
final_docs = [candidates[r.index] for r in rerank_results.results]

Why it works: Bi-encoders (embedding models) encode query and document independently - they can't model the interaction between them. Cross-encoders see both together, so they can understand nuanced relevance. The tradeoff is speed: cross-encoders are 100-1000x slower per document.

Q23: What is query transformation and why is it useful?

Query transformation modifies the user's query before retrieval to improve results. Types:

Query rewriting - Rephrase the query to better match how information is stored:

from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

rewrite_prompt = ChatPromptTemplate.from_template(
    "Rewrite this question to be more specific and search-friendly. "
    "Original: {question}\nRewritten:"
)
rewriter = rewrite_prompt | ChatOpenAI(model="gpt-4o-mini")
better_query = rewriter.invoke({"question": "how do I fix the thing"}).content

HyDE (Hypothetical Document Embeddings) - Generate a hypothetical answer, then search for documents similar to that answer:

hyde_prompt = ChatPromptTemplate.from_template(
    "Write a short passage that would answer this question: {question}"
)
hypothetical_doc = (hyde_prompt | ChatOpenAI()).invoke({"question": query})
# Embed the hypothetical doc instead of the query
results = vectorstore.similarity_search(hypothetical_doc.content, k=5)

Multi-query - Generate multiple versions of the query and retrieve for each:

from langchain.retrievers.multi_query import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=ChatOpenAI(model="gpt-4o-mini"),
)
# Generates 3 query variants, retrieves for each, deduplicates
results = retriever.invoke("What causes high latency in our API?")

Step-back prompting - Ask a more general question first to get broader context:

  • Original: "Why did our API latency spike on March 3rd?"
  • Step-back: "What are common causes of API latency spikes?"
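A minimal sketch of the step-back pattern: retrieve for both the general and the original question, then merge with the broad context first. The `llm` and `retriever` callables here are stubs standing in for real components:

```python
STEP_BACK_PROMPT = (
    "Given the question below, write a more general step-back question "
    "about the underlying concepts.\nQuestion: {question}\nStep-back question:"
)

def step_back_retrieve(question, llm, retriever, k=4):
    """Retrieve for the step-back question and the original question,
    then merge with broad context first and specifics after."""
    step_back_q = llm(STEP_BACK_PROMPT.format(question=question))
    merged, seen = [], set()
    for doc in retriever(step_back_q, k) + retriever(question, k):
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged

# Stub llm/retriever to show the flow; swap in real ones
llm = lambda prompt: "What are common causes of API latency spikes?"
retriever = lambda q, k: [f"doc about: {q}"]
docs = step_back_retrieve("Why did our API latency spike on March 3rd?", llm, retriever)
print(docs[0])  # the broad, step-back context comes first
```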

Q24: What is contextual retrieval and how does it improve chunk relevance?

Contextual retrieval (popularized by Anthropic) addresses a core RAG problem: chunks lose context when separated from their document. A chunk saying "The quarterly revenue was $4.2B" is useless without knowing which company and which quarter.

The technique: Before embedding, prepend each chunk with a short context blurb generated by an LLM that explains where this chunk fits in the larger document.

from anthropic import Anthropic

client = Anthropic()

def add_context(chunk_text, full_document):
    """Generate contextual prefix for a chunk."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Here is a document:
<document>
{full_document}
</document>

Here is a chunk from that document:
<chunk>
{chunk_text}
</chunk>

Give a short (2-3 sentence) context that situates this chunk within 
the overall document. Focus on what a reader would need to know to 
understand this chunk in isolation."""
        }]
    )
    context = response.content[0].text
    return f"{context}\n\n{chunk_text}"

# Now embed the contextualized chunk
contextualized_chunk = add_context(chunk, full_doc)
embedding = embed_model.encode(contextualized_chunk)

Anthropic reported that contextual embeddings reduce top-20 retrieval failure rates by roughly 35% (49% when combined with BM25 hybrid search), at the cost of additional LLM calls during indexing - which runs offline, so the extra cost is usually acceptable.


Section 6: Advanced RAG Patterns

Q25: What is multi-hop RAG and when do you need it?

Multi-hop RAG handles questions that require information from multiple documents that need to be connected through reasoning. A single retrieval step can't answer these.

Example: "Which team lead manages the engineer who wrote the authentication service?"

  • Hop 1: Find who wrote the authentication service (answer: Alice)
  • Hop 2: Find Alice's team lead (answer: Bob)
def multi_hop_rag(question, vectorstore, llm, max_hops=3):
    context = []
    current_query = question

    for hop in range(max_hops):
        # Retrieve relevant docs for the current sub-query
        docs = vectorstore.similarity_search(current_query, k=3)
        context.extend(docs)
        context_text = "\n".join(d.page_content for d in context)

        # Ask LLM if we have enough info to answer
        check = llm.invoke(
            f"Given this context: {context_text}\n\n"
            f"Can you answer: {question}\n\n"
            f"If yes, provide the answer. If no, what specific information "
            f"do you still need? Respond with either ANSWER: <answer> or "
            f"NEED: <what you need to search for>"
        )

        if check.content.startswith("ANSWER:"):
            return check.content
        # Extract the next sub-query from the NEED: response
        current_query = check.content.replace("NEED:", "").strip()

    # Fall back to best-effort answer with collected context
    return llm.invoke(f"Context: {context_text}\nQuestion: {question}")

Q26: What is self-RAG and how does it improve answer quality?

Self-RAG (Self-Reflective RAG) adds a critique step where the model evaluates its own retrieval and generation. It decides:

  1. Do I need retrieval? - Some questions don't need external context
  2. Are the retrieved docs relevant? - Filter out irrelevant results before generation
  3. Is my answer supported? - Check if the generated answer is grounded in the retrieved docs
  4. Is my answer useful? - Rate the overall quality
def self_rag(question, vectorstore, llm):
    # Step 1: Does this need retrieval?
    need_retrieval = llm.invoke(
        f"Does this question require looking up external information, "
        f"or can it be answered from general knowledge?\n"
        f"Question: {question}\n"
        f"Answer RETRIEVE or GENERATE_ONLY"
    )
    
    if "GENERATE_ONLY" in need_retrieval.content:
        return llm.invoke(question)
    
    # Step 2: Retrieve and filter
    docs = vectorstore.similarity_search(question, k=10)
    
    relevant_docs = []
    for doc in docs:
        relevance = llm.invoke(
            f"Is this document relevant to the question?\n"
            f"Question: {question}\n"
            f"Document: {doc.page_content}\n"
            f"Answer RELEVANT or IRRELEVANT"
        )
        if "RELEVANT" in relevance.content:
            relevant_docs.append(doc)
    
    # Step 3: Generate with relevant docs
    context = "\n".join([d.page_content for d in relevant_docs[:5]])
    answer = llm.invoke(
        f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
    )
    
    # Step 4: Check if answer is supported
    support_check = llm.invoke(
        f"Is this answer fully supported by the provided context?\n"
        f"Context: {context}\n"
        f"Answer: {answer.content}\n"
        f"Respond: FULLY_SUPPORTED, PARTIALLY_SUPPORTED, or NOT_SUPPORTED"
    )
    
    if "NOT_SUPPORTED" in support_check.content:
        return llm.invoke(
            f"Context: {context}\n\n"
            f"Answer this question using ONLY the provided context. "
            f"If the context doesn't contain the answer, say so.\n"
            f"Question: {question}"
        )
    
    return answer

The real interview answer: Self-RAG is about making the system aware of its own quality. The key insight is that not every query needs retrieval, not every retrieved doc is relevant, and not every generated answer is faithful. Adding these checkpoints dramatically reduces hallucination.

Q27: What is Corrective RAG (CRAG)?

Corrective RAG adds a verification and correction step after initial retrieval. If the retrieved documents aren't confident enough, it falls back to web search or other knowledge sources.

The flow:

  1. Retrieve documents from vector store
  2. Grade each document for relevance (Correct / Ambiguous / Incorrect)
  3. If all correct - proceed to generation
  4. If ambiguous - refine the query and re-retrieve
  5. If incorrect - fall back to web search or alternative knowledge base

This is particularly useful when your knowledge base is incomplete. Instead of hallucinating or saying "I don't know," the system actively seeks better sources.
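The flow above can be sketched in a few lines, reusing the llm/vectorstore interfaces from the earlier examples. The grading prompt and the web_search fallback function are illustrative assumptions, not a fixed API:

```python
def corrective_rag(question, vectorstore, llm, web_search):
    """Sketch of the CRAG grade-then-correct loop (prompts are illustrative)."""
    docs = vectorstore.similarity_search(question, k=5)

    # Grade each document: CORRECT / AMBIGUOUS / INCORRECT
    grades = []
    for doc in docs:
        grade = llm.invoke(
            f"Grade this document's relevance to the question as "
            f"CORRECT, AMBIGUOUS, or INCORRECT.\n"
            f"Question: {question}\nDocument: {doc.page_content}"
        )
        grades.append(grade.content)

    # Check INCORRECT first - the string "INCORRECT" contains "CORRECT"
    correct = [d for d, g in zip(docs, grades)
               if "INCORRECT" not in g and "CORRECT" in g]

    if correct:
        context = "\n".join(d.page_content for d in correct)
    elif any("AMBIGUOUS" in g for g in grades):
        # Refine the query and re-retrieve
        refined = llm.invoke(f"Rewrite this query to be more specific: {question}")
        docs = vectorstore.similarity_search(refined.content, k=5)
        context = "\n".join(d.page_content for d in docs)
    else:
        # Fall back to web search when the knowledge base has nothing useful
        context = web_search(question)

    return llm.invoke(f"Context: {context}\n\nQuestion: {question}\n\nAnswer:")
```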

Q28: Explain Graph RAG and when it outperforms vector RAG.

Graph RAG uses a knowledge graph instead of (or alongside) a vector store. It models entities and their relationships explicitly.

When Graph RAG wins:

  • Multi-hop questions - "What products does the CEO's former company make?" requires traversing relationships
  • Aggregation queries - "How many engineers report to VP of Engineering?" requires graph traversal
  • Reasoning about relationships - "Are these two concepts related?" is natural for graphs
# Simplified Graph RAG with Neo4j
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def graph_rag_query(question, llm):
    # Step 1: Extract entities from the question
    entities = llm.invoke(
        f"Extract the key entities from this question: {question}\n"
        f"Return as a comma-separated list."
    )
    
    # Step 2: Query the knowledge graph
    with driver.session() as session:
        # Find relevant subgraph around the entities
        result = session.run("""
            MATCH (n)-[r]->(m)
            WHERE n.name IN $entities OR m.name IN $entities
            RETURN n.name, type(r), m.name
            LIMIT 50
        """, entities=entities.content.split(", "))
        
        triples = [(r["n.name"], r["type(r)"], r["m.name"]) for r in result]
    
    # Step 3: Generate answer using graph context
    context = "\n".join([f"{s} -[{p}]-> {o}" for s, p, o in triples])
    return llm.invoke(
        f"Using these relationships:\n{context}\n\nAnswer: {question}"
    )

The hybrid approach - using both vector search for unstructured text and graph queries for structured relationships - is the current state of the art for complex enterprise RAG.

Q29: What is agentic RAG and how does it differ from standard RAG?

Agentic RAG gives an AI agent control over the retrieval process. Instead of a fixed pipeline (retrieve then generate), the agent decides what to retrieve, when, and how.

Key differences from standard RAG:

  • Dynamic tool selection - The agent can choose between vector search, SQL queries, API calls, web search
  • Iterative retrieval - The agent can do multiple retrieval rounds based on what it learns
  • Query planning - The agent breaks complex questions into sub-queries
  • Self-correction - The agent recognizes when results are poor and tries different strategies
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain.tools import Tool
from langchain_openai import ChatOpenAI

# Define retrieval tools (vectorstore, run_sql, and web_search are
# assumed to be defined elsewhere in your application)
tools = [
    Tool(
        name="vector_search",
        description="Search technical documentation using semantic similarity",
        func=lambda q: vectorstore.similarity_search(q, k=5)
    ),
    Tool(
        name="sql_query",
        description="Query structured data (metrics, user counts, etc.)",
        func=lambda q: run_sql(q)
    ),
    Tool(
        name="web_search",
        description="Search the web for recent information not in our docs",
        func=lambda q: web_search(q)
    ),
]

# agent_prompt: a ChatPromptTemplate with an "agent_scratchpad"
# placeholder for intermediate tool-call steps (defined elsewhere)
agent = create_tool_calling_agent(
    llm=ChatOpenAI(model="gpt-4o"),
    tools=tools,
    prompt=agent_prompt,
)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
result = executor.invoke({"input": "Compare our Q1 revenue to the industry average"})
# Agent might: SQL query for Q1 revenue, web search for industry data, then combine

The real interview answer: Agentic RAG is about moving from a pipeline to a loop. The agent reasons about what information it needs, retrieves it, evaluates if it's sufficient, and iterates. This handles complex, multi-faceted questions that a single retrieval step can't address. The tradeoff is latency and cost - more LLM calls, more retrieval operations.


Section 7: Evaluation

Q30: How do you evaluate a RAG system?

RAG evaluation has two distinct parts: retrieval evaluation and generation evaluation. You need both.

Retrieval metrics:

  • Recall@k - Of the relevant documents, what fraction did you retrieve in the top k?
  • Precision@k - Of the documents you retrieved, what fraction are relevant?
  • MRR (Mean Reciprocal Rank) - How high does the first relevant document rank?
  • NDCG - Normalized Discounted Cumulative Gain - accounts for the position of relevant documents
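MRR, for example, takes only a few lines to compute. A sketch assuming the same (query, relevant_doc_ids) test-set shape and doc.metadata["id"] convention used elsewhere in this guide:

```python
def mean_reciprocal_rank(test_set, retriever, k=10):
    """test_set: list of (query, [relevant_doc_ids])."""
    reciprocal_ranks = []
    for query, relevant_ids in test_set:
        retrieved = retriever.invoke(query)
        retrieved_ids = [doc.metadata["id"] for doc in retrieved[:k]]

        # Reciprocal rank of the first relevant document; 0 if none found
        score = 0.0
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                score = 1.0 / rank
                break
        reciprocal_ranks.append(score)

    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```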

Generation metrics:

  • Faithfulness - Is the answer supported by the retrieved context? (No hallucination)
  • Answer relevance - Does the answer actually address the question?
  • Completeness - Does the answer cover all aspects of the question?
  • Correctness - Is the answer factually correct? (Requires ground truth)
# Simple retrieval evaluation
def evaluate_retrieval(test_set, retriever, k=5):
    """
    test_set: list of (query, [relevant_doc_ids])
    """
    recalls = []
    precisions = []
    
    for query, relevant_ids in test_set:
        retrieved = retriever.invoke(query)
        retrieved_ids = [doc.metadata["id"] for doc in retrieved[:k]]
        
        relevant_retrieved = set(retrieved_ids) & set(relevant_ids)
        
        recall = len(relevant_retrieved) / len(relevant_ids) if relevant_ids else 0
        precision = len(relevant_retrieved) / k
        
        recalls.append(recall)
        precisions.append(precision)
    
    return {
        "mean_recall": sum(recalls) / len(recalls),
        "mean_precision": sum(precisions) / len(precisions),
    }

Q31: What is the RAGAS framework?

RAGAS (Retrieval Augmented Generation Assessment) is the most widely used framework for evaluating RAG systems. It provides automated metrics that don't require ground truth answers for most evaluations.

Core RAGAS metrics:

  1. Faithfulness - What fraction of claims in the answer are supported by the context?
  2. Answer Relevancy - How relevant is the answer to the question? (Measured by generating questions from the answer and comparing to the original)
  3. Context Precision - Are the relevant chunks ranked higher in the retrieved results?
  4. Context Recall - Are all the pieces of information needed to answer the question present in the context?
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["What is our return policy?", "How do I reset my password?"],
    "answer": ["You can return items within 30 days...", "Go to settings..."],
    "contexts": [
        ["Our return policy allows 30-day returns..."],
        ["Password reset is available in user settings..."]
    ],
    "ground_truth": [
        "Items can be returned within 30 days of purchase.",
        "Navigate to Settings > Security > Reset Password."
    ]
}

dataset = Dataset.from_dict(eval_data)

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, 
#  'context_precision': 0.85, 'context_recall': 0.90}

The real interview answer: RAGAS is great for automated evaluation, but don't rely on it alone. You also need human evaluation for subjective quality, domain-specific correctness checks, and adversarial testing. Mention that you'd set up a continuous evaluation pipeline - not just a one-time test.

Q32: How do you build a RAG evaluation dataset?

This is a surprisingly practical interview question. Most teams struggle here.

Approach 1: Manual curation

  • Have domain experts write question-answer pairs
  • Most accurate but expensive and slow
  • Aim for 50-100 high-quality pairs covering edge cases

Approach 2: Synthetic generation

  • Use an LLM to generate questions from your documents
  • Fast and cheap, but needs human review
def generate_eval_set(documents, llm, n_questions=5):
    eval_pairs = []
    
    for doc in documents:
        response = llm.invoke(
            f"Generate {n_questions} diverse questions that can be answered "
            f"using ONLY the following text. For each question, provide the "
            f"answer and the specific sentence(s) that support it.\n\n"
            f"Text: {doc.page_content}\n\n"
            f"Format each as:\n"
            f"Q: <question>\n"
            f"A: <answer>\n"
            f"Evidence: <supporting text>"
        )
        eval_pairs.append({
            "document": doc,
            # parse_qa_pairs: a small helper (not shown) that splits the
            # Q:/A:/Evidence: format back into structured pairs
            "qa_pairs": parse_qa_pairs(response.content)
        })
    
    return eval_pairs

Approach 3: Production logging

  • Log real user queries and human-rated answers
  • Most representative of actual usage
  • Requires a running system and feedback mechanism

The golden rule: Your eval set should include:

  • Simple factual questions (sanity check)
  • Questions requiring synthesis across chunks
  • Questions where the answer doesn't exist in the corpus (should say "I don't know")
  • Adversarial questions (trying to trick the system)
  • Questions with multiple valid answers

Section 8: Production RAG

Q33: How do you optimize RAG for production latency?

Production RAG latency breaks down into:

  • Embedding the query: 50-200ms
  • Vector search: 10-100ms
  • Reranking (if used): 200-500ms
  • LLM generation: 500-3000ms

Optimization strategies:

1. Caching

import hashlib
import redis

r = redis.Redis()

def cached_rag(query, vectorstore, llm, ttl=3600):
    # Cache key based on query
    cache_key = f"rag:{hashlib.md5(query.encode()).hexdigest()}"
    
    # Check cache
    cached = r.get(cache_key)
    if cached:
        return cached.decode()
    
    # Full RAG pipeline
    docs = vectorstore.similarity_search(query, k=5)
    context = "\n".join([d.page_content for d in docs])
    answer = llm.invoke(f"Context: {context}\n\nQuestion: {query}")
    
    # Cache the result
    r.setex(cache_key, ttl, answer.content)
    return answer.content

2. Semantic caching - Cache based on query similarity, not exact match:

def semantic_cache_lookup(query, cache_vectorstore, threshold=0.95):
    """Check if a semantically similar query was already answered."""
    results = cache_vectorstore.similarity_search_with_score(query, k=1)
    # Note: some stores return cosine similarity (higher = closer) and
    # others return distance (lower = closer) - check your store's
    # convention before comparing against the threshold
    if results and results[0][1] > threshold:
        return results[0][0].metadata["answer"]
    return None

3. Streaming - Stream the LLM response so users see tokens immediately instead of waiting for the full response.
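A minimal streaming sketch, assuming an LLM client with a stream() method that yields chunks (as LangChain chat models provide):

```python
def streaming_rag(query, vectorstore, llm):
    """Yield answer tokens as they arrive instead of blocking on the full response."""
    docs = vectorstore.similarity_search(query, k=5)
    context = "\n".join(d.page_content for d in docs)

    # stream() yields chunks incrementally; retrieval still blocks up front,
    # but the user sees the first token as soon as generation starts
    for chunk in llm.stream(f"Context: {context}\n\nQuestion: {query}"):
        yield chunk.content
```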

4. Async retrieval - Run multiple retrieval strategies in parallel:

import asyncio

async def parallel_retrieval(query):
    dense_task = asyncio.create_task(dense_search(query))
    sparse_task = asyncio.create_task(sparse_search(query))
    
    dense_results, sparse_results = await asyncio.gather(dense_task, sparse_task)
    return merge_results(dense_results, sparse_results)

5. Precomputed answers - For common questions, precompute answers during off-peak hours.

Q34: How do you monitor a RAG system in production?

Key metrics to track:

Retrieval health:

  • Average similarity score of top-k results (dropping scores = stale index or distribution shift)
  • Percentage of queries with no results above threshold
  • Retrieval latency (p50, p95, p99)

Generation quality:

  • Faithfulness scores (automated, sampled)
  • User feedback (thumbs up/down, corrections)
  • Hallucination rate (automated detection)
  • Token usage per query

System health:

  • Vector database query latency and error rates
  • Embedding API availability and latency
  • LLM API availability, latency, and rate limits
  • Index freshness (time since last update)
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class RAGMetrics:
    query: str
    retrieval_latency_ms: float
    generation_latency_ms: float
    num_docs_retrieved: int
    avg_similarity_score: float
    faithfulness_score: Optional[float]
    user_feedback: Optional[str]

def monitored_rag(query, vectorstore, llm):
    # Track retrieval
    start = time.time()
    docs_with_scores = vectorstore.similarity_search_with_score(query, k=5)
    retrieval_time = (time.time() - start) * 1000
    
    docs = [d for d, _ in docs_with_scores]
    scores = [s for _, s in docs_with_scores]
    
    # Track generation
    start = time.time()
    context = "\n".join([d.page_content for d in docs])
    answer = llm.invoke(f"Context: {context}\n\nQuestion: {query}")
    generation_time = (time.time() - start) * 1000
    
    # Log metrics
    metrics = RAGMetrics(
        query=query,
        retrieval_latency_ms=retrieval_time,
        generation_latency_ms=generation_time,
        num_docs_retrieved=len(docs),
        avg_similarity_score=sum(scores) / len(scores) if scores else 0.0,
        faithfulness_score=None,  # Computed async
        user_feedback=None,  # Collected later
    )
    log_metrics(metrics)  # Send to your monitoring system
    
    return answer

The real interview answer: Monitoring RAG is harder than monitoring a normal API because quality is subjective and latent. You won't know about a bad answer until a user complains or you run offline evals. Emphasize the importance of sampling production queries for automated quality scoring, building feedback loops, and setting up alerts on retrieval quality metrics (not just uptime).

Q35: How do you handle document updates and index freshness?

This is a real production challenge that most tutorials skip.

Strategies:

1. Incremental indexing - Only process changed documents:

def incremental_index_update(vectorstore, doc_store, splitter):
    """Update the index with only the documents that changed."""
    # Get documents modified since the last index update
    last_update = get_last_index_timestamp()
    changed_docs = doc_store.get_modified_since(last_update)
    
    for doc in changed_docs:
        doc_id = doc.metadata["id"]
        
        # Delete old chunks for this document (delete-by-filter
        # syntax varies by vector store)
        vectorstore.delete(filter={"doc_id": doc_id})
        
        # Re-chunk and re-embed
        chunks = splitter.split_documents([doc])
        for chunk in chunks:
            chunk.metadata["doc_id"] = doc_id
        
        vectorstore.add_documents(chunks)
    
    set_last_index_timestamp(time.time())

2. Versioned documents - Keep multiple versions, tag with timestamps:

# Query with freshness preference (metadata-filter syntax varies by store)
results = vectorstore.similarity_search(
    query,
    k=5,
    filter={"updated_at": {"$gte": "2025-01-01"}}
)

3. TTL-based expiry - Automatically remove stale chunks after a set period.

4. Change data capture - Stream database changes to trigger re-indexing in near real-time.

The most common production approach: A scheduled job (every 15 minutes to every few hours depending on your freshness requirements) that does incremental indexing, plus a manual full re-index capability for when you change embedding models or chunking strategies.

Q36: How do you optimize RAG costs in production?

RAG costs come from four places:

  1. Embedding API calls - Every query needs embedding, every document needs embedding at index time
  2. Vector database - Storage, compute, and query costs
  3. Reranker API calls - Cross-encoder calls for reranking
  4. LLM API calls - The generation step (usually the largest cost)

Optimization strategies:

Reduce embedding costs:

  • Cache query embeddings for repeated/similar queries
  • Use smaller embedding models for initial retrieval
  • Batch embedding calls during indexing

Reduce vector DB costs:

  • Use quantized vectors (reduce storage by 4-8x)
  • Use Matryoshka embeddings at lower dimensions
  • Archive old documents to cold storage
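To see where the 4-8x figure comes from, here is a sketch of scalar int8 quantization with NumPy. Most vector databases implement this internally, so the code is purely illustrative:

```python
import numpy as np

def quantize_int8(vectors: np.ndarray):
    """Scalar-quantize float32 vectors to int8: 4x smaller on disk."""
    # Per-vector scale maps the largest absolute value to 127
    scale = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero
    quantized = np.round(vectors / scale).astype(np.int8)
    return quantized, scale  # keep scale to dequantize at search time

def dequantize(quantized: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return quantized.astype(np.float32) * scale

vectors = np.random.rand(1000, 768).astype(np.float32)
quantized, scale = quantize_int8(vectors)
print(vectors.nbytes / quantized.nbytes)  # 4.0
```

The tradeoff is a small rounding error per dimension, which is usually negligible for retrieval ranking; binary quantization pushes the ratio further at a larger accuracy cost.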

Reduce LLM costs:

  • Use smaller models for simple queries, larger models for complex ones
  • Compress retrieved context before sending to the LLM
  • Cache common query-answer pairs
  • Use prompt compression techniques
def cost_optimized_rag(query, vectorstore, complexity_classifier):
    # Route based on query complexity
    complexity = complexity_classifier.predict(query)
    
    if complexity == "simple":
        # Check semantic cache first
        cached = semantic_cache_lookup(query)
        if cached:
            return cached
        
        # Use smaller model, fewer docs
        docs = vectorstore.similarity_search(query, k=3)
        model = "gpt-4o-mini"
    elif complexity == "medium":
        docs = vectorstore.similarity_search(query, k=5)
        model = "gpt-4o-mini"
    else:  # complex
        # Full pipeline with reranking
        candidates = vectorstore.similarity_search(query, k=20)
        docs = rerank(query, candidates, top_n=5)
        model = "gpt-4o"
    
    context = "\n".join([d.page_content for d in docs])
    return generate(query, context, model=model)

The real interview answer: Cost optimization in RAG is about routing. Not every query needs the full pipeline. A query complexity classifier that routes simple queries through a fast/cheap path and only activates expensive reranking + large models for complex queries can reduce costs by 60-70% with minimal quality impact.


Wrapping Up

RAG interviews in 2026 aren't about definitions - they're about demonstrating that you've built these systems, hit the failure modes, and iterated. The candidates who stand out are the ones who talk about tradeoffs, mention what went wrong, and explain why they chose approach A over approach B.

Key themes interviewers are looking for:

  • You understand the full pipeline - Not just "embed and search" but chunking, reranking, evaluation, monitoring
  • You think about tradeoffs - Every design choice has a cost (latency, accuracy, money, complexity)
  • You've dealt with production reality - Stale indexes, bad chunks, hallucination, cost optimization
  • You evaluate rigorously - You don't just build and hope; you measure retrieval quality and generation faithfulness

If you want to practice these questions in a realistic interview setting, try our AI mock interviews - they'll probe for depth on exactly these topics. And if you're studying for a broader ML engineering interview, our machine learning interview prep covers the fundamentals you'll need alongside RAG knowledge.

Good luck. Go build something.