MLOps used to be a niche role hidden behind ML research teams. In 2026 it is one of the most in-demand job titles in engineering, and the interviews have changed to reflect that. Companies are not hiring MLOps engineers to set up Jenkins for training pipelines anymore. They are hiring people who can operate production AI systems with the same rigor as any other distributed system, and the interviews test for that.

The questions below are pulled from MLOps and ML platform engineering interviews at FAANG, large AI-first startups, and enterprises building serious AI infrastructure. They are organized from fundamentals through production, with answers that reflect what strong candidates actually say - not textbook summaries.

Part 1: Fundamentals

1. What is MLOps and how is it different from DevOps?

MLOps is the discipline of operating ML systems in production. It overlaps heavily with DevOps but adds three dimensions: data versioning (your model depends on data, not just code), model evaluation (correctness is probabilistic, not binary), and drift (a working system can silently degrade over time without any code change). If you can explain those three and give examples, you have answered this well.

2. What are the stages of an ML system lifecycle?

Data ingestion, feature engineering, training, evaluation, deployment, monitoring, retraining. The important part is not naming the stages - it is recognizing that each one has its own tooling, failure modes, and handoffs. Interviewers will probe where you have actually owned each stage.

3. What is the "ML system is 5% code, 95% everything else" idea?

Google's 2015 paper "Hidden Technical Debt in Machine Learning Systems" made this famous; the exact percentages are a popular paraphrase, not a figure from the paper. The point: the model itself is a small fraction of what it takes to run ML in production. The other 95% is data pipelines, feature stores, serving infrastructure, monitoring, evaluation, and configuration. Interviewers ask this to see if you understand where your job actually is.

4. What is data drift vs. concept drift?

Data drift: the input distribution changes (new user demographics, seasonality, a source system starts producing different data). Concept drift: the relationship between input and output changes (user behavior shifts, fraud patterns evolve). Both degrade model quality silently. Detecting them requires monitoring inputs and outputs separately: input distributions reveal data drift, while concept drift only shows up when predictions are compared against actual outcomes.

5. What is the difference between training-serving skew and data drift?

Training-serving skew is when the data your model sees in production is systematically different from what it saw in training, because of a bug in your feature pipeline or a difference between training and serving code paths. Data drift is the world changing over time. Skew is a bug. Drift is a fact of life. Both require monitoring, but the fixes are different.

Part 2: Feature Stores and Data

6. What problem does a feature store solve?

Three problems: consistency between training and serving (same feature code in both paths), reuse across teams (stop reinventing features), and point-in-time correctness for training (no future data leakage). If your answer is just "caching" you are missing most of it.

7. Explain point-in-time correctness.

When building a training set, for each example you need features as they would have been at prediction time, not as they are now. If a user's "total_purchases" feature today is 50 but was 3 at the time of the historical prediction, you need 3 in your training data. Getting this wrong causes data leakage: offline metrics look inflated, and the model then underperforms in production. Feature stores exist in large part to make this easier.
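A minimal sketch of a point-in-time lookup, assuming one user's feature history is kept as a timestamp-sorted list. The `history` data and the `feature_as_of` helper are hypothetical names for illustration; a real feature store does the same thing as an as-of join across many entities.

```python
from bisect import bisect_right

# Hypothetical history for one user: (timestamp, total_purchases),
# sorted by timestamp. Timestamps are plain integers for simplicity.
history = [(100, 1), (200, 3), (300, 50)]

def feature_as_of(history, prediction_ts):
    """Return the latest feature value recorded at or before prediction_ts."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, prediction_ts)
    if idx == 0:
        return None  # no value existed yet at prediction time
    return history[idx - 1][1]
```

With this history, an example whose prediction happened at time 250 gets the value 3, not today's 50.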

8. Compare online and offline feature stores.

The online store serves features at low latency for inference - typically Redis, DynamoDB, or a custom key-value store. The offline store holds large historical feature tables for training - typically in a data warehouse or object storage. The hard part is keeping them consistent: the same feature logic must populate both, and both must produce the same value for the same input.

9. When would you NOT use a feature store?

Simple projects with one team, one model, and stable features. A feature store is significant infrastructure and adds overhead. Reach for one when you have multiple models sharing features, multiple teams, complex point-in-time requirements, or a consistency bug that has bitten you more than once.

10. What is a materialized vs. on-demand feature?

Materialized features are precomputed and stored - fast at inference, but can be stale. On-demand features are computed at request time - always fresh, but add latency. Real systems use both: stable features are materialized, freshness-sensitive features (like "time since last action") are computed on demand.

Part 3: Training Pipelines and CI/CD for ML

11. What does CI/CD look like for an ML project?

Code CI (tests, linting) plus data CI (schema validation, distribution checks) plus model CI (training runs, offline eval on a fixed test set, comparison to current production model). Deployment is gated on all three, not just one. If a candidate only talks about code CI, they are describing DevOps, not MLOps.

12. How do you test a training pipeline?

Small-scale end-to-end runs on a tiny dataset, unit tests on transformations, data validation against expected schemas and distributions, and golden tests where a known input produces a known output. The goal is to catch broken pipelines before they burn hours of compute.
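As one illustration of the data validation piece, here is a small schema check that could run before any compute is spent. The column names in `EXPECTED_SCHEMA` and the `validate_rows` helper are made up for this sketch; dedicated validation libraries cover distribution checks as well.

```python
# Hypothetical schema for a training batch: column name -> expected type.
EXPECTED_SCHEMA = {"user_id": int, "amount": float, "country": str}

def validate_rows(rows, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable problems; empty means the batch passed."""
    problems = []
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        if missing:
            problems.append(f"row {i}: missing {sorted(missing)}")
            continue  # skip type checks for rows that are structurally broken
        for col, expected_type in schema.items():
            if not isinstance(row[col], expected_type):
                problems.append(f"row {i}: {col} is {type(row[col]).__name__}")
    return problems
```

A pipeline can fail fast whenever the returned list is non-empty, which is far cheaper than discovering the problem after a multi-hour training run.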

13. How do you version data?

Depends on scale. For smaller datasets, DVC or LakeFS work. For large ones, versioned table formats (Iceberg, Delta Lake, Hudi) let you time-travel queries. The goal is that every model training run is reproducible - you can point at the exact data that produced it.

14. How would you design a retraining pipeline for a model that needs to update weekly?

Scheduled job that pulls the last N weeks of data, runs feature computation, trains, evaluates against a holdout, compares to the current production model, and either promotes or fails. The comparison is the critical step: retraining should only roll forward if the new model is actually better on the metrics that matter, not just "trained without error."
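The comparison step can be sketched as a small gate function. Everything here is an assumption for illustration: the metric names, the `min_gain` threshold, and the guardrail tolerances would come from your own evaluation setup, and all metrics are treated as higher-is-better.

```python
def should_promote(candidate, production, primary="auc", min_gain=0.005,
                   guardrails=None):
    """Promote only if the primary metric improves and no guardrail regresses.

    candidate / production: dicts of metric name -> value (higher is better).
    guardrails: dict of metric name -> maximum tolerated drop.
    """
    guardrails = guardrails or {}
    # The new model must beat production by a meaningful margin.
    if candidate[primary] - production[primary] < min_gain:
        return False
    # No secondary metric may fall by more than its tolerance.
    for metric, max_drop in guardrails.items():
        if production[metric] - candidate[metric] > max_drop:
            return False
    return True
```

A retraining job that "trained without error" but returns False here simply leaves the current production model in place.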

15. What is the champion-challenger pattern?

Run the new model ("challenger") in parallel with the current production model ("champion") on shadow traffic. Compare metrics. Promote only if the challenger wins on the metrics that matter, on realistic traffic. This is the mature version of "we retrained, let's ship it."

Part 4: Model Serving

16. Walk me through serving a model behind an API.

Load the model into memory in a serving process, expose an endpoint that preprocesses inputs, runs inference, and postprocesses outputs. At scale, this runs behind a load balancer with autoscaling, often with GPU-specific considerations. The specifics matter, but the key insight is that "inference" is only one step in a serving pipeline - preprocessing and postprocessing often dominate the code complexity.

17. What are the tradeoffs between real-time and batch inference?

Real-time: low latency, but high infrastructure cost and throughput that is hard to predict. Batch: higher throughput per dollar, but users wait. Many systems use both - real-time for user-facing predictions, batch for background scoring. The interesting design question is which category your use case actually needs, and candidates who default to "real-time" without considering batch tend to overbuild.

18. How do you handle model versioning in production?

Models as immutable artifacts with explicit version identifiers. Traffic routing is a separate concern - you can have v1.2.3 deployed and route 5% of traffic to it via a config change, without redeploying. The anti-pattern is embedding model version in code and requiring a code deploy to change what is running.
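A rough sketch of config-driven routing, assuming requests carry a stable key such as a user ID. The `ROUTES` table (including the "v1.2.2" baseline version) and the `pick_version` helper are hypothetical; the point is that changing the table changes what runs, with no code deploy.

```python
import hashlib

# Hypothetical routing config: model version -> share of traffic.
# In practice this would live in a config service, not in code.
ROUTES = {"v1.2.2": 0.95, "v1.2.3": 0.05}

def pick_version(request_key, routes=ROUTES):
    """Deterministically map a request key to a model version."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a point in [0, 1].
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for version, share in sorted(routes.items()):
        cumulative += share
        if point < cumulative:
            return version
    return version  # guards against floating-point rounding on the last bucket
```

Hashing the key rather than picking at random keeps each user on the same version across requests, which makes metric comparisons cleaner.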

19. How do you warm up a model server?

Pre-load the model before accepting traffic, run a few synthetic inferences to JIT-compile or warm caches, and only then register with the load balancer. Cold starts on a model server can take 30+ seconds, and hitting a cold server with real traffic causes timeouts.
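The ordering matters more than the mechanics, so here is a bare-bones sketch of it. The `ModelServer` class is hypothetical, and the 200/503 health codes assume a load balancer that only routes to instances whose readiness check returns 200.

```python
class ModelServer:
    """Toy server that reports ready only after load and warm-up finish."""

    def __init__(self, load_model, warmup_inputs):
        self._load_model = load_model        # callable returning a model
        self._warmup_inputs = warmup_inputs  # synthetic requests
        self.model = None
        self.ready = False

    def start(self):
        self.model = self._load_model()      # 1. load weights into memory
        for x in self._warmup_inputs:        # 2. run synthetic inferences
            self.model(x)
        self.ready = True                    # 3. only now accept traffic

    def health(self):
        # Readiness probe: 503 keeps the load balancer away until warm.
        return 200 if self.ready else 503
```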

20. What are the serving challenges specific to LLMs?

Long context windows mean per-request cost scales with input size. Continuous batching depends on careful KV cache management. GPU memory, not CPU, is the bottleneck. Latency is user-visible and hard to predict because output length varies from request to request. Streaming changes your API surface. Inference optimization (quantization, speculative decoding, paged attention) matters more than it does for small models.

Part 5: Vector Databases and RAG Infra

21. When do you actually need a vector database?

When semantic search quality matters, when your corpus is too large to scan linearly (typically >100K documents), or when you need filtering on metadata alongside similarity. For a few thousand documents, an in-memory FAISS index is often sufficient and simpler.

22. Compare Pinecone, Weaviate, pgvector, and Chroma.

Pinecone: managed, fast, expensive at scale. Weaviate: self-hostable, richer querying, more moving parts. pgvector: vector search inside Postgres - great if you already have Postgres and your scale is moderate. Chroma: simple, good for prototypes. The honest tradeoff is managed vs. self-hosted and how much you want to pay in money vs. ops.

23. How do you evaluate a RAG pipeline in production?

Separate the retrieval step from the generation step. For retrieval: recall@k, MRR, nDCG against a golden set of query/document pairs. For generation: faithfulness (does the answer match the retrieved documents), answer relevance, and task-specific metrics. End-to-end metrics are useful but can hide which part of the pipeline is actually broken.
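The retrieval metrics are simple enough to compute by hand. A minimal sketch of recall@k and MRR, assuming the golden set gives a list of relevant document IDs per query; the function names are ours, not from any particular library.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    total = 0.0
    for retrieved, relevant in runs:
        relevant = set(relevant)
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(runs) if runs else 0.0
```

Tracking these separately from generation metrics tells you whether a bad answer came from missing context or from the model misusing good context.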

24. How do you handle updates to the knowledge base?

Incremental indexing. Documents are embedded and written to the vector store on change, and old versions are deleted or marked stale. Re-embed the corpus whenever the embedding model changes. The hard part is consistency: if a document is updated in the source system but not in the vector store, retrieval returns stale information that the model then confidently asserts.

25. What is "embedding drift" and how do you handle it?

When you upgrade your embedding model, all existing embeddings are no longer comparable to new ones. The mitigation is a version field on every embedding and a migration plan - typically re-embedding the whole corpus, which can be expensive. Treating embeddings as forever-immutable is a trap many teams fall into.

Part 6: Monitoring and Observability

26. What do you monitor for an ML model in production?

Input distributions (to catch data drift), output distributions (to catch model behavior changes), latency and error rates (standard system metrics), and business metrics (is the model actually helping). The first two are where teams most often fall short. A model can silently degrade for weeks without any infra alarm firing.

27. How do you detect data drift?

Statistical tests on input feature distributions: KS test for continuous features, chi-squared for categorical. Embedding-based distance metrics for text. Alert thresholds that are tight enough to catch real drift but loose enough to avoid alarm fatigue. The honest answer is that every production team calibrates this by trial and error.
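To make the KS test concrete, here is a dependency-free sketch of the two-sample KS statistic: the largest gap between the empirical CDFs of a reference window and a current window. In practice you would more likely call `scipy.stats.ks_2samp`, which also returns a p-value; the hand-rolled version below is only meant to show what is being measured.

```python
def ks_statistic(reference, current):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    i = j = 0
    d = 0.0
    while i < len(ref) and j < len(cur):
        x = min(ref[i], cur[j])
        # Advance both samples past x so ties are handled together.
        while i < len(ref) and ref[i] <= x:
            i += 1
        while j < len(cur) and cur[j] <= x:
            j += 1
        d = max(d, abs(i / len(ref) - j / len(cur)))
    return d
```

The statistic runs from 0 (identical samples) to 1 (no overlap). Where to set the alert threshold between those is exactly the calibration problem described above.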

28. How do you monitor a model when you do not have ground truth in real time?

Proxy metrics (did the user click, convert, come back), consistency checks against business rules, outlier detection on predictions, and delayed ground truth joined in as it becomes available. For LLMs, LLM-as-judge evaluations on a sampled stream. You will not have real-time correctness for most models - plan accordingly.

29. What is shadow mode and when do you use it?

Running a new model in production on real traffic but not exposing its predictions to users. You can compare shadow vs. production predictions without risk. Shadow mode is the default for new model launches in mature teams and is non-negotiable for anything user-facing.
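A simplified sketch of the idea, with hypothetical names throughout. Real systems usually run the shadow call asynchronously so it cannot add latency; this version runs inline to keep the example short.

```python
def predict_with_shadow(features, champion, challenger, log):
    """Serve the champion's prediction; record the challenger's for comparison."""
    served = champion(features)
    try:
        shadow = challenger(features)
        log.append({"features": features, "served": served, "shadow": shadow})
    except Exception as exc:
        # A shadow failure must never affect what the user sees.
        log.append({"features": features, "served": served,
                    "shadow_error": repr(exc)})
    return served
```

The log of served vs. shadow predictions is what you later analyze to decide whether the new model is safe to promote.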

30. How do you debug a model that is suddenly performing worse in production?

Data first: has the input distribution changed, is there a broken upstream source, has the feature pipeline regressed. Then model: has the version been rolled back, is there inference-time vs. training-time skew. Then downstream: is the metric itself broken. The order matters - model issues get blamed disproportionately when data is usually the culprit.

Part 7: Cost and Scale

31. How do you control inference cost at scale?

Right-sizing models (use the smallest model that hits quality bar), batching, caching, request shaping (deduplicating identical requests), and autoscaling. For LLMs specifically: prompt compression, semantic caching, routing to cheaper models for simple requests. Costs compound fast, and teams that do not track them quarterly end up surprised.

32. What is continuous batching?

A serving technique for LLMs where new requests join an in-flight batch as soon as a slot frees up, rather than waiting for the whole batch to finish. vLLM and TGI implement this. It significantly improves throughput and GPU utilization. Understanding this is a strong signal a candidate has actually served LLMs in production.
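A toy simulation can show why this helps. It counts decode steps only and assumes each request needs a known, positive number of steps; it is not how vLLM or TGI are implemented, just the scheduling idea in miniature.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the queue at every decode step."""
    queue = deque(lengths)
    active = []  # remaining decode steps per in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Drop requests that just produced their final token.
        active = [r - 1 for r in active if r > 1]
    return steps
```

With a mix of long and short requests, static batching leaves slots idle while the longest request finishes, and continuous batching fills them. The gap grows as output lengths become more uneven.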

33. How would you serve a 70B parameter model cost-effectively?

Quantization (INT8 or INT4), tensor parallelism across GPUs, continuous batching, and sometimes speculative decoding. At some point the answer is also "pick a smaller model" - many teams over-invest in serving huge models when a 13B fine-tuned model would meet their quality bar at a fraction of the cost.
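The quantization argument is mostly arithmetic, sketched below for the weights alone. The 80 GB GPU size and the 90% usable fraction are illustrative assumptions, and real deployments also need room for the KV cache and activations.

```python
import math

def weight_memory_gb(params_billion, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

def min_gpus(weights_gb, gpu_memory_gb=80, usable_fraction=0.9):
    """Smallest GPU count whose usable memory fits the weights.

    gpu_memory_gb and usable_fraction are illustrative assumptions.
    """
    return math.ceil(weights_gb / (gpu_memory_gb * usable_fraction))

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = weight_memory_gb(70, bits)
    print(f"{name}: {gb:.0f} GB of weights, at least {min_gpus(gb)} GPU(s)")
```

For 70B parameters the weights take 140 GB at FP16, 70 GB at INT8, and 35 GB at INT4, which is the difference between needing tensor parallelism just to load the model and fitting it on one large GPU.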

34. How do you decide between self-hosting a model and using an API?

Cost crossover (APIs are cheaper under a threshold, self-hosting wins above it), latency requirements, data privacy constraints, and ops capacity. Most teams should start with an API and only self-host when the economics or compliance clearly demand it. Going the other direction - starting self-hosted to "save money" on a small workload - is a common waste of engineering time.
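The cost crossover can be estimated on the back of an envelope. The sketch below treats self-hosting as a fixed monthly cost (GPUs plus the engineering time to run them) and the API as purely per-token. That is a simplification, and the numbers in the usage note are made up for illustration.

```python
def monthly_api_cost(tokens_per_month, price_per_million_tokens):
    """API spend for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens

def crossover_tokens(fixed_monthly_cost, price_per_million_tokens):
    """Monthly token volume above which the fixed-cost deployment is cheaper."""
    return fixed_monthly_cost / price_per_million_tokens * 1_000_000
```

For example, with a hypothetical $20,000 per month for self-hosting and a hypothetical API price of $2 per million tokens, `crossover_tokens(20_000, 2.0)` comes out to 10 billion tokens per month. Below that volume the API is cheaper on cost alone.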

35. What is the single highest-leverage investment for a team serving ML in production?

Observability and evaluation infrastructure. Everything else is downstream of being able to see what is happening. Teams that cannot measure the quality of their models in production will make slow, wrong decisions. Teams that can, iterate quickly. This is the common thread across every successful ML team we have seen.

Closing Thoughts

The MLOps interview bar in 2026 is higher than it was a year ago, but it is also more specific. Interviewers are not looking for candidates who can list tools. They are looking for candidates who can operate systems - who know where the failure modes are, how to measure what they cannot see, and how to make the tradeoffs that show up in production.

If you are preparing, the highest-leverage thing you can do is actually operate something. Deploy a model, let it break, debug it, write down what you learned. That experience produces the answers that separate strong candidates from those reciting memorized ones.
