Design an ML Model Serving Platform (TorchServe / Triton)
Online vs batch inference, GPU utilization tricks, autoscaling for spiky load, A/B testing models, and the feature store that decouples training from serving.
The problem
Design a platform that serves ML model predictions to internal product teams. Hundreds of models in production at any time, each with different latency, throughput, and accuracy SLAs. Some run on GPUs (LLMs, image models), others on CPU (tabular). Traffic is spiky: a recommendation model might idle at 100 req/sec and spike to 50K during a product launch. Teams want to A/B test model versions safely and roll back instantly when a new model regresses.
This is the canonical "ML platform infra" problem. Strong candidates separate online vs batch from the start, design for GPU utilization (the cost driver), and own the model registry + rollout story. Excellent candidates discuss the feature-store split between training and serving and explain why training-serving skew is the most common production ML failure.
Clarifying questions
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
- What's the model mix - LLMs, computer vision, tabular, recommendation? Each has very different infra needs.
- Latency SLAs - real-time (<100ms p99), interactive (<1s), batch (minutes-hours)?
- Throughput - hundreds of QPS or millions?
- Are inputs simple (a feature vector) or rich (long text, images, video)?
- Stateful inference (chat sessions with KV cache, ranking with personalization) or stateless (single request, single response)?
- How often do models update - hourly, daily, weekly? Hot-swap or full redeploy?
- Multi-tenancy - one model per pod, or many models packed onto shared GPUs?
- Training-serving feature parity required, or are some features online-only?
Requirements
Functional requirements
- Deploy a model: register a versioned artifact and route a portion of traffic to it
- Serve predictions: low-latency synchronous API + async batch API
- Route traffic between model versions (A/B, shadow, canary, rollback)
- Hot-swap a model with no in-flight request loss
- Per-model and per-version metrics (latency, throughput, accuracy proxies)
- Feature store integration: serving features must match training features
- Capacity isolation: a hot tenant can't starve other tenants
Non-functional requirements
- Scale: 1000 active models. 100K aggregate QPS at peak. 1000 GPUs (mixed A100, L40, T4). 10x traffic variance day-vs-night. New model deploy: 30 minutes from artifact upload to live traffic.
- Latency: Online inference p99 < 100ms for tabular, < 500ms for vision, < 2s for LLM completion. Batch inference: minutes to hours, throughput-optimized. Cold-start a new replica: < 60s.
- Availability: 99.95% per model. Platform-level: 99.99% (a single model failure must not cascade). Rollback to prior version: < 60s.
- Consistency: Best-effort. Stateless inference is naturally consistent. Stateful (chat KV cache, session-affine ranking) requires sticky routing. Feature freshness is an SLA, not strict consistency.
Capacity estimation
GPU economics
- An H100 costs ~$3-4/hr on cloud, ~$30K capex on-prem. The cost driver of any LLM/vision platform is GPU $/inference.
- Target utilization: 60-80%. Below 50%, you're paying for idle silicon. Above 90%, you're starving for headroom.
- Batching is the primary lever: a model that gets 10ms inference at batch=1 might get 30ms at batch=32 - 10x throughput at 3x latency.
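The batching arithmetic in the last bullet can be checked in a few lines. This is a minimal sketch: the function name is illustrative, and the latency figures are the ones from the bullet, not measurements.

```python
def throughput_per_sec(batch_size: int, latency_ms: float) -> float:
    """Inferences per second when each GPU call serves `batch_size` requests."""
    return batch_size / (latency_ms / 1000.0)


# Figures from the example above (illustrative, not benchmarked).
single = throughput_per_sec(batch_size=1, latency_ms=10)
batched = throughput_per_sec(batch_size=32, latency_ms=30)

throughput_gain = batched / single   # roughly 10.7x
latency_cost = 30 / 10               # 3x

print(f"{throughput_gain:.1f}x throughput at {latency_cost:.0f}x latency")
```

The gain comes from amortizing fixed per-call overhead across the batch; the latency cost is what has to fit inside the model's SLA.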
Model footprint
- Llama 3 8B: ~16GB FP16. Fits comfortably on one A10/L40 with KV cache room.
- Llama 3 70B: ~140GB FP16, requires sharding across at least 2 H100-80GB (or 4 A100-40GB); either option leaves only ~20GB for KV cache.
- Stable Diffusion XL: ~10GB. Single GPU.
- Tabular XGBoost: ~MB. CPU is plenty.
Serving fleet sizing
- 100K aggregate QPS, 80% routed to top 10 models, 20% long-tail.
- Top 10 models share 80K QPS, ~8K each; size for ~10K each to leave headroom. With batch=16 per replica giving 200 inferences/sec, need ~50 replicas per top model = 500 total replicas.
- Long-tail: ~990 models sharing the remaining 20K QPS (~20 QPS avg each), with multi-model packing onto shared GPUs - ~100 multi-tenant replicas.
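The sizing above is a ceiling division of demand by per-replica throughput. A small sketch, assuming (hypothetically) that a packed multi-tenant replica sustains the same 200 inferences/sec as a dedicated one; the function and constant names are illustrative.

```python
import math


def replicas_needed(qps: float, per_replica_qps: float) -> int:
    """Smallest replica count whose combined throughput covers `qps`."""
    return math.ceil(qps / per_replica_qps)


PER_REPLICA_QPS = 200  # batch=16 replica, figure from the text

# One top model at the ~10K QPS used in the text, then ten of them.
top_model_replicas = replicas_needed(10_000, PER_REPLICA_QPS)
top_fleet_replicas = 10 * top_model_replicas

# Long tail: the 20% share of 100K aggregate QPS on shared replicas.
# Assumption: packing does not reduce per-replica throughput.
long_tail_replicas = replicas_needed(20_000, PER_REPLICA_QPS)

print(top_model_replicas, top_fleet_replicas, long_tail_replicas)
```

In practice the per-replica number should come from a load test of each model at its target batch size, and the result is usually padded for headroom.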
Memory: KV cache for LLMs
- Per-token KV cache: ~0.5-1MB per token at FP16, depending on model.
- 10K concurrent chat sessions × 4K avg context × 1MB = 40 TB. Doesn't fit in GPU. Strategies: paged attention (vLLM), CPU offload, smaller context windows, eviction policy.
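The per-token figure follows from the model shape: keys plus values, for every layer and KV head. The sketch below is a rough estimate under stated assumptions; the layer and head counts are the published Llama 2 7B and Llama 3 70B configurations, and the function names are illustrative.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache bytes for one token: K and V, all layers (FP16 = 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def fleet_kv_bytes(sessions: int, avg_context_tokens: int,
                   per_token_bytes: int) -> int:
    """Total cache if every session holds its full context at once."""
    return sessions * avg_context_tokens * per_token_bytes


# Multi-head attention, Llama 2 7B shape: 32 layers, 32 KV heads, head_dim 128.
mha_7b = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)

# Grouped-query attention, Llama 3 70B shape: 80 layers, 8 KV heads, head_dim 128.
gqa_70b = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

total = fleet_kv_bytes(sessions=10_000, avg_context_tokens=4_000,
                       per_token_bytes=mha_7b)

print(f"MHA 7B: {mha_7b / 2**20:.2f} MiB/token")
print(f"GQA 70B: {gqa_70b / 2**20:.2f} MiB/token")
print(f"Fleet total at 0.5 MiB/token: {total / 1e12:.1f} TB")
```

Grouped-query attention is why a much larger model can have a smaller per-token cache than an older 7B model; either way the fleet total is far beyond aggregate GPU memory.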
Network
- Per request: ~1-100KB input, ~1-10KB output. Steady-state ~1 GB/s aggregate. Trivial relative to GPU bottleneck.
Feature store
- Online feature reads: 100K req/sec × ~50 features × 100 bytes = 500 MB/s read. Redis or DynamoDB per tenant.
- Training: batch reads from offline feature store (Iceberg/Delta on S3). Daily refresh of online from offline.
High-level architecture
The platform splits cleanly into a control plane (model registry, deployment manager, traffic controller) and a data plane (inference servers behind a smart router). The control plane is small, transactional, and rarely touched. The data plane is large, throughput-heavy, and the cost center.
The defining decisions: (1) one-model-per-replica vs multi-model-per-replica (the latter improves GPU utilization for long-tail models at the cost of complexity), (2) sync-only vs sync + async batch (every serious platform has both), (3) where the feature store lives in the request path (inline vs caller-fetched), (4) how to handle the LLM-specific concerns of streaming output, KV cache management, and continuous batching.
The defining operational property: every model deploy must be safely reversible. A bad model that ships to 100% of traffic is a P0 incident; the platform must make rollback a 60-second operation, not a re-deploy.
Model registry
Versioned artifact store + metadata. Each version: model binary, framework, input schema, expected feature list, accuracy metrics from training. The source of truth for what's deployable.
Deployment manager
Reconciles desired-state (this model, version V, N replicas, traffic share X%) against the actual fleet. Schedules pods onto the right hardware tier (GPU type, memory). Rolls out new versions with canary + auto-rollback hooks.
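The reconcile step can be sketched as a pure diff between desired and actual state. Everything here is hypothetical (the `Deployment` type, the action names, the in-memory `actual` map); a real deployment manager would also handle hardware placement and rollout pacing.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Deployment:
    model: str
    version: str
    replicas: int


Action = Tuple[str, str, str, int]  # (action, model, version, count)


def reconcile(desired: List[Deployment],
              actual: Dict[Tuple[str, str], int]) -> List[Action]:
    """Return the steps that move the running fleet toward the desired state."""
    actions: List[Action] = []
    wanted = {(d.model, d.version): d.replicas for d in desired}

    # Bring every desired (model, version) to its target replica count.
    for (model, version), target in wanted.items():
        running = actual.get((model, version), 0)
        if running < target:
            actions.append(("scale_up", model, version, target - running))
        elif running > target:
            actions.append(("scale_down", model, version, running - target))

    # Anything running that is no longer desired gets drained.
    for (model, version), running in actual.items():
        if (model, version) not in wanted and running > 0:
            actions.append(("scale_down", model, version, running))
    return actions


plan = reconcile(
    desired=[Deployment("ranker", "v2", 3)],
    actual={("ranker", "v1"): 2, ("ranker", "v2"): 1},
)
print(plan)
```

To keep an older version warm for rollback, it would simply stay in the desired state with a nonzero replica count and zero traffic share.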
Inference router
Stateless front-door. Authenticates, attributes the call to a tenant, picks a model version per the routing policy (A/B, canary, shadow), forwards to a healthy replica. Single source of truth for traffic policy.
Inference server (TorchServe / Triton / vLLM)
Loads one or more models. Receives requests, batches them, runs the GPU inference, returns results. Triton supports multi-model packing; vLLM is purpose-built for LLM serving with continuous batching.
Batch inference scheduler
For async/large workloads: takes a job (model + inputs in S3 + output location), schedules it on spot GPUs when available, writes results back. Different cost profile from online (10x cheaper per inference).
Feature store (online)
Low-latency feature server. Same feature definitions as training; values updated from streaming or batch jobs. Inference replicas (or the router) fetch features by key per request.
Feature store (offline)
Lake-table store of historical feature values for training. Same schema as online; backfills from event streams. The online store is a materialization of the latest values.
Traffic controller
Owns A/B / canary / shadow / rollback policies per model. Pushes routing rules to the router. Provides a UI for SREs to flip traffic in seconds.
Observability stack
Per-model latency / QPS / GPU util / error rate. Drift detection on input distributions and prediction distributions. Alerting on accuracy proxies (downstream conversion, click-through, business metrics).
Autoscaler
Scales replicas per model based on QPS, queue depth, and GPU utilization. Different scaling rules for online (latency-sensitive, scale ahead) vs batch (throughput, scale to demand).
Deep dives
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
1. Online vs batch inference: the two modes and where they meet
Online and batch are different products built on the same model. Mixing them up wastes money and breaks SLAs.
Online (synchronous) inference
- One request → one response, latency-sensitive (<1s typical).
- Throughput per replica is low (10-1000 inferences/sec).
- GPU utilization rarely tops 40% without aggressive batching.
- Cost per inference is high.
Use cases: real-time recommendations, chat, fraud scoring, autocomplete, search ranking.
Batch (offline) inference
- N requests in a job → N responses, throughput-sensitive (latency = job duration, minutes to hours).
- Per-replica throughput much higher (large batches, no SLA pressure).
- GPU utilization 80-95%.
- Cost per inference is 5-10x cheaper.
Use cases: nightly recommendation refresh, embedding all documents in a corpus, scoring all users for a churn model, generating image variants.
The cost gap
A daily batch job that scores 100M users at $0.0001/inference = $10K. The same workload online at $0.001/inference = $100K. The gap is real money - any team running batch as repeated online calls is leaving 90% on the table.
Hybrid: micro-batching
Some online use cases tolerate 50-200ms of batching window. Group concurrent requests into batches of 16-64; run as one GPU call; demux responses. Triton's dynamic batcher does this with a configurable queue delay; LLM serving stacks (vLLM, TGI) go a step further with continuous batching.
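A micro-batcher can be sketched with `asyncio`: callers await a future, and a background loop collects requests until the batch is full or the window expires. This is a minimal illustration, not how any particular server implements it; `MicroBatcher`, `run_batch`, and the toy model are hypothetical, and error handling is omitted.

```python
import asyncio


class MicroBatcher:
    def __init__(self, run_batch, max_batch_size=16, max_wait_s=0.05):
        self._run_batch = run_batch          # callable: list of inputs -> list of outputs
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = asyncio.Queue()

    async def predict(self, item):
        """Called once per request; resolves when its batch has run."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]            # block for the first request
            deadline = loop.time() + self._max_wait_s    # then open the batching window
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self._run_batch([item for item, _ in batch])  # one "GPU call"
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)                # demux back to callers


async def demo():
    batch_sizes = []

    def toy_model(items):
        batch_sizes.append(len(items))
        return [2 * x for x in items]

    batcher = MicroBatcher(toy_model, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict(i) for i in range(8)))
    worker.cancel()
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results, batch_sizes)
```

The window (`max_wait_s`) is the knob that trades latency for batch size: it is the most a request waits when traffic is too light to fill a batch.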
The "near-line" pattern
Pre-compute predictions for the active user base in a nightly batch; serve the cached predictions online. Trades freshness for cost. Common for recommendations: yesterday's "people you might know" served in <10ms from a cache, refreshed nightly via batch.
Spec'ing the right mode
Ask: what's the actual user-perceived latency requirement?
- "User clicks a button and waits for a result" → online.
- "User opens an app and sees pre-computed feed" → near-line (cached batch).
- "We score users overnight and email them tomorrow" → batch.
Most "real-time" requirements are actually "fast enough that the user doesn't notice" - which is often near-line, not online. The cost difference justifies the conversation.
2. GPU utilization: batching, packing, KV cache, and the cost ceiling
GPUs are the cost driver. Every percentage point of utilization is real money. The techniques are well-known; the engineering is in applying them rigorously.
Static batching
Wait for B requests, run them as one GPU call, return all results.
Pros: simple, predictable.
Cons: the first request in a batch waits for B-1 others to arrive, so tail latency suffers at low traffic unless a timeout caps the wait.
Use case: classic ML inference (CV, tabular). Pick B based on the latency budget.
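Continuous (iteration-level) batching
Don't wait for the batch to fill, and don't wait for it to finish. The scheduler works at the granularity of a single decode step: a new request joins the running batch as soon as a slot frees up, and finished requests exit immediately. Used in vLLM, TGI, TensorRT-LLM. (Triton's "dynamic batching" is a different mechanism: request-level batching with a queue-delay window.)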
Pros: massive throughput improvement for variable-length outputs (LLMs especially). Better latency than static batching for the same throughput.
Cons: more complex scheduler; requires kernel-level support.
For LLMs, this is the difference between 100 tokens/sec and 1000 tokens/sec on the same GPU.
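A toy simulation shows where the gain comes from when output lengths vary. It counts decode steps only, assuming one token per active sequence per step and ignoring prefill; the function names and the request mix are illustrative.

```python
from typing import List


def static_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """Each batch occupies the GPU until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """A freed slot is refilled from the queue before the next decode step."""
    pending = list(output_lengths)   # FIFO queue of requests (tokens to generate)
    running: List[int] = []          # remaining tokens for each active sequence
    steps = 0
    while pending or running:
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        steps += 1                   # one decode step emits one token per sequence
        running = [r - 1 for r in running if r > 1]   # drop sequences that finished
    return steps


# Two long generations mixed with short ones (lengths must be >= 1).
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)

print(static_steps, continuous_steps)
```

With static batching, the short requests' slots sit idle until the long one finishes; with continuous batching the second long request starts as soon as a short one exits.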
Multi-model packing
Load 5-10 small models on the same GPU. Route requests to the model on-demand. Reduces idle GPU time for long-tail models.
Risks:
- Cold model swap latency (load model into VRAM): 1-10s for a multi-GB model. Cache hot models; spill cold ones.
- Memory contention: one model OOMs the GPU and crashes neighbors. Requires hard memory caps.
Triton supports this natively through concurrent model execution (several models or instances on one GPU) and explicit model load/unload via its model control mode.
Quantization
Reduce model weights from FP16 to INT8 or INT4. 2-4x smaller memory footprint, 1.5-3x faster inference, small accuracy loss (typically <1%).
Use AWQ, GPTQ, or FP8 (H100-native). For production, validate accuracy regression on a held-out test set before rolling out.
KV cache (LLM-specific)
The cache of attention keys/values from prior tokens. Grows linearly with context length × concurrent sessions.
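A 70B model with 4K context per session × 100 concurrent sessions needs on the order of 130GB just for cache (at ~0.3MB per token with grouped-query attention; more for architectures without it), comparable to the weights themselves. With paged attention (vLLM), the cache is allocated in fixed-size pages on demand and shared across requests with a common prefix, which removes most fragmentation waste; the vLLM authors report 2-4x higher throughput than earlier serving systems.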
Speculative decoding
A small "draft" model generates N tokens; the big model verifies them in parallel. If the draft is right, you get N tokens in one big-model call instead of N. 2-3x speedup for chat-style workloads.
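The propose-then-verify loop can be sketched with toy deterministic "models". This shows only the greedy variant, where a draft token is accepted if it equals the target's own choice; sampling-based decoding uses a probabilistic acceptance rule instead. All names and both toy models are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]   # context -> next token (greedy)


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One round: the draft proposes k tokens, the target keeps the valid prefix."""
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    accepted: List[int] = []
    ctx = list(prefix)
    # In a real system these checks are ONE batched forward pass of the big
    # model over all k positions; the loop here is only for clarity.
    for tok in proposal:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)    # target's token replaces the first miss
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))         # all k matched: one extra token for free
    return accepted


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10            # counts upward, wrapping at 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10   # wrong after a 4


all_accepted = speculative_step([0], toy_draft, toy_target, k=4)
early_reject = speculative_step([0, 1, 2, 3], toy_draft, toy_target, k=4)
print(all_accepted, early_reject)
```

Either way the output is exactly what the target alone would have produced; the speedup depends on how often the draft agrees.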
The ceiling
Even with everything above, GPU utilization in production rarely tops 70-80%. Headroom is needed for traffic spikes; some models inherently can't batch well; cold models reload. Below 50%, you have a problem; above 80%, you're flying close to the sun.
The cost-per-inference rubric
- Naive single-request inference on an LLM: $0.01-0.10 per 1K tokens.
- Continuous batching: $0.001-0.01 per 1K tokens.
- Quantized + speculative + batched: $0.0005-0.005 per 1K tokens.
- Open-source models on commodity GPUs vs API providers: 5-10x cheaper at scale.
The interview signal: candidates who own the cost model and explain the levers score way higher than candidates who say "we'd add more GPUs".
3. Autoscaling for variable load: scale by what, exactly?
Naive HPA on CPU breaks for ML serving. The right signals are model-specific and require thought.
The problem with CPU-based autoscaling
ML inference servers spend 95% of CPU time blocking on GPU calls. CPU stays low even when the GPU is saturated. CPU-based HPA never triggers; the GPU melts.
Better signals
- GPU utilization: scale up when avg GPU util > 70%. Lags by ~30s but reflects the bottleneck.
- Queue depth at the inference server: each server's request queue. When queue grows, scale. Faster to react than utilization.
- Request latency p99: scale when p99 exceeds SLA. Reactive but user-aligned.
- Concurrent in-flight requests per replica: bounded by replica capacity; predictable scaling.
Combine: target GPU util 60% normally, escalate on queue depth / latency for spikes.
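One way to combine the signals is a proportional rule on GPU utilization plus a multiplicative bump when queue depth or p99 breaches a threshold. The sketch below is an assumption-laden illustration: the thresholds, the `spike_factor`, and all names are hypothetical, and the proportional part mirrors the formula the Kubernetes HPA uses.

```python
import math
from dataclasses import dataclass


@dataclass
class ReplicaMetrics:
    replicas: int
    gpu_util: float         # 0.0-1.0, averaged across replicas
    queue_depth: float      # average queued requests per replica
    p99_latency_ms: float


def desired_replicas(m: ReplicaMetrics,
                     target_util: float = 0.60,
                     max_queue_per_replica: float = 8.0,
                     p99_slo_ms: float = 100.0,
                     spike_factor: float = 1.5) -> int:
    # Steady state: scale proportionally toward the utilization target.
    by_util = math.ceil(m.replicas * m.gpu_util / target_util)

    # Spike path: queue depth and latency react before utilization does.
    spiking = (m.queue_depth > max_queue_per_replica
               or m.p99_latency_ms > p99_slo_ms)
    by_spike = math.ceil(m.replicas * spike_factor) if spiking else 0

    return max(1, by_util, by_spike)


busy = desired_replicas(ReplicaMetrics(10, gpu_util=0.85, queue_depth=2, p99_latency_ms=60))
spike = desired_replicas(ReplicaMetrics(10, gpu_util=0.25, queue_depth=20, p99_latency_ms=60))
quiet = desired_replicas(ReplicaMetrics(10, gpu_util=0.31, queue_depth=0, p99_latency_ms=40))
print(busy, spike, quiet)
```

The `quiet` result is only a recommendation; actually removing replicas should still go through a scale-down delay.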
Scale-up speed
- Cold start: pull model artifact (10s-1min for multi-GB), load to GPU memory (5-30s), warm-up forward pass (1-5s).
- Total: 30-90s. Too slow for instant traffic spikes.
Mitigations:
- Pre-warmed pool: keep N idle replicas ready; replace when used. Costs idle GPU.
- Burst mode: sustained traffic gets dedicated capacity; spikes route to a multi-tenant pool with degraded SLA.
- Predictive scaling: anticipate scheduled events (product launch, marketing email blast, daily peak hours) and pre-provision.
Scale-down hysteresis
Aggressive scale-down causes thrashing. Standard: scale up immediately, scale down after 10-15 minutes of low utilization. Cost: idle GPU time. Benefit: no flapping.
Burst pricing
Spot/preemptible GPUs are 50-70% cheaper than on-demand. Use spot for batch and for the "headroom" tier. Reserve on-demand for the SLA-bound critical models.
Per-tenant fairness
Multi-tenant platforms need per-tenant rate limits + queue isolation. Without it, one team's runaway job starves the rest. Enforce at the router (token bucket per tenant) and at the inference server (per-tenant queues, weighted fair queueing).
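The router-side half of this can be sketched as one token bucket per tenant. A minimal in-memory version, assuming a single router process; `TokenBucket`, `TenantLimiter`, and the tenant names are hypothetical, and a real router would share state or shard tenants across instances.

```python
import time
from typing import Dict, Optional, Tuple


class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float, now: Optional[float] = None):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None, cost: float = 1.0) -> bool:
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that passed, never above the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    def __init__(self, limits: Dict[str, Tuple[float, float]]):
        self._limits = limits                       # tenant -> (rate_per_sec, burst)
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant: str, now: Optional[float] = None) -> bool:
        if tenant not in self._limits:
            return False                            # unknown tenants are rejected
        bucket = self._buckets.get(tenant)
        if bucket is None:
            rate, burst = self._limits[tenant]
            bucket = self._buckets[tenant] = TokenBucket(rate, burst, now)
        return bucket.allow(now)


limiter = TenantLimiter({"search": (10, 2), "ads": (10, 2)})
search_burst = [limiter.allow("search", now=0.0) for _ in range(3)]
ads_first = limiter.allow("ads", now=0.0)          # unaffected by search's burst
search_later = limiter.allow("search", now=0.5)    # bucket has refilled
print(search_burst, ads_first, search_later)
```

The `now` parameter exists only to make the behavior deterministic in tests; production calls would omit it.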
Autoscaler failure modes
- Death spiral: load increases → latency rises → requests retry → more load. Fix: rate-limit at the router and shed excess load with fast 503s instead of queueing it; clients should back off rather than retry immediately.
- Scale-up lag during cold-start: new replicas don't help for 60s. Fix: pre-warmed pool, faster artifact distribution (S3 → local cache → mmap).
- Wrong signal: scaling on CPU misses the bottleneck. Fix: use GPU + queue depth.
Cost target
Steady-state $0.50-2.00 per 1M tokens for popular open models served well-utilized; $5-50 per 1M tokens for OpenAI / Anthropic API. The gap funds the entire engineering team for medium-scale platforms.
4. A/B, shadow, canary, rollback: the safe deploy story
Models drift. New versions can regress on segments invisible during training. The platform must make every deploy safely reversible.
Canary
Route 1% → 5% → 25% → 100% of traffic to the new version over hours. Monitor latency, error rate, and accuracy proxies (downstream conversion, click-through). Auto-rollback on regression.
The standard for low-risk deploys.
A/B testing
Route 50/50 to old and new versions for a defined window (often a week). Measure business metrics (revenue per user, retention) to decide. Required for any model change with potential business impact - "the new model has higher offline accuracy" doesn't always mean higher business value.
Key infra: stable hashing of users to the same variant (don't flip a user mid-experiment), exposure logging, statistical significance gates before declaring a winner.
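Stable assignment is usually a hash of the user and experiment IDs mapped onto the traffic split. A minimal sketch; the function name, the salt format, and the experiment name are hypothetical.

```python
import hashlib
from typing import Dict


def assign_variant(user_id: str, experiment: str, weights: Dict[str, float]) -> str:
    """Deterministically map a user to a variant for one experiment."""
    # Salting with the experiment name decorrelates assignments across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64   # approximately uniform in [0, 1]

    total = sum(weights.values())
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return variant
    return list(weights)[-1]   # guard against float rounding at the upper edge


weights = {"control": 0.5, "treatment": 0.5}
first = assign_variant("user-42", "ranker-v7-vs-v8", weights)
again = assign_variant("user-42", "ranker-v7-vs-v8", weights)

assignments = [assign_variant(f"user-{i}", "ranker-v7-vs-v8", weights)
               for i in range(10_000)]
control_share = assignments.count("control") / len(assignments)
print(first, again, round(control_share, 3))
```

Because the assignment depends only on the IDs, any router replica gives the same answer without shared state. Changing the weights mid-experiment does move some users between variants, which is one reason A/B splits are normally held fixed for the whole window.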
Shadow (dark launch)
Send 100% of traffic to both old and new. Use only the old's response (so users see no change). Compare predictions offline.
Pros: catch behavioral regressions before users see them. No business risk.
Cons: 2x inference cost during the shadow window. Can't validate end-to-end business outcomes (since users only see old).
Use for: high-stakes changes, model architecture overhauls, framework migrations.
Multi-armed bandit
Adaptive routing: variant that's winning gets more traffic, in real time. Smarter than A/B for cases where the winner is clear early.
Cons: harder to interpret, harder to halt cleanly, more complex stat machinery.
Use selectively - don't make MAB the default; A/B is more transparent.
Rollback
The single most important capability. Must be:
- One operator action (one click or one CLI command).
- Effective in <60s end-to-end (router config update + propagation).
- Reversible (rollback the rollback if needed).
Pattern: keep the previous version (N-1) warm alongside the current one (N). Rollback = flip router weight. Re-deploying the old artifact takes 30 minutes; flipping a weight takes 5 seconds.
Auto-rollback gates
The platform should auto-rollback on:
- Error rate spike (>2x baseline).
- Latency spike (>1.5x baseline p99).
- Accuracy proxy regression (downstream conversion drop > X%).
- Drift detection on input distribution.
False positives are annoying; false negatives ship bad models. Tune thresholds per model.
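The gates and the canary ramp can be sketched as two small functions. The ratios for error rate and latency are the ones listed above; the conversion threshold is left as a required argument because it is tuned per model, and the 5% used in the example is only a placeholder. The drift gate is omitted, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

CANARY_STAGES = (1, 5, 25, 100)   # percent of traffic, from the canary ramp


@dataclass
class Snapshot:
    error_rate: float
    p99_ms: float
    conversion: float


def rollback_reasons(baseline: Snapshot, canary: Snapshot, *,
                     max_conversion_drop: float,
                     max_error_ratio: float = 2.0,
                     max_latency_ratio: float = 1.5) -> List[str]:
    """Return the names of every gate the canary violates (empty = healthy)."""
    reasons = []
    if canary.error_rate > baseline.error_rate * max_error_ratio:
        reasons.append("error_rate")
    if canary.p99_ms > baseline.p99_ms * max_latency_ratio:
        reasons.append("latency")
    if canary.conversion < baseline.conversion * (1 - max_conversion_drop):
        reasons.append("conversion")
    return reasons


def next_canary_percent(current: int, reasons: List[str]) -> int:
    """Advance one stage when healthy; drop to 0% on any violated gate."""
    if reasons:
        return 0
    later = [stage for stage in CANARY_STAGES if stage > current]
    return later[0] if later else current


baseline = Snapshot(error_rate=0.010, p99_ms=80, conversion=0.040)
healthy = Snapshot(error_rate=0.012, p99_ms=90, conversion=0.040)
bad = Snapshot(error_rate=0.030, p99_ms=130, conversion=0.030)

# max_conversion_drop=0.05 is a placeholder value, not a recommendation.
healthy_reasons = rollback_reasons(baseline, healthy, max_conversion_drop=0.05)
bad_reasons = rollback_reasons(baseline, bad, max_conversion_drop=0.05)
print(next_canary_percent(5, healthy_reasons), next_canary_percent(5, bad_reasons))
```

Returning the list of violated gates, not just a boolean, gives the rollout audit log something concrete to record.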
Per-tenant routing
Sometimes one tenant needs the old version (compliance freeze, integration testing) while everyone else moves on. The router should support per-tenant overrides as a supported pattern, not a hack.
Rollout audit
Every traffic-routing change is logged: who, when, from version → version, percent. Required for incident postmortems and regulatory audit.
5. Feature store: training-serving parity and the most common production ML failure
Most production ML failures aren't model bugs - they're feature mismatches between training and serving. The feature store exists to prevent that.
The problem
At training time, a feature like "user's average order value over the last 30 days" is computed from a historical batch table.
At serving time, the same feature is computed live from an online store.
If the two definitions drift even slightly - different rounding, different timezone, different definition of "30 days" - the model's predictions silently degrade.
This is the #1 production ML failure mode. It rarely shows up in dashboards because the model's offline accuracy looks fine; only A/B against a known-good baseline reveals the regression.
The feature store solution
A single source of truth for feature definitions:
- One DSL or codebase defines each feature (e.g., "AVG(order.value) WHERE order.user_id = X AND order.timestamp > NOW() - 30 days").
- The same definition is materialized into both an offline store (for training) and an online store (for serving).
- Models reference features by name + version, never by raw computation.
Tools: Feast (open source), Tecton, Hopsworks, Vertex AI Feature Store, Databricks Feature Store.
Online vs offline split
- Offline store: lake-table (Iceberg/Delta on S3). Historical feature values, point-in-time correct, used for training.
- Online store: low-latency KV (Redis, DynamoDB, ScyllaDB). Latest feature values, used for serving.
Both are populated from the same compute jobs - typically streaming for fast-changing features (last 5 minutes of activity) and batch for slow-changing ones (90-day rolling stats).
Point-in-time correctness for training
A common training bug: using "the user's account_status today" when training on data from last month. The feature has leaked future information.
The feature store solves this with as-of joins: at training time, retrieve features as they were at the timestamp of each training row. Tools that don't support as-of joins force teams to do this manually, badly.
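An as-of join picks, for each training row, the latest feature value recorded at or before that row's timestamp. A standard-library sketch with hypothetical data shapes; at scale the same operation is what pandas exposes as `merge_asof`, and feature stores run it over lake tables.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple


def as_of_join(training_rows: List[Tuple[str, int]],
               feature_history: Dict[str, List[Tuple[int, str]]]
               ) -> List[Tuple[str, int, Optional[str]]]:
    """
    training_rows:   (entity_id, event_ts) pairs
    feature_history: entity_id -> [(ts, value), ...] sorted by ts ascending
    """
    joined = []
    for entity_id, event_ts in training_rows:
        history = feature_history.get(entity_id, [])
        timestamps = [ts for ts, _ in history]
        # Number of recorded values with ts <= event_ts.
        idx = bisect_right(timestamps, event_ts)
        value = history[idx - 1][1] if idx > 0 else None   # None: nothing known yet
        joined.append((entity_id, event_ts, value))
    return joined


account_status = {"u1": [(10, "trial"), (50, "paid"), (90, "churned")]}
rows = [("u1", 5), ("u1", 10), ("u1", 60), ("u1", 100), ("u2", 60)]

joined = as_of_join(rows, account_status)
print(joined)
```

A naive join on the latest value would label every `u1` row as "churned", which is exactly the leak described above.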
Feature freshness SLAs
Different features have different freshness needs:
- "User's lifetime spend" - daily refresh is fine.
- "Items in user's cart right now" - sub-second.
- "User's last 5 page views" - seconds.
The platform must support per-feature freshness tiers without forcing every feature into the most-expensive tier.
Feature drift detection
Monitor each feature's distribution online; alert when it shifts significantly from the training distribution. Catches: data pipeline broke, upstream service changed schema, business event invalidated the model's assumptions.
The interview signal: candidates who introduce feature stores as a "thing teams need" score higher than candidates who treat feature engineering as a per-team concern. At scale, the lack of a feature store guarantees training-serving skew.
6. Observability for ML: what to monitor when the model itself can be wrong
Traditional service observability (latency, errors, RPS) is necessary but not sufficient for ML. The model can return 200 OK with a wrong answer.
Standard service metrics
- Latency: per-model p50/p99, broken down by version.
- Throughput: QPS per model + per tenant.
- Error rate: 4xx (bad input), 5xx (server / inference error).
- GPU utilization, memory pressure, queue depth.
ML-specific metrics
- Input drift: distribution of features over time. Alert when a feature's distribution shifts significantly (KS test, PSI, Wasserstein distance).
- Prediction drift: distribution of model outputs. A recommendation model that suddenly recommends only one category is broken even if latency is fine.
- Confidence drift: average confidence/probability of predictions. Sudden drops suggest production data has moved out of distribution (OOD) for the model.
- Accuracy proxies: downstream business metrics (click-through rate for ranking, conversion rate for recommendations, dispute rate for fraud). The most important signals; the slowest to update.
- Coverage: % of requests where the model has the data it needs (feature lookup hit rate).
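Of the drift statistics named above, PSI (population stability index) is the simplest to compute: bin both samples the same way and sum (actual - expected) × ln(actual / expected) over the bins. A minimal sketch with hypothetical bin edges; the 0.25 alert level mentioned below is a common rule of thumb, not a universal threshold.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bin_edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population stability index between a training sample and a live sample."""

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            idx = sum(1 for edge in bin_edges if v >= edge)   # bin index for v
            counts[idx] += 1
        # Clamp empty bins to eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = list(range(100))               # feature values seen in training
live_same = list(range(100))              # identical distribution
live_shifted = [v + 30 for v in training] # upstream change shifted the feature

edges = [25, 50, 75]                      # quartiles of the training sample
psi_same = psi(training, live_same, edges)
psi_shifted = psi(training, live_shifted, edges)
print(psi_same, round(psi_shifted, 2))
```

Values above roughly 0.25 are commonly treated as a significant shift, so `psi_shifted` here would page someone while `psi_same` stays at zero.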
Lagging vs leading indicators
- Latency / errors: real-time.
- Drift: 5min-1hr lag.
- Accuracy proxies: hours to days.
You can't wait days for a regression. Drift is the fastest leading indicator we have for "the model is starting to be wrong"; alert on it aggressively.
Per-segment monitoring
Aggregate accuracy can hide segment regressions. A new model can be 1% better on average and 10% worse for users on Android. Slice monitoring by user cohort, geo, platform, traffic source.
Shadow scoring
Continuously run a known-good baseline model (the previous "champion") alongside production. Compare prediction agreement rate; alert when divergence grows. Catches model bugs that don't trigger drift alerts.
Audit log of predictions
For regulated use cases (lending, hiring, healthcare), every prediction is logged with input features + version + output + confidence. Required for compliance and for retroactive debugging.
Cost observability
- $/1K inferences per model.
- GPU $/hr per model (utilization × on-demand price).
- Wasted GPU $ from idle replicas.
Without cost dashboards, ML platform spend balloons. Mature teams alert on cost-per-inference regressions like they alert on latency regressions.
The "what changed" question
When a model regresses, the question is always "what changed?" Possibilities:
- New model version was deployed.
- Feature definition changed.
- Upstream data pipeline changed.
- Real-world distribution shifted (COVID-19, seasonal, viral content).
- Adversarial behavior.
A platform that lets the on-call engineer answer "what changed in the last hour for model X" wins. Versioned everything (model, features, routing config, infra) is the prerequisite.
Trade-offs
Online vs batch
Online for sub-second user-perceived latency; batch for everything else. Most "real-time" requirements are actually near-line - cached batch served fast. Resist building online when batch suffices.
One model per replica vs multi-model
One-per-replica is simpler and isolation-clean; multi-model improves utilization for long-tail models at the cost of complexity. Top-N models get dedicated replicas; long-tail packs onto shared GPUs.
TorchServe vs Triton vs vLLM vs custom
TorchServe for PyTorch-native simplicity; Triton for multi-framework + multi-model; vLLM (or TGI / TensorRT-LLM) for LLM serving. Custom only if existing servers genuinely don't fit - rare.
Self-hosted models vs managed APIs (OpenAI / Anthropic)
Managed for fast time-to-market and unmatched model quality; self-hosted for cost (5-10x cheaper at scale) and data privacy. Many teams use both - managed for the hardest queries, self-hosted for the bulk.
Autoscale on GPU util vs queue depth
GPU util reflects the bottleneck but lags; queue depth reacts faster but doesn't capture saturation. Combine: GPU util as steady-state target, queue depth + latency as spike triggers.
Pre-warmed pool vs cold scale-up
Pre-warmed pools cost idle GPU but eliminate cold-start lag. Cold scale-up is cheap but takes 30-90s. For latency-critical models, pay for warmth; for batch, scale cold.
Inline feature lookup vs caller-fetched
Inline (router fetches features, passes them to inference) centralizes the feature lookup but adds router complexity. Caller-fetched keeps the router simple but pushes feature-store knowledge to every caller. Inline is the modern default.
A/B vs shadow vs canary
Canary for low-risk deploys; A/B for business-impact validation; shadow for high-stakes architecture changes. Most teams need all three; building one is a step on the path to building the rest.
Buy a feature store vs build
Feast / Tecton / cloud feature stores remove most of the work; building gives full control and zero vendor lock-in. Below 100 features, build feels lighter; above, the platform pays for itself in correctness alone.
Common follow-up questions
Be ready for at least three of these. The first one is almost always asked.
- How would you serve a 70B-parameter LLM at $0.001 per 1K tokens?
- What's your story for cold-starting a multi-GB model in <10 seconds?
- How do you A/B test two models when the metric of interest is downstream conversion 30 days later?
- What happens when one tenant's model loops infinitely on a malformed input?
- How would you implement multi-tenant fair-share scheduling on a shared GPU pool?
- What's your detection + mitigation for training-serving feature skew?
- How do you handle a scheduled marketing email blast that triggers 100x traffic in 5 minutes?
- What's your strategy when a critical model regresses on a small but loud user cohort?