Design a Video Streaming Service (YouTube / Netflix)
Encoding ladders, adaptive bitrate, CDN economics, and the difference between live and VOD. Exabyte-scale storage meets millisecond-scale playback.
The problem
Design a service that lets users upload video, stores it durably, transcodes it into multiple qualities, and streams it to viewers globally with sub-second startup and adaptive quality. Variants: VOD (video on demand, like YouTube/Netflix), live streaming (Twitch/YouTube Live), or both.
Video is one of the highest-bandwidth, highest-cost design questions. The interesting decisions are: how to transcode efficiently, how the encoding ladder is structured, how the CDN absorbs petabit-scale traffic, and the operational cost of every choice.
Clarifying questions
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
- →Are we designing VOD, live, or both? They share storage but diverge sharply on the playback path.
- →What's the upload volume - hours of video per day? Average video length?
- →How many concurrent viewers at peak - 1M? 100M? The CDN strategy depends sharply.
- →What qualities do we serve - SD, HD, 4K? Mobile-only?
- →Are we serving global users (CDN strategy is critical) or single-region?
- →What's the latency budget for live? 30 seconds is acceptable; sub-second (low-latency live) is much harder.
- →Do we need DRM, watermarking, geo-blocking?
- →Monetization model - ads (which inject mid-stream), subscription, free?
Requirements
Functional requirements
- ·Upload video file or live stream
- ·Transcode into multiple qualities (encoding ladder)
- ·Generate streaming manifest (HLS or DASH)
- ·Stream video to client with adaptive bitrate
- ·Generate thumbnails, previews, scrub-bar art
- ·Serve via global CDN with low first-byte latency
- ·Track watch progress, recommendations input
Non-functional requirements
- Scale
- 1B DAU. 500 hours of video uploaded per minute (YouTube-scale). 100B watch events per day. Average video length 10 minutes; long-tail extends to 12-hour streams. Storage: exabyte-scale corpus.
- Latency
Time-to-first-frame p99 < 2 seconds (the user pressed play; they're waiting). Adaptive bitrate decisions in <1 second. Live latency: 15-30 seconds is standard; 2-5 seconds is the 'low-latency' tier (much harder).
- Availability
- 99.99% on playback. A 5-minute outage during a live event is catastrophic. Upload availability can be lower (99.9%); users will retry.
- Consistency
- Eventually consistent. A new upload appears in search within minutes. Watch progress syncs eventually across devices.
Capacity estimation
Storage
- 500 hours/minute × 60 × 24 × 365 = 263M hours/year of source video.
- Source size: 4K at 30Mbps = ~13.5 GB/hour. 263M × 13.5 GB = 3.5 EB/year of source video.
- The encoding ladder produces 5-7 outputs per source, but the lower rungs are tiny: the rung bitrates sum to ~1.6x the source, so source plus outputs is ~2.5x source (~9 EB/year), rising to 3-4x (10-14 EB/year) once top rungs are duplicated across codecs.
- 5-year retention: roughly 50-70 EB. This is exabyte-scale even after aggressive cold tiering.
Egress bandwidth
- 1B DAU × 30 min/day means ~21M viewers watching at any moment (1B × 30/1440). At 5 Mbps average, that's ~105 Tbps sustained, or ~1.1 EB/day of egress.
- Peak (Super Bowl, premiere): 5x average ≈ 500 Tbps - half a petabit per second. CDN economics dominate everything.
Transcoding compute
- 500 hours/min of input means 30,000 seconds of video arrive every second (500 × 3600 / 60). At ~1 GPU-second per video-second per ladder rung, keeping up takes ~30,000 GPU-equivalents for a single rung; the full ladder (5-7 outputs) pushes that to 150K-200K sustained for transcoding alone - which is why YouTube built custom transcoding ASICs.
Live ingestion
- For live streams, ingest peak might be 100K concurrent broadcasters at 5-10 Mbps each = 0.5-1 Tbps ingress. Edge ingestion + immediate transcoding pipeline.
Manifest serving
- HLS/DASH manifest requests: ~1 per minute per active viewer. ~21M average concurrent viewers × 1/min ≈ 350K qps, several times higher during peak events. Cached at edge (manifests are tiny and, for VOD, change rarely).
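These estimates are easy to get wrong by an order of magnitude, so it's worth sanity-checking them. A back-of-the-envelope script (the constants simply mirror the assumptions above):

```python
# Back-of-the-envelope capacity check; constants mirror the assumptions above.
UPLOAD_HOURS_PER_MIN = 500
SOURCE_MBPS = 30                                  # 4K source bitrate
LADDER_MBPS = [0.4, 0.8, 1.4, 3, 6, 12, 25]       # 7-rung ladder
DAU, WATCH_MIN_PER_DAY, AVG_STREAM_MBPS = 1e9, 30, 5

hours_per_year = UPLOAD_HOURS_PER_MIN * 60 * 24 * 365            # ~263M hours
gb_per_source_hour = SOURCE_MBPS / 8 * 3600 / 1000               # ~13.5 GB
source_eb_per_year = hours_per_year * gb_per_source_hour / 1e9   # ~3.5 EB

ladder_ratio = sum(LADDER_MBPS) / SOURCE_MBPS                    # ~1.6x source
total_eb_per_year = source_eb_per_year * (1 + ladder_ratio)      # ~9 EB single-codec

concurrent_viewers = DAU * WATCH_MIN_PER_DAY / (24 * 60)         # ~21M
egress_tbps = concurrent_viewers * AVG_STREAM_MBPS / 1e6         # ~105 Tbps

video_sec_per_sec = UPLOAD_HOURS_PER_MIN * 3600 / 60             # 30,000x real-time
gpus_full_ladder = video_sec_per_sec * len(LADDER_MBPS)          # ~210K GPU-equivalents

manifest_qps = concurrent_viewers / 60                           # ~350K qps

print(f"storage: {source_eb_per_year:.1f} EB source, {total_eb_per_year:.1f} EB total / yr")
print(f"egress: {egress_tbps:.0f} Tbps from {concurrent_viewers:.1e} concurrent viewers")
print(f"transcode: {gpus_full_ladder:,.0f} GPUs; manifests: {manifest_qps:,.0f} qps")
```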
High-level architecture
Two distinct pipelines share the storage layer.
Upload + transcode (offline)
Source upload → object storage → transcode workflow → encoding ladder outputs → CDN origin → CDN edge → viewer.
Live (streaming)
Broadcaster → ingest endpoint (regional) → live transcoder → segment storage (short retention) → CDN edge → viewer. Latency: ingest-to-viewer is the SLA.
Shared layers
Object storage (source + outputs), CDN, manifest service, analytics pipeline, recommendation/discovery service.
Upload service
Resumable uploads (TUS or S3 multipart). Validates, deduplicates, stores raw source in object storage. Triggers transcode workflow.
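A minimal sketch of the S3 multipart path (bucket, key, and part size are illustrative; the `boto3` multipart calls are the real API). Each part is retried independently, which is what makes the upload resumable:

```python
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "raw-video-sources", "uploads/video123.mp4"   # hypothetical names
PART_SIZE = 64 * 1024 * 1024                                # 64 MB parts

mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts = []
with open("video123.mp4", "rb") as f:
    part_number = 1
    while chunk := f.read(PART_SIZE):
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY, PartNumber=part_number,
                              UploadId=mpu["UploadId"], Body=chunk)
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1

# Completion is the atomic commit; an S3 event notification on this object
# is a natural trigger for the transcode workflow.
s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})
```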
Transcode workflow orchestrator
Step Functions / Airflow / proprietary. Splits source into chunks, fans out to GPU transcode workers, assembles outputs at each ladder rung, generates HLS/DASH manifest.
Transcode worker pool
GPU-accelerated. Encodes a chunk into one ladder rung output. Stateless workers; orchestrator handles state.
Object storage
Source video and all encoding ladder outputs. Multi-region replication for hot content; single-region cold tier for long-tail.
Manifest service
Generates and serves HLS/DASH manifests. Cached at CDN edge. Includes ad insertion markers if applicable.
CDN (origin + edge)
Edge servers globally. Caches segments. Origin pulls from object storage on miss. Petabit-scale capacity.
Live ingestion gateway
RTMP / WebRTC / SRT endpoints regionally distributed. Broadcasters connect; gateway forwards to live transcoder.
Live transcoder
Real-time transcode of incoming live streams. Outputs HLS segments (typically 2-6s each) to segment storage.
Watch service
Tracks per-user watch progress, history. Source for recommendations.
Analytics pipeline
Watch events, completion rates, quality metrics. Drives recommendations, ads, business analytics.
Deep dives
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
1. The encoding ladder: what to encode and why
An encoding ladder is the set of (resolution, bitrate, codec) tuples generated for each source video. The ladder is the central economic decision in video.
Standard ladder (illustrative)
| Rung | Resolution | Bitrate | Codec | Audience |
|---|---|---|---|---|
| 1 | 240p | 400 kbps | H.264 | Slow mobile |
| 2 | 360p | 800 kbps | H.264 | Mobile |
| 3 | 480p | 1.4 Mbps | H.264 | Mobile/wifi |
| 4 | 720p | 3 Mbps | H.264 | HD desktop/wifi |
| 5 | 1080p | 6 Mbps | H.264 | Full HD |
| 6 | 1440p | 12 Mbps | H.265 | High-end |
| 7 | 4K | 25 Mbps | H.265/AV1 | 4K screens |
Codec choices
- H.264 (AVC): universal compatibility; roughly half the bitrate of MPEG-2 at the same quality. Effectively free for most streaming uses.
- H.265 (HEVC): 50% more efficient than H.264 for same quality. Patent thickets - licensing $$.
- AV1: open, ~30% more efficient than H.265, but encoding cost is 5-10x.
- Most services use H.264 + H.265 + AV1 in tiers - H.264 universal, AV1 for the most-watched content where transcoding cost amortizes.
Per-title encoding (Netflix's innovation)
Different content compresses at different rates. A still-frame interview compresses far smaller than an action sequence at the same quality. Per-title encoding analyzes each video's complexity and produces a custom ladder, saving 20-30% of bandwidth.
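A minimal sketch of the idea, assuming `ffmpeg` is available: encode a short sample at constant quality, treat the resulting bitrate as a complexity proxy, and scale a default ladder. The scaling heuristic and reference constant are illustrative, not Netflix's actual method:

```python
import os
import subprocess

DEFAULT_LADDER_KBPS = {"240p": 400, "480p": 1400, "720p": 3000, "1080p": 6000}
REFERENCE_PROBE_KBPS = 4000   # assumed probe bitrate for "average" content

def probe_complexity(src: str, start_s: int = 60, dur_s: int = 30) -> float:
    """Encode a 30s sample at CRF 23; complex content yields a bigger file."""
    subprocess.run(["ffmpeg", "-y", "-ss", str(start_s), "-t", str(dur_s),
                    "-i", src, "-c:v", "libx264", "-crf", "23", "-an",
                    "probe.mp4"], check=True)
    probe_kbps = os.path.getsize("probe.mp4") * 8 / dur_s / 1000
    return probe_kbps / REFERENCE_PROBE_KBPS   # 1.0 = average complexity

def per_title_ladder(src: str) -> dict[str, int]:
    factor = min(max(probe_complexity(src), 0.6), 1.4)   # clamp the adjustment
    return {rung: int(kbps * factor) for rung, kbps in DEFAULT_LADDER_KBPS.items()}
```

A talking-head video might come back with factor 0.7 and ship a 4.2 Mbps 1080p rung instead of 6 Mbps; multiplied across billions of views, that is where the 20-30% saving comes from.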
Encoding chunks
Source video is split into 2-10 second chunks. Each chunk is independently encoded by a different worker. Embarrassingly parallel - 1 hour of video transcodes in 1 minute on 60 parallel workers.
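A sketch of the fan-out with `concurrent.futures` (real pipelines split on keyframe/GOP boundaries so chunks decode independently; the `-ss`/`-t` cut here is a simplification, and the rung list is truncated):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

CHUNK_S = 6
RUNGS = [("240p", "400k", "426x240"), ("720p", "3000k", "1280x720")]

def encode_chunk(src: str, chunk_idx: int, rung: tuple[str, str, str]) -> str:
    name, bitrate, size = rung
    out = f"{name}_{chunk_idx:05d}.mp4"
    subprocess.run(["ffmpeg", "-y", "-ss", str(chunk_idx * CHUNK_S),
                    "-t", str(CHUNK_S), "-i", src, "-c:v", "libx264",
                    "-b:v", bitrate, "-s", size, "-an", out], check=True)
    return out

def transcode(src: str, duration_s: int) -> list[str]:
    # Every (chunk, rung) pair is an independent job - embarrassingly parallel.
    # Next steps (not shown): concat chunks per rung, segment, write manifest.
    jobs = [(src, i, r) for i in range(duration_s // CHUNK_S) for r in RUNGS]
    with ThreadPoolExecutor(max_workers=64) as pool:
        return list(pool.map(lambda j: encode_chunk(*j), jobs))
```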
Storage bloat
A 1-hour 4K source = 13 GB. With a 7-rung ladder, total encoded outputs = ~25 GB (the lower rungs are tiny). Aggressive lifecycle: low-watch tail content can drop to 3 rungs (mobile-friendly only) after a year, saving storage.
2. Adaptive bitrate (HLS / DASH)
Adaptive bitrate (ABR) means the player switches between encoding ladder rungs as network conditions change. The two protocols are HLS (Apple) and DASH (everyone else).
Manifest structure
The manifest lists all available rungs and the URLs of their segment files. Player downloads the manifest, picks a starting rung based on initial bandwidth measurement, requests segments, monitors download time, switches rungs up or down.
```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=426x240
240p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720
720p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080
1080p.m3u8
```
Segment design
Segments are typically 2-10 seconds long. Shorter = faster quality switching, more manifest churn. Longer = smoother playback, slower adaptation.
- 6-second segments are a common compromise.
- Live often uses 2-second segments to minimize end-to-end latency.
Switching algorithm
Player measures recent download throughput. If measured bandwidth clears the next rung's bitrate with headroom (e.g., 1.5x), step up. If it falls below the current rung's bitrate, step down to avoid stalling. A minimal sketch follows the list below.
- Naive throughput-based ABR is sensitive to network jitter.
- Buffer-based ABR (BOLA, MPC) considers buffer fullness - drop quality if buffer is shallow even if throughput briefly recovers.
- ML-based ABR (Pensieve) trains on network traces; outperforms heuristic ABR by 10-15%.
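A minimal buffer-aware selector tying these ideas together (thresholds are illustrative):

```python
LADDER_BPS = [400_000, 800_000, 1_400_000, 3_000_000, 6_000_000]

def choose_rung(current: int, throughput_bps: float, buffer_s: float) -> int:
    """Pick the ladder index for the next segment request."""
    if buffer_s < 5:
        # Buffer nearly empty: drop a rung even if throughput looks fine -
        # a stall hurts more than a quality dip (the BOLA intuition).
        return max(current - 1, 0)
    if current + 1 < len(LADDER_BPS) and throughput_bps > LADDER_BPS[current + 1] * 1.5:
        return current + 1                    # comfortable headroom: step up
    if throughput_bps < LADDER_BPS[current]:
        return max(current - 1, 0)            # can't sustain this rung: step down
    return current

# choose_rung(2, throughput_bps=5_200_000, buffer_s=20) -> 3 (step up to 3 Mbps)
```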
Initial rung selection
The first segment is the riskiest: bandwidth is still unknown. Start low (e.g., 480p) for a fast first frame, then ramp up after 1-2 segments. Trade-off: starting low sacrifices initial quality; starting high risks slow startup or an immediate stall.
Pre-fetching and CDN warming
Player fetches the next 1-2 segments speculatively. The CDN must serve them with low TTFB - a slow CDN miss kills perceived quality. Aggressive edge caching is essential.
3. CDN economics and global delivery
Video delivery is a CDN problem. The transcoding pipeline produces outputs once; the CDN delivers them billions of times. CDN cost is the dominant operational expense.
Cache hit rate is everything
At 99% hit rate, the origin serves 1% of total egress; at 95%, it serves 5% - a 5x jump in origin egress cost. Origin storage cost is bounded by content volume, but delivery cost scales with viewers, so cache hit rate is a critical KPI.
Cache hierarchy
- Edge POPs (close to users): hot content. Tiny capacity per POP, lots of POPs.
- Mid-tier (regional): medium-warm content. Used as cache fill for edge.
- Origin (object storage): everything. Expensive to read; minimize hits.
The hot working set is small
80% of views go to <1% of content. The hot working set fits in edge cache. Long-tail content takes a small bandwidth share but lives mostly in origin.
Cache key design
Each segment is keyed by (video_id, ladder_rung, segment_id). Identical content → same key globally. Personalized content (ads inserted) breaks caching - design ad insertion to be client-side or done at origin once per ad.
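The whole scheme fits in one function (path layout hypothetical); what matters is that nothing per-user appears in the key:

```python
def segment_path(video_id: str, rung: str, segment_idx: int) -> str:
    # One key per (video, rung, segment): every viewer worldwide hits the
    # same cached object. A server-side ad stitched per user would make the
    # path unique per viewer and destroy the hit rate.
    return f"/v/{video_id}/{rung}/{segment_idx:06d}.ts"
```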
CDN strategy: build vs buy
- Buy: Akamai, Cloudflare, Fastly. Easy to start, expensive at exabyte-scale.
- Build: Netflix Open Connect, YouTube Edge Cache. Custom hardware in ISP networks. Saves 50-70% of CDN cost at scale.
- Hybrid: own POPs in primary markets + commercial CDN for long-tail geographies.
Multi-CDN
Many services use 2-3 CDN providers and route based on cost / performance / capacity. Multi-CDN offers redundancy (one CDN going down doesn't kill playback) and pricing leverage.
The interview signal
CDN cost is the dominant operational concern in video. Candidates who design without acknowledging CDN economics miss the central trade-off.
4. Live streaming: a different beast
Live streaming shares the storage and CDN layers but the playback path is different.
Ingest
Broadcasters connect to a regional ingest endpoint via RTMP, SRT, or WebRTC. Ingest gateway authenticates, forwards to live transcoder.
Real-time transcoding
Live transcoder must keep up with input. 1 hour of live = 1 hour of transcode (real-time, not batch). GPU-accelerated; same encoding ladder concept as VOD but with tighter latency.
Segment generation
Live transcoder writes new segments every 2-6 seconds. Each segment is uploaded to a segment store (object storage with very short retention) and added to the live manifest.
Manifest update
Live manifests are periodically refreshed - players poll for the latest segment list. Usual cadence: every 2-3 seconds.
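An illustrative live media playlist: a sliding window of the newest segments, `EXT-X-MEDIA-SEQUENCE` advancing as old segments fall off, and no `EXT-X-ENDLIST` because the stream is ongoing:

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:1042
#EXTINF:6.000,
seg_1042.ts
#EXTINF:6.000,
seg_1043.ts
#EXTINF:6.000,
seg_1044.ts
```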
Latency sources (where the seconds go in standard live)
- Encoding: 5-10 seconds of encoder buffer and lookahead.
- Segment buildup: 6 seconds - a full segment must exist before it can be published.
- CDN cache fill: 1-3 seconds.
- Player buffer: ~3 segments (12-18 seconds) held for stability vs freshness.
- Total: roughly 25-40 seconds end-to-end is typical for standard live.
Low-latency live (LL-HLS, DASH-LL, WebRTC)
- LL-HLS and LL-DASH deliver sub-second CMAF chunks with chunked transfer, so players start fetching before a full segment completes. End-to-end latency ~2-5 seconds.
- WebRTC: <1 second, but doesn't go through CDN (peer-to-peer / SFU). Different infrastructure.
- Trade-off: lower latency = more sensitive to network glitches = more rebuffers.
Live stream lifecycle
- Pre-stream: broadcaster authenticates, gets ingest endpoint URL.
- During: real-time pipeline runs, segments accumulate.
- End: stream is finalized, segments rolled into a VOD asset, transcoded again at higher quality (offline mode), made available as VOD.
Concurrent broadcasters at scale (Twitch)
At 100K+ concurrent broadcasters, the live transcoder pool dominates compute cost. Many broadcasters have very few viewers - cheap-encode for them; reserve high-quality encode for popular streams. Dynamic allocation based on viewer count.
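A sketch of viewer-count-based allocation (tiers and thresholds are illustrative):

```python
def ladder_for_stream(concurrent_viewers: int) -> list[str]:
    # A stream with 3 viewers doesn't justify a 7-rung GPU ladder.
    if concurrent_viewers < 50:
        return ["source-passthrough"]            # no transcode: relay the ingest
    if concurrent_viewers < 5_000:
        return ["480p", "source-passthrough"]    # one cheap rung for mobile
    return ["240p", "480p", "720p", "1080p", "source-passthrough"]

# Re-evaluate periodically: crossing a threshold spins encoders up or
# reclaims GPUs for other streams.
```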
5. Storage tiering and the long-tail
Exabyte-scale storage at petabyte cost requires tiering.
Hot tier (last 30 days)
SSD-backed object storage. Read-optimized. ~$0.025/GB/month. Holds: recently uploaded videos, currently popular content.
Warm tier (30 days - 1 year)
HDD-backed object storage. ~$0.008/GB/month. Holds: most videos. Reads still serve in milliseconds.
Cold tier (>1 year)
Glacier / nearline. ~$0.001/GB/month. Reads take minutes. Holds: long-tail content with very low watch rate.
Re-tiering
Background process: track watch rate per video over rolling 30-day window. Move videos with watch rate < threshold to colder tiers. Move videos with sudden watch spike to hotter tiers (e.g., a 5-year-old clip goes viral).
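The decision itself is simple; the work is in maintaining the rolling counts. A sketch with illustrative thresholds:

```python
HOT_THRESHOLD = 1_000   # views in 30 days to stay on SSD-backed storage
WARM_THRESHOLD = 10     # below this, content can go cold

def target_tier(watches_30d: int) -> str:
    if watches_30d >= HOT_THRESHOLD:
        return "hot"
    if watches_30d >= WARM_THRESHOLD:
        return "warm"
    return "cold"

# A batch job diffs target_tier() against each video's current tier and queues
# storage-class transitions; a viral spike promotes a video back to hot via a
# streaming fast path rather than waiting for the next batch.
```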
Source vs ladder retention
Source video is the most expensive (4K = 13 GB/hour). Outputs are cheaper. Some services delete the source after 1 year and re-transcode from the highest-rung output if needed (loses some quality but saves PB).
Replication strategy
- Hot: 2-region replication for HA + DR.
- Warm: single-region with erasure coding (saves 30-50% over 3x replication).
- Cold: single-region. Restore time is part of the SLA.
Lifecycle automation
S3 / equivalent lifecycle policies. Encode the rules declaratively (after N days, move to a colder tier). Don't write your own lifecycle service - cloud providers do this well.
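For the age-based rules, the policy is just configuration. A sketch using S3's lifecycle API (bucket name and prefix are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="video-ladder-outputs",
    LifecycleConfiguration={"Rules": [{
        "ID": "tier-by-age",
        "Filter": {"Prefix": "v/"},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},    # warm after 30 days
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},  # cold after 1 year
        ],
    }]},
)
```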
The economics
At YouTube-scale (1+ exabyte hot, 100s of EB warm/cold), storage tiering saves $100M+/year vs all-hot. This is one of the highest-leverage operational decisions in the system.
6. Search, discovery, and recommendations
The video pipeline is necessary but not sufficient. Discovery is what keeps users engaged.
Search index
Title, description, tags, and transcript indexed in Elasticsearch or equivalent, updated on upload. Engagement signals (views, like ratio) feed ranking.
Recommendations pipeline
A two-tower neural network is the common approach today. Inputs: user history, video metadata, contextual features. Output: top-K candidate videos, re-ranked by an online model for freshness, diversity, and calibration.
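At serving time, retrieval reduces to a dot product between the user embedding and precomputed video embeddings. A toy sketch with random vectors (production systems use an approximate-nearest-neighbor index rather than a dense matmul):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_VIDEOS, K = 64, 1_000_000, 500

video_embs = rng.standard_normal((N_VIDEOS, D)).astype(np.float32)  # video tower, offline
user_emb = rng.standard_normal(D).astype(np.float32)                # user tower, per request

scores = video_embs @ user_emb                     # affinity for every video
topk = np.argpartition(scores, -K)[-K:]            # unordered top-K candidates
candidates = topk[np.argsort(scores[topk])[::-1]]  # sorted, handed to the re-ranker
```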
Watch graph
User → video edges with weights (watch duration, engagement). Used both for collaborative filtering and for content discovery. Stored in graph DB or as sparse matrix in object storage.
Trending detection
Real-time: video gets X views in Y minutes → flag as trending. Needs streaming pipeline (Flink, Spark Streaming) consuming watch events.
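A single-process stand-in for the idea (the real version is a keyed window in Flink/Spark Streaming; X and Y here are illustrative):

```python
import time
from collections import defaultdict, deque

WINDOW_S = 600         # Y = 10 minutes
THRESHOLD = 100_000    # X = views within the window

views: dict[str, deque] = defaultdict(deque)

def on_watch_event(video_id: str, ts: float | None = None) -> bool:
    """Record a view; return True once the video qualifies as trending."""
    now = ts if ts is not None else time.time()
    q = views[video_id]
    q.append(now)
    while q and q[0] < now - WINDOW_S:   # evict views older than the window
        q.popleft()
    return len(q) >= THRESHOLD
```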
Personalization vs editorial
Pure personalization creates filter bubbles. Most platforms blend personalized recommendations (recommended-for-you) with editorial and trending picks (what's hot globally).
Cold-start (new content)
A just-uploaded video has no watch signals. Bootstrap with: creator's prior history, content metadata (transcript, visual features from CV models), thumbnail click-through rate. After 1-10K views, use real engagement.
Cold-start (new users)
Newly registered. Show globally trending + a wide content sample. Lock in preferences quickly via implicit signals.
The interview signal
Designing the playback pipeline correctly without the discovery layer means designing a successful upload tool, not a successful product. Always note discovery as a critical adjacent system.
Trade-offs
Encoding ladder breadth vs storage cost
More ladder rungs = better quality match per device = better experience. More rungs = more storage. Most services run 5-7 rungs and reduce to 3-4 for long-tail content.
Codec choice: licensing vs efficiency
H.264 is universal but bandwidth-expensive. H.265 saves 50% bandwidth but costs licensing. AV1 saves another 30% but costs encoding compute. Mix codecs strategically by content tier and viewer device.
Live latency vs stability
Low-latency live (sub-second) requires shorter segments, smaller buffers, more rebuffer risk. Standard live (30s) is more stable but feels less "live." Match the latency tier to the use case (sports → low-latency, recorded events → standard).
CDN: build vs buy
Buy is the right answer at small scale. Build is the right answer at exabyte scale. The crossover happens around $50M/year of CDN spend.
Storage tier aggressiveness
Aggressive cold-tiering saves money but adds latency for long-tail content (rare views take minutes to restore). Conservative tiering keeps everything fast but expensive. Tune based on actual watch distribution.
Per-title encoding investment
Per-title encoding saves 20-30% bandwidth but costs analysis compute on every upload. Worth it for high-watch content; overkill for low-watch.
Common follow-up questions
Be ready for at least three of these. The first one is almost always asked.
- ?How would you support DRM (Widevine, FairPlay) without breaking the CDN cache?
- ?What changes for a sports event with 100M concurrent live viewers?
- ?How do you ensure regional content blocking (geo-restrictions)?
- ?How would you migrate from H.264 to H.265 across the existing corpus?
- ?What's your strategy for offline downloads (Netflix, YouTube Premium)?
- ?How would you support real-time collaboration features on live streams (chat, polls, reactions)?
- ?How do you handle content moderation at upload time (CSAM, copyright)?
- ?What changes for a low-bandwidth market (India 2G/3G)?
Practice in interview format
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
Start an AI mock interview →