Design a Load Balancer (L4 vs L7, Envoy / HAProxy / ALB)
L4 vs L7, consistent hashing, health checks, connection draining, and the difference between a fleet that survives partial failures and one that cascades into outage.
The problem
Design a load balancer that fronts a fleet of backend servers, distributes incoming connections across them, detects unhealthy backends and removes them from rotation, drains connections cleanly during deploys, and survives its own component failures without dropping traffic.
This is the canonical "thing every system has but nobody designs from scratch" problem. Strong candidates separate L4 vs L7 explicitly, walk through health-check semantics, and explain why consistent hashing matters for cache-affinity workloads. Excellent candidates discuss the LB itself as a tier that needs HA - "who load-balances the load balancers?".
Clarifying questions
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
- Workload type - HTTP, gRPC, raw TCP (database connections), UDP (DNS, QUIC)?
- Throughput - 10K, 100K, 1M req/sec? Aggregate Gbps?
- TLS termination at LB or end-to-end?
- Sticky sessions required, or stateless backends?
- Inside one VPC, across regions, or both?
- Health check granularity - per-host or per-backend-instance?
- Failure isolation - protect a noisy tenant from impacting others?
- Hardware (F5, Citrix) acceptable, or software-only (Envoy, HAProxy)?
Requirements
Functional requirements
- Distribute incoming connections across N backends per a configured policy
- Health-check backends; remove unhealthy from rotation; restore on recovery
- TLS termination with SNI; optional re-encrypt to backends
- Connection draining on backend removal (in-flight requests complete)
- Sticky sessions via cookie or hash for stateful backends
- Per-backend / per-route weighting and failover
- Observability: per-backend metrics (RPS, latency, errors), connection counts
Non-functional requirements
- Scale: 1M req/sec peak across the LB tier. 100K active backends. Multi-AZ deployment. 100 Gbps aggregate throughput.
- Latency: LB-added latency p99 < 5ms (excluding backend response). Health check loop p99 < 5s for failure detection. Connection drain p99 < 30s for in-flight completion.
- Availability: 99.99% - the LB tier must be more reliable than the backends it fronts. Failure of a single LB instance should be invisible to clients (anycast / DNS round-robin / VIP failover).
- Consistency: Eventually consistent backend membership. A removed backend may still receive traffic for a few seconds. Sticky-session affinity is best-effort during topology changes.
Capacity estimation
Throughput
- 1M req/sec across the LB tier. Per-request overhead ~10us at L7, ~1us at L4. CPU-bound on TLS handshakes (~1ms each cold, ~50us resumed) and L7 parsing.
- A single Envoy / HAProxy instance handles 100K-200K req/sec on commodity hardware, so 8 instances cover the peak at ~125K req/sec each; add instances beyond that for failure headroom.
Connections
- HTTP/1.1: 1 conn per active client. 1M req/sec at 100 req/sec/client → 10K active conns.
- HTTP/2: 1 conn carries many streams. Same load → 1K-10K conns.
- TCP fanout to backends: per-conn or pooled. With pooling, ~100-1000 backend conns per LB instance.
Memory
- Per-conn state (L7): ~10KB. 100K conns × 10KB = 1 GB. Trivial.
- Routing config: backend addresses, weights, route rules. ~10MB even for huge clusters.
- TLS session cache: 100K sessions × ~1KB = 100MB.
Health check overhead
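- 100K backends × 1 check / 5s = 20K checks/sec. Each takes ~1ms of wall-clock time, most of it waiting on the network. At ~1ms of actual CPU per check this would be ~20 CPU-seconds per second, so the per-check CPU cost must stay in the microsecond range (or checks must be sharded across LB instances) for the overhead to be negligible.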
Failure detection latency
- Aggressive (1s interval, 3 fails): detect in 3s. False positives on transient blips.
- Standard (5s interval, 3 fails): detect in 15s.
- Conservative (30s, 3 fails): detect in 90s. Too slow for production.
Network throughput
- 100 Gbps NIC bonded × 8 LB instances × 50% utilization = 400 Gbps headroom. Enough for any reasonable workload.
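The arithmetic in this section can be reproduced with a short script. This is a sketch that only re-derives the figures quoted above; the variable names are illustrative, and the 125K req/sec per-instance figure is an assumed point inside the quoted 100K-200K range.

```python
# Back-of-envelope capacity figures, using the numbers quoted in this section.
import math

peak_rps = 1_000_000          # peak across the LB tier
per_instance_rps = 125_000    # assumption: a point inside the quoted 100K-200K range
instances = math.ceil(peak_rps / per_instance_rps)

active_conns = 100_000
conn_state_bytes = 10_000     # ~10KB of per-connection state at L7
conn_memory_gb = active_conns * conn_state_bytes / 1e9

backends = 100_000
check_interval_s = 5
checks_per_sec = backends / check_interval_s


def detection_time_s(interval_s, unhealthy_threshold):
    """Approximate time to mark a backend unhealthy: N consecutive failed probes."""
    return interval_s * unhealthy_threshold


print(instances, conn_memory_gb, checks_per_sec, detection_time_s(5, 3))
```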
High-level architecture
A modern software load balancer is a userspace proxy with three components: a data plane that handles connections, a control plane that distributes config and health, and an observability plane that emits metrics.
Connections enter via a VIP (virtual IP) shared across LB instances - either by anycast routing or a virtual IP managed by the cloud provider (ALB, NLB, GCP LB). Traffic is distributed to LB instances by L4 hashing (5-tuple) or by ECMP routing.
Each LB instance terminates the connection (L4 or L7 depending on type), looks up the backend pool for the destination route, picks a backend per the load balancing algorithm, and forwards.
The defining engineering challenge: keeping configuration consistent across LB instances while serving traffic continuously. A new backend added to the pool should start receiving traffic across all instances within seconds; a failed backend should stop receiving traffic within seconds; in-flight connections to draining backends should complete cleanly.
VIP / front-end IP
Stable IP that clients connect to. Backed by anycast (cloud LBs), keepalived/VRRP (on-prem), or BGP. Failover to surviving LB instances on any single failure.
L4 distributor
First-touch layer that hashes the 5-tuple (src ip+port, dst ip+port, proto) to pick an LB instance. Stateless. Used in NLB-style architectures.
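A minimal sketch of the 5-tuple hash described above. The function name and the use of SHA-256 with a modulo are assumptions chosen for readability; real L4 distributors typically hash in the kernel or in hardware.

```python
import hashlib


def pick_lb_instance(src_ip, src_port, dst_ip, dst_port, proto, instances):
    """Map a flow's 5-tuple to one LB instance.

    The mapping is stateless: every packet of the same flow hashes to the
    same instance as long as the instance list does not change.
    """
    key = f"{src_ip}:{src_port}|{dst_ip}:{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


lbs = ["lb-1", "lb-2", "lb-3", "lb-4"]
print(pick_lb_instance("203.0.113.7", 51514, "198.51.100.1", 443, "tcp", lbs))
```

Plain modulo hashing remaps most flows when the instance count changes, which is one reason the consistent-hashing variants discussed later are preferred when membership changes often.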
L7 proxy (Envoy / HAProxy)
Userspace process. Handles TLS termination, HTTP parsing, route matching, header manipulation, retries. The 'load balancer' as developers think of it.
Backend pool
Set of healthy backends per route. Updated by service discovery + health checks. Each backend has weight, current connection count, last RTT.
Service discovery integration
Reads backend membership from a discovery system (Consul, etcd, Kubernetes endpoints, AWS service discovery). Pushes updates to the LB data plane in seconds.
Health checker
Active probes (HTTP GET /health, TCP connect) and passive probes (count consecutive errors). Removes failing backends; restores on recovery.
TLS termination
Per-listener cert (SNI multiplex). Optional re-encrypt to backend for in-transit encryption.
Connection drain manager
On backend removal, stops new connections but lets in-flight finish (or kills after timeout).
Routing engine
Maps incoming request to backend pool. L7 routing by host header, path, method; L4 routing by destination port. Rules updated via control plane.
Metrics / tracing exporter
Per-route, per-backend metrics (RPS, p50/p99 latency, 4xx/5xx counts). Distributed tracing headers propagated. Drives autoscaling and debugging.
Control plane
Manages config (routes, certs, health policies), pushes to data plane via xDS or similar protocol. Eventually consistent across instances; cluster-wide changes propagate in seconds.
Deep dives
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
1. L4 vs L7: when to terminate at which layer
The fundamental dichotomy. L4 operates on TCP/UDP - it sees packets, ports, IPs. L7 operates on the application protocol - HTTP, gRPC. The choice constrains everything downstream.
L4 load balancer
Operates at the transport layer. Forwards TCP/UDP packets without parsing them. Picks a backend per connection (not per request); subsequent packets in the same connection go to the same backend.
Pros:
- Transparent to application protocol. Works for HTTPS, gRPC, raw TCP, custom protocols.
- Very fast - microsecond-level overhead. Can be accelerated further with kernel bypass (DPDK) or hardware offload.
- Preserves client IP if using DSR (direct server return) or proxy protocol.
Cons:
- Can't make routing decisions on URL, header, or method. Routing is by destination port only.
- Can't terminate TLS (without becoming L7).
- Stickiness is per connection - every stream on a long-lived HTTP/2 connection lands on the same backend.
Examples: AWS NLB, Google Cloud Network LB, Linux IPVS, Cilium L4LB.
L7 load balancer
Operates at the application layer. Terminates the connection, parses the protocol, makes per-request routing decisions, optionally re-establishes connections to backends.
Pros:
- Routes by URL, host header, method, cookies. Necessary for microservices.
- Terminates TLS centrally.
- Per-request retries, timeouts, circuit breakers.
- Header manipulation, request rewriting.
Cons:
- More CPU per request (~10us overhead).
- Adds a hop (extra TCP connection: client→LB and LB→backend).
- Connection multiplexing changes (one LB→backend conn carries many client requests).
Examples: Envoy, HAProxy, NGINX, AWS ALB, Google Cloud HTTP(S) LB, Traefik.
The hybrid: L4 in front of L7
Common pattern: L4 LB distributes connections across L7 LB instances. L4 handles raw throughput; L7 handles routing.
Why: a single L7 instance might cap at 100K req/sec; an L4 in front spreads load across 8 L7 instances and absorbs DDoS at line rate.
The TLS question
Terminating TLS in an L4-style tier (offloaded to hardware or the kernel) is fast but inflexible, and strictly speaking the balancer is no longer pure L4 once it decrypts. Terminating at L7 means userspace TLS - more CPU but allows per-request inspection (WAF, header-based routing).
Modern stack: TLS at L7 with CPU crypto acceleration (AES-NI, AVX-512). Best of both.
The sticky session question
L4 stickiness: per-TCP-connection. Free; just don't move the connection.
L7 stickiness: per-cookie or per-IP-hash. Requires backend selection logic.
For HTTP/1.1 with frequent reconnects, L7 stickiness via cookie is the standard.
The HTTP/2 multiplexing trap
HTTP/2 carries many streams per TCP connection. With L4 stickiness, all streams from one client go to the same backend. With L7, each request can route independently.
Much microservice traffic is HTTP/2 (gRPC). With L4-only balancing, load is spread per connection rather than per request, so a few long-lived connections can pin most of the traffic to a few backends. L7 is the right answer.
2. Load balancing algorithms: round-robin, least-connections, consistent hash
The algorithm matters more than the LB tier. Picking wrong creates hotspots that look like LB bugs.
Round-robin (RR)
Cycle through backends in order. Simplest. Each backend gets 1/N of new connections.
Pros: simple, fair on uniform workloads.
Cons: doesn't account for backend differences (CPU, memory, slow query). Slow backend gets the same load as fast - tail latency suffers.
Weighted round-robin
Each backend has a weight. Distribution proportional to weights.
Use case: heterogeneous fleet, gradual rollouts (new version gets 1% weight).
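One common way to implement weighted round-robin is the "smooth" variant, which interleaves picks instead of sending bursts to the heaviest backend. The sketch below is illustrative; the class name and backend names are assumptions.

```python
class SmoothWeightedRoundRobin:
    """Smooth weighted round-robin over a fixed set of backends."""

    def __init__(self, weights):
        self.weights = dict(weights)                 # backend -> configured weight
        self.current = {name: 0 for name in self.weights}
        self.total = sum(self.weights.values())

    def pick(self):
        # Every backend earns its weight each round...
        for name, weight in self.weights.items():
            self.current[name] += weight
        # ...the backend with the highest running score wins...
        chosen = max(self.current, key=self.current.get)
        # ...and pays back the total, so it is not picked again right away.
        self.current[chosen] -= self.total
        return chosen


wrr = SmoothWeightedRoundRobin({"a": 5, "b": 1, "c": 1})
print([wrr.pick() for _ in range(7)])
```

Over any full cycle of 7 picks, "a" is chosen 5 times and "b" and "c" once each, with the lighter backends spread through the cycle rather than grouped at the end.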
Least connections
Send new request to the backend with the fewest in-flight connections. Self-corrects: a slow backend accumulates connections, gets less new work.
Pros: handles heterogeneous response times naturally.
Cons: requires connection tracking - more state. Can swamp a recovering backend (it has 0 connections initially, so it gets a pile-on of new work).
Variant: "least requests" - count requests, not connections. Better for HTTP/2 where one connection holds many in-flight.
Latency-based / EWMA
Track a moving average of latency per backend; route to lowest-latency.
Use case: cross-region routing where some backends are physically slow.
Cons: latency is noisy; needs careful smoothing (EWMA over ~30s).
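The smoothing step can be sketched as a standard exponentially weighted moving average. The function names and the alpha value are assumptions; in practice alpha is tuned so the average reflects roughly the window you care about.

```python
def ewma_update(previous_ms, sample_ms, alpha):
    """Blend a new latency sample into the running average.

    alpha close to 0 smooths heavily; alpha close to 1 tracks the latest sample.
    """
    return alpha * sample_ms + (1 - alpha) * previous_ms


def pick_lowest_latency(ewma_by_backend):
    """Route to the backend with the lowest smoothed latency."""
    return min(ewma_by_backend, key=ewma_by_backend.get)


latency = {"us-east": 12.0, "eu-west": 85.0}
# A single 200ms spike moves the smoothed value only slightly.
latency["us-east"] = ewma_update(latency["us-east"], 200.0, alpha=0.1)
print(latency["us-east"], pick_lowest_latency(latency))
```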
Power of two choices
Pick 2 backends randomly; route to the less loaded. Provably close to optimal with much less state than full least-connections tracking.
Used in modern proxies as a default. Tail latency can drop several-fold (~5x in favorable cases) vs round-robin under load asymmetry; the exact gain depends on the workload.
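A minimal sketch of power of two choices, assuming the LB already tracks an in-flight count per backend. The function name and the dict-based bookkeeping are illustrative.

```python
import random


def pick_power_of_two(in_flight, rng=random):
    """Sample two distinct backends at random and return the less loaded one.

    in_flight maps backend name -> current in-flight request count.
    Requires at least two backends.
    """
    first, second = rng.sample(list(in_flight), 2)
    return first if in_flight[first] <= in_flight[second] else second


load = {"a": 0, "b": 5, "c": 9}
print(pick_power_of_two(load))
```

Only two counters are read per decision, and the most loaded backend can never win a comparison, which is where the tail-latency benefit comes from.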
Consistent hashing
Hash the request key (URL, user ID, session ID) to a backend. Same key always routes to the same backend (modulo membership changes).
Use cases:
- Cache-friendly: same key hits same backend's local cache. Hit rate stays high.
- Session affinity without sticky cookies: hash on session ID; backend pool change moves only ~1/N keys.
- Sharded backends (each backend owns part of the key space).
Implementation: ring of virtual nodes (each backend gets ~100 virtual positions). Key hashes to a position; route to the next backend clockwise. Adding a backend redistributes ~1/N of keys; removing redistributes ~1/N back.
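The ring described above can be sketched in a few lines. This is an illustration, not a production implementation: the class name, the use of SHA-256, and the "backend#i" virtual-node naming are assumptions.

```python
import bisect
import hashlib


def _position(value):
    """Place a string on the ring as a 64-bit integer."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends, vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted list of (ring position, backend)
        for backend in backends:
            self.add(backend)

    def add(self, backend):
        for i in range(self.vnodes):
            bisect.insort(self._points, (_position(f"{backend}#{i}"), backend))

    def remove(self, backend):
        self._points = [p for p in self._points if p[1] != backend]

    def lookup(self, key):
        if not self._points:
            raise LookupError("no backends in ring")
        # First virtual node at or after the key's position, wrapping around.
        idx = bisect.bisect_right(self._points, (_position(key), ""))
        return self._points[idx % len(self._points)][1]


ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.lookup("user:42"))
```

Removing a backend only remaps the keys that were on it; every other key keeps its backend, which is the property that protects cache hit rates during membership changes.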
Maglev / rendezvous hashing
Variants of consistent hashing with better distribution properties. Maglev (Google's LB) precomputes a permutation table for O(1) lookup. Used at scale where the consistent-hash math overhead matters.
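Rendezvous (highest-random-weight) hashing is the simpler of the two to sketch: score every backend against the key and pick the highest score. The function name and hash choice are assumptions, and the Maglev lookup table is not shown here.

```python
import hashlib


def rendezvous_pick(key, backends):
    """Return the backend with the highest hash score for this key."""

    def score(backend):
        digest = hashlib.sha256(f"{backend}|{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)


print(rendezvous_pick("user:42", ["cache-1", "cache-2", "cache-3"]))
```

Lookup cost is O(N) in the number of backends, so this form suits small pools; a precomputed table is the usual answer when N is large.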
Random
Pick a backend randomly. Surprisingly competitive with round-robin on uniform workloads. Used as a fallback.
Algorithm choice by workload
- Stateless backends, uniform load: round-robin or random.
- Heterogeneous backends or response times: least-connections / power-of-two.
- Cache affinity matters: consistent hash.
- Strict per-key affinity: consistent hash with replication for failover.
3. Health checks: active, passive, and the noisy-neighbor problem
A backend that returns 5xx but stays in rotation cascades errors to users. Health checks decide who's in.
Active health checks
LB sends periodic probes (HTTP GET /health, TCP connect). Backend responds healthy or unhealthy.
Tunables:
- Interval: how often to probe (1-30s). Lower = faster detection, more probe load.
- Timeout: how long to wait for response (1-10s).
- Healthy threshold: N consecutive successes to mark healthy.
- Unhealthy threshold: N consecutive failures to mark unhealthy.
Standard: 5s interval, 2s timeout, 2 healthy, 3 unhealthy. Detect in 15s.
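The threshold logic can be sketched as a small state machine fed one probe result at a time. The class name is hypothetical; the defaults mirror the standard settings above.

```python
class HealthState:
    """Tracks one backend's health from consecutive probe results."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=3, healthy=True):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = healthy
        self.successes = 0
        self.failures = 0

    def record(self, probe_ok):
        """Record one probe result and return the resulting health state."""
        if probe_ok:
            self.successes += 1
            self.failures = 0  # any success breaks a failure streak
            if not self.healthy and self.successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0  # any failure breaks a success streak
            if self.healthy and self.failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


state = HealthState()
print([state.record(ok) for ok in (False, False, False, True, True)])
```

A timed-out probe would be recorded as a failure, so the probe timeout adds to the detection time on top of interval × threshold.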
Passive health checks (outlier detection)
Watch real traffic for errors. Backends with >X% errors over a window get ejected.
Pros: works on the actual traffic pattern, not synthetic probes. Catches issues active checks miss (e.g., specific endpoints failing).
Cons: needs sample size; risk of false positives during low traffic.
Typical config: eject if >50% errors in last 30s with min 5 requests; recheck every minute, restore if healthy.
Active + passive are complementary - run both.
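A sketch of the passive ejection rule, assuming the LB records the outcome of each real request with a timestamp. The class name is hypothetical and the defaults mirror the typical config above; the periodic recheck and restore step is left out.

```python
from collections import deque


class OutlierDetector:
    """Sliding-window error-rate check for one backend."""

    def __init__(self, window_s=30.0, max_error_rate=0.5, min_requests=5):
        self.window_s = window_s
        self.max_error_rate = max_error_rate
        self.min_requests = min_requests
        self.samples = deque()  # (timestamp, is_error), oldest first

    def record(self, now, is_error):
        self.samples.append((now, is_error))

    def should_eject(self, now):
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()
        total = len(self.samples)
        if total < self.min_requests:
            return False  # too little traffic to judge
        errors = sum(1 for _, is_error in self.samples if is_error)
        return errors / total > self.max_error_rate


detector = OutlierDetector()
for t in range(5):
    detector.record(float(t), is_error=True)
print(detector.should_eject(now=5.0))
```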
The /health endpoint design
Naive: return 200 always. Useless - it doesn't reflect actual readiness.
Better: check the backend's dependencies (DB ping, downstream service ping). Return 503 if any critical dependency is down.
Risk: dependency check makes /health slow or correlates failures (DB blip → all backends unhealthy simultaneously → no backends serve, total outage).
Production pattern: distinguish liveness ("am I alive?") from readiness ("can I serve a request?"). Liveness restarts the process; readiness only removes from LB rotation.
Cascading failure: the "everyone unhealthy" scenario
A shared dependency (database) blips. All backends fail their /health checks. LB removes all from rotation. No backends → 503 to users.
Mitigations:
- Health check should not depend on shared dependencies (or use a fast cached signal).
- Panic threshold: if >50% of backends are unhealthy, ignore health checks and route to all anyway. Better to serve degraded than not at all.
- Decouple deep checks from surface checks - run deep check less often, expose the result via cached endpoint.
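The panic threshold from the list above can be sketched as a filter over the health map. The function name and signature are assumptions, not a specific proxy's API.

```python
def routable_backends(health, panic_threshold=0.5):
    """Return the backends the LB should route to.

    health maps backend name -> True (healthy) or False (unhealthy).
    If more than panic_threshold of the pool is unhealthy, health is ignored
    and every backend is returned: degraded service beats no service.
    """
    if not health:
        return []
    healthy = [name for name, ok in health.items() if ok]
    unhealthy_count = len(health) - len(healthy)
    if unhealthy_count > panic_threshold * len(health):
        return list(health)  # panic mode: route to everyone
    return healthy


print(routable_backends({"a": True, "b": True, "c": False}))
print(routable_backends({"a": False, "b": False, "c": False}))
```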
Slow start / warmup
A newly-healthy backend gets traffic ramped up over T seconds. Avoids thundering herd on a cold-cache backend.
Implementation: weight grows linearly from 0 to full over warmup window.
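The linear ramp can be written as a one-line formula. The function and parameter names below are illustrative.

```python
def slow_start_weight(full_weight, seconds_since_healthy, warmup_s):
    """Effective weight of a backend that became healthy some seconds ago."""
    if warmup_s <= 0 or seconds_since_healthy >= warmup_s:
        return full_weight
    elapsed = max(seconds_since_healthy, 0.0)
    return full_weight * elapsed / warmup_s


for t in (0, 15, 30, 60):
    print(t, slow_start_weight(100, t, warmup_s=30))
```

A weight of exactly 0 at t=0 means the backend receives nothing at first; some implementations start from a small non-zero floor instead, which would be a one-line change here.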
The "fluttering" backend
A backend that flips between healthy/unhealthy every minute floods logs and confuses ops. Mitigation: hysteresis - higher threshold to mark healthy than to mark unhealthy. Or rate-limit state changes.
4. Connection draining and graceful shutdown
Removing a backend isn't instant. In-flight requests must complete; new connections must stop.
The naive approach (broken)
Remove backend from rotation. Kill the process. In-flight requests get connection-reset. Users see errors.
Connection draining
- Mark backend as draining.
- LB stops sending new connections (or new requests on existing connections).
- Existing in-flight requests complete normally.
- After drain timeout (typically 30s-5min), backend can shut down.
Drain timeout chosen by p99 of request duration + buffer. Web apps: 30s. Long-lived gRPC: 5+ min.
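The drain steps above can be sketched as bookkeeping on the LB side. The class is hypothetical and takes the current time as an argument so the logic stays deterministic; a real proxy would drive it from its event loop.

```python
class DrainingBackend:
    """LB-side view of one backend during connection draining."""

    def __init__(self, drain_timeout_s):
        self.drain_timeout_s = drain_timeout_s
        self.in_flight = 0
        self.draining_since = None  # None means the backend is in rotation

    def accepts_new_requests(self):
        return self.draining_since is None

    def start_request(self):
        if not self.accepts_new_requests():
            raise RuntimeError("backend is draining")
        self.in_flight += 1

    def finish_request(self):
        self.in_flight -= 1

    def begin_drain(self, now):
        self.draining_since = now

    def can_shut_down(self, now):
        """True once in-flight work is done or the drain timeout has expired."""
        if self.draining_since is None:
            return False
        timed_out = now - self.draining_since >= self.drain_timeout_s
        return self.in_flight == 0 or timed_out


backend = DrainingBackend(drain_timeout_s=30)
backend.start_request()
backend.begin_drain(now=0.0)
print(backend.can_shut_down(now=5.0))   # still serving one request
backend.finish_request()
print(backend.can_shut_down(now=6.0))   # drained
```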
HTTP/1.1 vs HTTP/2 draining
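HTTP/1.1: a connection carries one request at a time, and keep-alive reuses it for later requests. Drain = stop assigning new connections and close kept-alive connections once their current response completes (e.g., with a Connection: close header).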
HTTP/2: many streams per connection. New requests can arrive on existing connections. Drain requires:
- LB stops accepting new connections to draining backend.
- LB sends GOAWAY frame on existing HTTP/2 connections, asking client to reconnect.
- Client reconnects; LB routes the new connection elsewhere.
Without GOAWAY, an HTTP/2 client might hold the connection for hours.
Backend-initiated drain
For SIGTERM-based shutdown (Kubernetes pod termination):
- Container receives SIGTERM.
- Backend marks itself unhealthy on /health.
- The LB's next health check sees unhealthy and removes the backend from rotation (5-15s lag).
- Backend continues serving in-flight; rejects new with 503 if any sneak in.
- Backend exits when in-flight count = 0 (or after grace period).
Variations: backend pre-emptively closes idle connections; backend signals LB via API to drain immediately.
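The backend side of this sequence can be sketched with Python's standard signal module. The class and method names are hypothetical, and wiring health_status into an actual /health handler is left to the web framework in use.

```python
import signal


class GracefulShutdown:
    """Backend-side shutdown state for SIGTERM-based draining."""

    def __init__(self):
        self.ready = True
        self.in_flight = 0  # maintained by the request handler

    def handle_sigterm(self, signum, frame):
        # Only flip readiness; in-flight requests keep running.
        self.ready = False

    def health_status(self):
        """HTTP status the /health endpoint should return."""
        return 200 if self.ready else 503

    def should_exit(self, seconds_since_sigterm, grace_period_s):
        if self.ready:
            return False
        return self.in_flight == 0 or seconds_since_sigterm >= grace_period_s


def install(shutdown):
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, shutdown.handle_sigterm)
```

The grace period should exceed the LB's detection lag (interval × unhealthy threshold); otherwise the process can exit while the LB is still sending it traffic.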
The 503-during-deploy bug
Common pattern: deploys cause brief 503 spikes because LB hasn't yet marked the new backend healthy / old one drained.
Causes:
- Health check interval too long (15s before LB notices new backend).
- LB config update lag (control plane pushes config slowly).
- Slow start not configured (new backend overwhelmed before warmup).
Fix: shorter health-check intervals during deploys; pre-warmup; rolling deploy with N-at-a-time, not all-at-once.
Stuck connections
A backend that hangs (deadlock, infinite loop) doesn't drain - in-flight requests never complete. Drain timeout fires, LB severs the connection, user sees error.
Mitigations:
- Server-side request timeout. Backend kills its own request after Tmax; drain proceeds.
- Aggressive drain: after timeout, send TCP RST. User loses one request; deploy continues.
Production answer: server-side timeout strictly less than drain timeout.
5. Sticky sessions and the alternative
Sticky sessions tie a client to a backend. Useful for stateful apps; harmful for elasticity.
Why sticky exists
Session state stored in backend memory: shopping cart, partial form, in-progress upload. If the user's next request hits a different backend, the state is gone.
The "right" answer is to externalize state (Redis, DB) so any backend can serve any request. But many legacy apps don't, and rewriting is expensive.
Sticky implementations
- Cookie-based: LB injects (or reads) a cookie identifying the backend. Subsequent requests with the cookie route to the same backend.
- IP-based: hash client IP to backend. Sticks unless IP changes (mobile clients on cellular fail this).
- Header-based: hash a session ID header.
- Connection-based (L4): TCP connection sticks to one backend; HTTP/1.1 keep-alive maintains stickiness within a connection.
Cookie sticky in detail
LB sets a cookie (e.g., AWSALB or app-defined). Cookie value identifies backend. Lifetime usually session-bounded.
Pros: works through NAT, mobile networks. Survives reconnects.
Cons: clients without cookies (curl, bots) lose stickiness. JavaScript can read the cookie unless it is marked HttpOnly - mild privacy concern.
Sticky failure modes
- Backend dies: sticky cookie still points to it. Next request fails. LB must detect, fall back to a healthy backend, possibly issue new cookie.
- Deploys: sticky breaks during rolling restarts. Mitigation: drain + sticky decay.
- Hot user: a high-traffic user is sticky to one backend, can saturate it.
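The "backend dies" fallback from the list above can be sketched as follows. The function name is hypothetical, and the cookie here carries the raw backend name for clarity; a real LB would use an opaque or signed value.

```python
import random


def route_with_sticky_cookie(cookie_backend, healthy_backends, rng=random):
    """Return (backend, new_cookie_value).

    new_cookie_value is None when the existing cookie is still valid,
    otherwise it is the value the LB should set on the response.
    """
    if not healthy_backends:
        raise LookupError("no healthy backends")
    if cookie_backend in healthy_backends:
        return cookie_backend, None
    # No cookie, or the cookie points at a dead backend: pick again and re-issue.
    chosen = rng.choice(sorted(healthy_backends))
    return chosen, chosen


print(route_with_sticky_cookie("b", {"a", "b"}))
print(route_with_sticky_cookie("dead", {"a"}))
```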
The "consistent hash + state externalization" alternative
Modern pattern: backends are stateless; state lives in a sharded KV store. LB hashes the request key to a backend - same backend each time, but if the backend dies, traffic moves to a neighbor without losing state.
Combined with consistent hashing on the LB, the cache locality benefit of sticky is preserved without the failure-mode pain.
When you actually need sticky
- Legacy apps with backend-memory state. Migrate to externalized state instead if possible.
- WebSocket / long-poll connections. The LB needs to keep the same TCP connection on the same backend; this is implicit, not "sticky" in the configurable sense.
- Stateful cache locality. Use consistent hash, not sticky cookies.
The interview signal: candidates who explain when not to use sticky score higher than candidates who default to it.
6. Hardware vs software vs cloud-managed: the build vs buy spectrum
The LB market has three tiers, each with sharp trade-offs.
Hardware appliances (F5 Big-IP, Citrix NetScaler, A10)
Dedicated boxes with custom ASICs. Multi-Gbps line-rate processing.
Pros:
- Highest throughput (40-400 Gbps per box).
- Battle-tested for decades.
- Strong vendor support (paid).
Cons:
- Capex - single boxes cost $50K-$500K+.
- Static capacity - hard to scale up.
- Vendor lock-in. Config languages are proprietary.
- Slow innovation - features lag cloud-native by years.
Use case: regulated industries, on-prem deployments, ultra-high throughput.
Software LBs (HAProxy, NGINX, Envoy)
Userspace processes on commodity Linux servers.
Pros:
- Commodity hardware. Scale by adding servers.
- Open source; free or cheap.
- Rich features. Envoy in particular is the de-facto standard for service-mesh data planes.
- Updateable on your own schedule.
Cons:
- Lower per-instance throughput (~100K-200K req/sec). Needs horizontal scaling.
- Operational burden: deploy, upgrade, monitor yourself.
- TLS / crypto in software costs CPU (mitigated by AES-NI).
Use case: most production deployments. Service meshes (Istio, Consul Connect) standardize on Envoy.
Cloud-managed (AWS ALB / NLB, GCP LB, Azure LB)
Pay-per-use. No infra to manage.
Pros:
- No ops burden.
- Auto-scales transparently.
- Integrated with cloud features (IAM, security groups, autoscaling groups).
- Managed TLS certificates (e.g., ACM on AWS).
Cons:
- Cost scales with traffic (and can surprise at scale).
- Less flexibility - cloud-defined feature set.
- Vendor lock-in.
- Sometimes weird limits (e.g., ALB max connections, certificate per-LB caps).
Use case: startups and most cloud-native apps. Default choice.
The progression
- Cloud-managed (ALB/NLB) for the public edge. Cheap to start, scales automatically.
- Software LB inside the VPC for service-to-service (Envoy in a service mesh). Richer features, lower cost than per-call ALB.
- Hardware only in specialized cases (on-prem datacenters, regulated industries, multi-Tbps loads).
The cost curve
Cloud LB pricing (ALB, approximate): ~$0.025/hr + $0.008 per LCU-hour. At 10K req/sec sustained, ~$1K/month per ALB. At 100K req/sec, ~$10K/month.
Self-hosted Envoy on a c6i.4xlarge: $500/month, handles 100-200K req/sec. 20x cheaper at scale.
Crossover point: ~10K-20K req/sec sustained. Below: managed wins. Above: self-hosted (with HA) wins.
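The crossover can be checked with the figures quoted above. This is a rough model under stated assumptions: managed cost scales linearly at ~$1K/month per 10K req/sec, one self-hosted instance costs $500/month and is sized at 100K req/sec, and a minimum of two instances is kept for HA (the two-instance floor is an assumption, not a figure from this section).

```python
import math


def managed_cost_per_month(rps):
    """Linear model from the quoted figure: ~$1K/month per 10K req/sec."""
    return 1_000 * rps / 10_000


def self_hosted_cost_per_month(rps, instance_cost=500, instance_rps=100_000,
                               min_instances=2):
    """Instance count is driven by load, with a floor for HA."""
    count = max(min_instances, math.ceil(rps / instance_rps))
    return count * instance_cost


for rps in (5_000, 10_000, 100_000):
    print(rps, managed_cost_per_month(rps), self_hosted_cost_per_month(rps))
```

Under these assumptions the two curves meet at about 10K req/sec, consistent with the crossover range above. The model ignores engineering time, which usually pushes the real crossover higher.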
Service mesh as an LB layer
Modern microservices push LB into a per-service sidecar (Envoy, Linkerd). Each service instance has a dedicated proxy that handles its inbound and outbound traffic. Net effect: every service-to-service call goes through a proxy → consistent observability, retries, circuit breakers, mTLS.
The "LB tier" disappears as a separate concern - it's now ambient infrastructure.
Trade-offs
L4 vs L7
L4 is faster and protocol-agnostic; L7 enables URL/header routing and TLS termination. Most production has L4 in front of L7 - the L4 tier handles raw throughput; L7 handles routing.
Algorithm choice
Round-robin for uniform; least-connections / power-of-two for heterogeneous; consistent hash for cache-affine. Pick based on the actual workload, not "what feels safest".
Sticky sessions vs externalized state
Sticky is a workaround for backend-memory state. Externalize state when possible - sticky breaks on backend failure and complicates scaling.
Active vs passive health checks
Active catches failures cleanly; passive catches issues active misses (specific endpoint errors). Run both; passive is more important than people think.
Connection drain timeout
Too short: in-flight requests dropped. Too long: deploys take forever. Set to p99 request duration + buffer.
Hardware vs software vs managed
Managed for low scale and simplicity; software for flexibility and cost at scale; hardware only for specialized cases. Most companies start managed and add software for service-to-service.
TLS at LB vs end-to-end
LB termination centralizes cert management; end-to-end is required for compliance / zero-trust. Modern stacks do LB→backend re-encrypt for both.
Per-LB-instance failover via VIP vs DNS
VIP (anycast / VRRP) fails over in seconds but is complex to operate. DNS is simple but failover is bound by TTL (~60s typical) and client-side caching. VIP is the standard for production; DNS is acceptable for low-stakes.
Aggressive vs conservative health checks
Aggressive (1s interval, 3 fails) detects in ~3s but flaps on transient blips. Conservative (30s, 3 fails) is stable but takes ~90s. Standard: 5s, 3 fails (~15s) - middle ground.
Service mesh sidecar vs central LB
Sidecar (Envoy per pod) has per-service control + telemetry; high resource overhead per pod. Central LB shares resources but loses per-service granularity. For service-to-service, sidecars are winning; for ingress, central remains.
Common follow-up questions
Be ready for at least three of these. The first one is almost always asked.
- How would you implement zero-downtime LB upgrades?
- What changes if 90% of traffic is HTTP/2 with long-lived connections?
- How do you load-balance across regions with active-active?
- What's your strategy when a backend has a slow response that ties up connections?
- How would you protect a fleet from one noisy tenant via the LB?
- How do you migrate from sticky-cookie sessions to externalized state without dropping users?
- What's your plan when the LB itself becomes a bottleneck?
- How would you implement a canary deploy via traffic-weighted routing?