Design a Ride-Share Dispatch (Uber / Lyft)
Geo-indexing, real-time matching, ETA prediction, and surge. The canonical geo-spatial design problem with hard real-time constraints.
The problem
Design the dispatch backend for a ride-share product. Riders open the app and request a ride; the system finds nearby drivers, picks one, notifies driver and rider, tracks the trip in real time, and completes payment.
This is the canonical geo-spatial system design question. Strong candidates demonstrate fluency in spatial indexing (geohash, S2, H3), real-time location updates, matching algorithms, and the operational layer (ETAs, surge, fraud).
Clarifying questions
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
- What's the scale - cities, drivers per city, ride requests per second at peak?
- Are we designing dispatch only, or also routing/navigation/payment/maps?
- Single product (one ride type) or multi-product (UberX, Pool, Eats)? Pool changes the matching algorithm dramatically.
- How fresh do driver locations need to be - 1 second, 5 seconds, 30 seconds?
- What's the matching objective - shortest ETA, balance driver utilization, surge pricing optimization?
- What's the scope of the ride - city-only, intercity, airport runs?
- Failure model - what if dispatch is down for 30 seconds in a major city?
- Is this for a generic ride-share, food delivery, or both? Last-mile delivery has different constraints.
Requirements
Functional requirements
- Drivers upload their location every few seconds while online
- Rider requests ride with pickup + dropoff
- System finds candidate drivers near pickup
- Match a single driver, notify both parties
- Track trip in real time; share location with rider
- Compute fare, run payment, surface to both apps
- Provide ETA estimates pre-request and during pickup
Non-functional requirements
- Scale
- 10M concurrent drivers globally. 100K ride requests per second at peak. 1B trips per day. Average trip: 15 minutes. 500 cities.
- Latency
- Match decision p99 < 2 seconds (rider is staring at the spinner). Driver location updates: tolerable to be 3-5 seconds stale. Trip status updates to rider: < 1 second after change.
- Availability
- 99.99% in active markets. A 5-minute dispatch outage during commute hours is a major incident.
- Consistency
- Eventual consistency on driver locations (stale by seconds). Strong consistency on match assignment (a driver is matched to exactly one rider; no double-booking).
Capacity estimation
Driver location updates
- 10M drivers × 1 update / 4 seconds = 2.5M updates/sec sustained.
- Each update: lat/lon + driver_id + status = ~100 bytes. Network: 250 MB/s ingress.
- Storage: hot (last 24 hours) for analytics replay. 2.5M × 100B × 86400 = ~22 TB/day. Tier old data to S3.
Match queries
- 100K match requests/sec, each doing a spatial query: "drivers within 5km of (lat, lon)." Each query touches ~10-100 candidate drivers depending on density.
- Spatial index reads: ~10M reads/sec (100K queries × up to ~100 candidates each). Memory-resident structure.
Storage
- Trip records: 1B trips/day × 5KB metadata = 5 TB/day. Multi-year retention with tiering.
- Driver state (online/offline, available/on-trip): 10M rows, ~1KB each = 10 GB. Hot KV store.
Maps and routing
- Routing graphs (road networks) per city: 1-10 GB each. Total: ~5 TB across 500 cities.
- Routing engine queries: 100K rps at match time, plus ETA queries during trips.
High-level architecture
Two real-time tiers and one supporting tier.
Real-time location ingestion: drivers stream locations to the system every few seconds. Locations index into a spatial structure for nearby-search.
Real-time matching: rider requests match → spatial query → candidate drivers → ranking → assignment → notification.
Operational support: trip lifecycle (start, in-progress, end), pricing (surge), payment, ETA service, fraud detection, dispatcher analytics.
The hardest piece is the spatial indexing layer because it's both write-heavy (driver updates) and read-heavy (match queries) with low latency on both.
Driver location gateway
Receives driver location updates over persistent connection (WebSocket or QUIC). Validates, forwards to spatial index update pipeline.
Spatial index (geohash / S2 / H3)
In-memory sharded index of driver_id → location, indexed by spatial cell. Enables 'find drivers in cells covering 5km radius around (lat, lon).' Sharded by spatial region.
Match service
Stateless. Given a ride request: query spatial index for candidates, rank by ETA + driver acceptance probability + surge optimization, propose match to driver.
Trip orchestrator
Tracks per-trip state machine (requested, assigned, picking up, in-trip, completed). Strongly consistent (DynamoDB conditional writes). Source of truth.
ETA service
Real-time ETA computation. Uses live traffic, road graph, historical times. Called pre-request, during match selection, and continuously during pickup.
Routing engine
Per-city road graphs. Computes shortest-path with current traffic. Heavy precomputation (contraction hierarchies) + online traffic updates.
Pricing / surge service
Per-region demand vs supply ratio. Computes surge multipliers that drive the surge map shown to drivers and riders.
Notification service
Push notifications + in-app sockets to rider and driver. Match found, driver arriving, trip status updates.
Payment + ledger
Strongly consistent ledger of trip charges. Asynchronous payment processing post-trip. Decoupled from dispatch hot path.
Deep dives
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
1. Spatial indexing: geohash vs S2 vs H3
The central data structure is a spatial index that supports "find all drivers within R meters of (lat, lon)" in milliseconds.
Geohash
Encode (lat, lon) as a base32 string where each character refines the cell. Prefix-shared strings = nearby cells.
- Pro: simple, sortable, works in any KV store.
- Con: cell shapes are rectangular and distorted near the poles. "Nearby cell" can be far away across hash-prefix boundaries.
- Used by early Uber, many tutorials.
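To make the prefix property concrete, here's a from-scratch sketch of the encoding (production systems use a library). The bit-interleaving is the standard geohash algorithm; the precision choice and example coordinates are illustrative.

```python
# Sketch of geohash encoding from scratch: interleave longitude/latitude bits,
# emit 5 bits per base32 character. Nearby points share prefixes, which is the
# whole trick. Precision 7 (~150m cells) is an illustrative choice.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, precision: int = 7) -> str:
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    use_lon, bits, bit_count, out = True, 0, 0, []
    while len(out) < precision:
        rng, val = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if val >= mid:
            bits |= 1
            rng[0] = mid          # keep the upper half
        else:
            rng[1] = mid          # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:        # 5 bits per base32 character
            out.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(out)

# Two pickups a block apart typically agree on the first ~6 characters,
# unless they straddle a cell boundary - the geohash weakness noted above.
print(geohash(37.7880, -122.4075), geohash(37.7885, -122.4070))
```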
S2 (Google)
Project Earth onto an inscribed cube; subdivide each face into a quad-tree. Each cell = 64-bit S2 cell ID.
- Pro: spherical correctness. Cells nest hierarchically (multi-resolution queries).
- Con: complex math. Cell shapes still aren't quite uniform.
- Used by Google Maps, Pokemon Go, MongoDB geo features.
H3 (Uber)
Hexagonal grid. Each cell is a regular hexagon at the chosen resolution.
- Pro: hexagons have 6 equidistant neighbors (squares have 4 close + 4 diagonal). Better-shaped neighborhoods.
- Pro: open source, well-documented, used in production at Uber.
- Con: more complex than geohash.
- Used by Uber, modern ride-share systems, and Facebook.
Recommendation: H3 for new systems. Geohash for the simplest possible answer. S2 if you're integrating with Google's stack.
Index implementation
- Hot: cell_id → set of driver_ids. Redis Cluster, sharded by cell. Updates on every driver location refresh.
- Match query: compute the set of cells overlapping the search circle (k-ring), union the drivers in those cells, distance-filter, rank.
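A minimal sketch of that read path, assuming h3-py (v4 API) and redis-py; the key layout (cell:{h3_id} → set of driver_ids, loc:{driver_id} → "lat,lon"), resolution, and ring size are illustrative.

```python
# Sketch of the read path: cover the search circle with a ring of H3 cells,
# union the per-cell driver sets, then distance-filter. Assumes h3-py v4 and
# redis-py; key names are illustrative.
import math

import h3
import redis

r = redis.Redis(decode_responses=True)
RES = 9  # index resolution

def haversine_km(lat1, lon1, lat2, lon2):
    rad = math.radians
    dlat, dlon = rad(lat2 - lat1), rad(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(rad(lat1)) * math.cos(rad(lat2)) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

def drivers_near(lat, lon, radius_km=5.0, k=3):
    # k (ring size) should really be derived from radius_km and the cell size
    # at RES; fixed here to keep the sketch short.
    center = h3.latlng_to_cell(lat, lon, RES)
    cells = h3.grid_disk(center, k)
    candidates = r.sunion([f"cell:{c}" for c in cells])
    nearby = []
    for driver_id in candidates:
        loc = r.get(f"loc:{driver_id}")
        if not loc:
            continue                      # location key expired: stale membership
        d_lat, d_lon = map(float, loc.split(","))
        dist = haversine_km(lat, lon, d_lat, d_lon)
        if dist <= radius_km:
            nearby.append((driver_id, dist))
    return sorted(nearby, key=lambda x: x[1])
```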
Sharding strategy
Shard the index by spatial region (city, country). All drivers in a city sit on the same cluster. Cross-shard queries are rare (airport pickups crossing boundaries).
Resolution choice
Match resolution to typical search radius. An H3 res-9 hexagon is a few hundred meters across - good for dense urban areas. Lower resolutions (larger cells) for rural areas.
2. Driver location updates: write throughput
2.5M location updates per second is heavy. Naive "update one row in DB per update" doesn't scale.
Update path
- Driver app sends update over persistent connection (WebSocket or HTTP/2 stream) every 4-8 seconds.
- Gateway receives, validates (driver authenticated, location plausible).
- Forward to spatial-index updater for the driver's region.
- Updater removes driver from old cell, adds to new cell. Atomic in Redis (Lua script).
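A hedged sketch of that remove/add step, assuming redis-py and a per-region Redis shard (a real Redis Cluster deployment would need hash-tagged keys so the script touches a single slot). Key names and the 60-second TTL are illustrative; the TTL refresh is also what lets idle drivers age out of the index (see "Driver going offline" below).

```python
# Sketch: atomically move a driver between cells on each location update.
# The Lua script reads the old cell, fixes set membership, and refreshes the
# location + TTL in one atomic unit on the regional shard.
import redis

r = redis.Redis(decode_responses=True)  # one instance per regional shard assumed

MOVE_DRIVER = r.register_script("""
local driver_id, new_cell, loc, ttl = ARGV[1], ARGV[2], ARGV[3], ARGV[4]
local old_cell = redis.call('GET', 'drivercell:' .. driver_id)
if old_cell and old_cell ~= new_cell then
  redis.call('SREM', 'cell:' .. old_cell, driver_id)
end
redis.call('SADD', 'cell:' .. new_cell, driver_id)
redis.call('SET', 'loc:' .. driver_id, loc, 'EX', ttl)
redis.call('SET', 'drivercell:' .. driver_id, new_cell, 'EX', ttl)
""")

def apply_update(driver_id: str, lat: float, lon: float, new_cell: str, ttl_s: int = 60):
    # TTL refreshed on every update; a silent driver's loc:/drivercell: keys
    # expire and stale cell-set membership is cleaned lazily on read.
    MOVE_DRIVER(keys=[], args=[driver_id, new_cell, f"{lat},{lon}", ttl_s])
```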
Write batching
At 2.5M updates/sec, updating 2.5M Redis keys/sec needs ~50-100 Redis nodes. Batch by cell - if many drivers update at once, batch their cell-membership changes into a single MULTI/EXEC. Cuts ops by 5-10x.
Conflation
If a driver sends multiple updates that are in flight at once (network jitter), only the latest matters. Conflate at the gateway: keep only the latest update per driver in memory and flush on a fixed cadence (e.g., 2 Hz). Reduces tail load on Redis.
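A sketch of that conflation buffer - latest-wins per driver, flushed on a timer. `flush_to_index` stands in for the batched spatial-index write; the cadence is illustrative.

```python
# Sketch of gateway-side conflation: keep only the newest update per driver
# and flush on a fixed cadence, so bursts of in-flight updates collapse to
# one write per driver per interval.
import threading
import time

class Conflator:
    def __init__(self, flush_to_index, interval_s=0.5):
        self.latest = {}                       # driver_id -> (lat, lon, ts)
        self.lock = threading.Lock()
        self.flush_to_index = flush_to_index
        self.interval_s = interval_s
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def on_update(self, driver_id, lat, lon, ts):
        with self.lock:
            prev = self.latest.get(driver_id)
            if prev is None or ts >= prev[2]:  # drop out-of-date arrivals
                self.latest[driver_id] = (lat, lon, ts)

    def _flush_loop(self):
        while True:
            time.sleep(self.interval_s)
            with self.lock:
                batch, self.latest = self.latest, {}
            if batch:
                self.flush_to_index(batch)     # one batched write per cadence
```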
Persistence vs ephemeral
The spatial index is rebuildable from current driver state. Don't persist every update. Persist a snapshot every minute for fast recovery. Crash recovery: replay live updates from connected drivers.
Driver going offline
Active flag with TTL. Each location update refreshes the TTL (60 seconds). No update for 60 seconds → driver is removed from the index automatically. No explicit "offline" message required.
Privacy
Driver locations are PII. Encrypted in flight. Aggregated location (heatmaps) for analytics; raw location only retained as long as needed for support / forensics. Don't log raw locations to wide-access systems.
3. Matching algorithm: closest ETA isn't always the answer
Naive: pick the driver with the shortest ETA. Real systems optimize multiple objectives.
ETA-only matching
Pick driver minimizing ETA-to-pickup. Simple. Works as v1.
- Problem: ignores driver utilization. A nearby driver might be about to drop off; a slightly farther driver might be free now.
- Problem: can starve drivers in low-density areas (always picking the closest creates clustering).
Multi-factor ranking
Score = w1·(-ETA) + w2·driver_acceptance_rate + w3·driver_idle_time + w4·rider_score + w5·surge_optimality.
- Acceptance rate: drivers who routinely decline waste rider time.
- Idle time: prefer drivers waiting longer (fairness).
- Surge optimality: in surge zones, prefer drivers who can complete this trip and reposition for more surge work.
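A sketch of that weighted score; the weights, normalization constants, and feature ranges are illustrative (production systems learn them offline).

```python
# Sketch of the multi-factor ranking score from the formula above.
from dataclasses import dataclass

@dataclass
class Candidate:
    driver_id: str
    eta_s: float              # driver-to-pickup ETA, seconds
    acceptance_rate: float    # 0..1
    idle_time_s: float        # seconds since last dropoff
    rider_score: float        # 0..1
    surge_optimality: float   # 0..1, how well this trip positions the driver

WEIGHTS = dict(eta=1.0, accept=0.4, idle=0.2, rider=0.1, surge=0.3)

def score(c: Candidate) -> float:
    return (
        WEIGHTS["eta"] * (-c.eta_s / 600.0)              # ETA scaled to roughly [0, 1]
        + WEIGHTS["accept"] * c.acceptance_rate
        + WEIGHTS["idle"] * min(c.idle_time_s / 900.0, 1.0)
        + WEIGHTS["rider"] * c.rider_score
        + WEIGHTS["surge"] * c.surge_optimality
    )

def rank(candidates):
    return sorted(candidates, key=score, reverse=True)
```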
Pool / shared rides
Match multiple riders to the same driver if their routes overlap.
- Constraint: detour for each rider stays under a threshold (e.g., < 15% of trip distance or < 5 minutes of added ETA).
- Constraint: vehicle capacity.
- Optimization: maximize platform revenue / minimize total distance / minimize per-rider ETA increase.
- Algorithm: variant of dynamic ridesharing or insertion heuristics. Computationally hard at scale.
Batching matches
Naive matching is per-request. Better: collect requests in 1-2 second windows, solve a batch optimization over (riders × candidate drivers).
- Allows globally better matches.
- Adds 1-2 seconds latency. Only used for non-time-critical products.
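One way to sketch the batch solve is as a plain assignment problem over the window's rider × driver ETA matrix, here with scipy's Hungarian-algorithm solver. This is a simplification - real shared-ride matching adds capacity and detour constraints.

```python
# Sketch of windowed batch matching: build a rider x driver cost matrix
# (ETA seconds, BIG if infeasible) and solve it globally.
import numpy as np
from scipy.optimize import linear_sum_assignment

BIG = 10_000.0  # effectively "not a feasible pair"

def batch_match(eta_matrix):
    """eta_matrix[i][j] = ETA (s) from driver j to rider i, or None if infeasible."""
    cost = np.array([[BIG if e is None else e for e in row] for row in eta_matrix])
    riders, drivers = linear_sum_assignment(cost)
    return [(r, d) for r, d in zip(riders, drivers) if cost[r, d] < BIG]

# Example: 2 riders, 2 drivers collected over a ~1s window.
pairs = batch_match([
    [90, 100],
    [100, 400],
])
# Per-request greedy would give rider 0 -> driver 0 (90s) and leave rider 1
# with a 400s pickup; the batch solve returns r0 -> d1, r1 -> d0 (100s each).
```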
Match acceptance
Driver might decline. Two strategies:
- Sequential: offer to driver 1; if declined, offer to driver 2; etc.
- Parallel: offer to top-K drivers; first accept wins.
- Parallel is faster for the rider but creates "phantom dispatches" for drivers who accept just after someone else has already won the ride.
Real systems: sequential by default, with parallel for low-acceptance markets.
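A sketch of the sequential strategy; `offer_and_wait` is a hypothetical notification-service call that resolves to the driver's accept/decline, and the timeout is illustrative.

```python
# Sketch of sequential dispatch: walk the ranked candidate list, give each
# driver a short window to accept, move on if they decline or time out.
import asyncio

OFFER_TIMEOUT_S = 12

async def dispatch_sequential(ride_id, ranked_drivers, offer_and_wait):
    for driver in ranked_drivers:
        try:
            accepted = await asyncio.wait_for(
                offer_and_wait(ride_id, driver), timeout=OFFER_TIMEOUT_S
            )
        except asyncio.TimeoutError:
            continue                      # no response: treat as a decline
        if accepted:
            return driver                 # trip orchestrator records the match
    return None                           # exhausted candidates: widen the search
```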
4. ETA prediction: hard ML problem in disguise
ETA appears multiple times: pre-request preview, during match selection, during pickup, during trip. Each is a query against a real-time ML system.
Components of ETA
- Path: shortest route under current traffic.
- Travel time: per-edge time estimates from current speed observations.
- Pickup time: driver-to-pickup-location, accounting for parking / approach.
- Wait time: driver acceptance + boarding overhead.
Routing graph
Per-city road network. Nodes = intersections, edges = road segments. Edge weight = expected travel time at time t.
- Precompute: contraction hierarchies for fast shortest-path queries (~ms).
- Live: update edge weights from current traffic observations every minute or so.
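Contraction hierarchies are too much to sketch here, but plain Dijkstra over live edge weights shows the data flow: the graph is static, the per-edge times are swapped in by the traffic pipeline. The graph representation below is illustrative.

```python
# Sketch: shortest travel time over a road graph whose edge weights are
# refreshed from live traffic. Plain Dijkstra for clarity; production systems
# answer the same query with contraction hierarchies in ~ms.
import heapq

# graph[node] = list of (neighbor, edge_id); live_time_s[edge_id] = current seconds
def travel_time_s(graph, live_time_s, src, dst):
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue                          # stale heap entry
        for neighbor, edge_id in graph.get(node, []):
            nd = d + live_time_s[edge_id]     # weight refreshed every ~minute
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return None                               # unreachable
```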
Live traffic input
Speed observations from in-trip drivers ("probe vehicles"). With 10M concurrent drivers globally, the platform has unusually dense real-time traffic coverage.
ML refinement
Pure shortest-path × current speeds is a baseline. ML adds:
- Time-of-day effects (rush hour patterns).
- Day-of-week patterns.
- Weather signals.
- Event signals (game ending, concert).
- Pickup-location-specific ETAs (this address always takes 60s extra because of parking).
Continuous re-estimation
During pickup, ETA is recomputed every 30 seconds and pushed to the rider's app. Sets expectations and reduces cancellations.
ETA accuracy is a product KPI
"Driver arriving in 2 minutes" that turns into 8 minutes destroys trust. Internal accuracy targets: median error < 30 seconds, p95 error < 2 minutes. Monitored continuously.
5. Surge pricing: market-clearing mechanism
Demand spikes (rain, a stadium letting out, commute rush) outstrip supply. Surge pricing rebalances by raising fares to attract more drivers and discourage marginal demand.
Per-cell supply-demand ratio
Compute over ~5-minute rolling window: (active requests in cell) / (available drivers in cell). Above threshold → apply surge multiplier (1.2x, 1.5x, 2x, etc.).
Geographic granularity
Surge applies per cell (H3 resolution-7 or similar - few-km hexes). Adjacent cells can have different multipliers; smoothing prevents sharp boundaries that game the system.
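A sketch of both steps - rolling-window supply/demand ratio mapped to a multiplier, then smoothing across neighboring hexes. The thresholds and multiplier ladder are illustrative, and `h3.grid_disk` assumes h3-py's v4 API.

```python
# Sketch: per-cell surge from a rolling supply/demand ratio, then smoothed
# across adjacent hexes so neighboring cells don't have jarring jumps.
import h3

LADDER = [(3.0, 2.0), (2.0, 1.5), (1.5, 1.2)]  # (demand/supply ratio, multiplier)

def raw_surge(requests_5m: int, available_drivers: int) -> float:
    ratio = requests_5m / max(available_drivers, 1)
    for threshold, multiplier in LADDER:
        if ratio >= threshold:
            return multiplier
    return 1.0

def smoothed_surge(cell: str, raw_by_cell: dict) -> float:
    # Average a cell's raw multiplier with its immediate hex neighbors.
    ring = h3.grid_disk(cell, 1)
    values = [raw_by_cell.get(c, 1.0) for c in ring]
    return round(sum(values) / len(values), 2)
```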
Driver and rider visibility
- Driver app shows surge zones as a heatmap. Drivers reposition into surge zones. This is the supply-side response.
- Rider sees the surge multiplier before committing. Some riders cancel; some pay.
Optimization horizon
Naive surge is short-term reactive. Modern systems forecast surge 15-30 minutes ahead and incentivize drivers to position preemptively.
Anti-gaming
- Drivers who hover at surge boundaries to game the system are detected and demoted in matching.
- Coordinated driver "log-off" actions (drivers pretend to log off to spike surge) are detected by analyzing log-off timing patterns.
Fairness considerations
Surge in low-income areas is regulated in some jurisdictions. Some platforms cap surge at certain levels for accessibility services. Surge during emergencies (hurricanes) is typically suspended, with drivers compensated for pre-positioning via base-fare boosts instead.
The economics
Surge is the price discovery mechanism. It's not just revenue extraction - it actively shifts driver supply to where demand is. Platforms that don't surge end up with hour-long waits.
6. Trip lifecycle and the consistency boundary
Once a match is made, the trip enters a strongly consistent state machine. This is where the dispatch system intersects payment, fraud, and support.
State machine
- requested → assigned → driver_arrived → in_trip → completed → paid
- (Side states: cancelled_by_rider, cancelled_by_driver, no_show)
Each transition is an atomic write to the trip orchestrator's DB (DynamoDB conditional update or relational DB with row lock). Listeners (notifications, payment, analytics) react to state-change events.
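A sketch of one transition, assuming boto3/DynamoDB; the table layout and attribute names are illustrative. The condition expression is what turns a double-booking into a rejected write rather than a silent overwrite.

```python
# Sketch of a trip-state transition: validate against the allowed-transition
# table, then apply it as a DynamoDB conditional write so the state can only
# move forward from the value we validated.
import boto3
from botocore.exceptions import ClientError

ALLOWED = {
    "requested": {"assigned", "cancelled_by_rider"},
    "assigned": {"driver_arrived", "cancelled_by_rider", "cancelled_by_driver"},
    "driver_arrived": {"in_trip", "no_show"},
    "in_trip": {"completed"},
    "completed": {"paid"},
}

trips = boto3.resource("dynamodb").Table("trips")

def transition(trip_id: str, expected: str, new_state: str) -> bool:
    if new_state not in ALLOWED.get(expected, set()):
        return False                              # out-of-order event: reject
    try:
        trips.update_item(
            Key={"trip_id": trip_id},
            UpdateExpression="SET #s = :new",
            ConditionExpression="#s = :expected",  # fails if someone moved it first
            ExpressionAttributeNames={"#s": "state"},
            ExpressionAttributeValues={":new": new_state, ":expected": expected},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False                          # lost the race; caller re-reads state
        raise
```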
Why strong consistency here
The match assignment must be exclusive (one driver, one trip). Two riders being matched to the same driver simultaneously is a P0 bug.
Idempotency
Drivers and riders retry actions (network flakes). Every action carries a client-generated UUID; the orchestrator dedupes. Critical for payment to avoid double-charging.
Out-of-order events
"Driver arrived" and "driver started trip" can arrive out of order (network reordering). State machine validates allowed transitions and rejects invalid ones.
Failure recovery
Trip orchestrator outage during an active trip is bad. Solutions:
- Active replication across multiple regions for trip state.
- Each driver's app maintains optimistic local state and syncs on reconnect.
- Operations dashboard for support to manually resolve stuck trips.
Lost-driver / lost-rider
GPS loss, dead phone. Detect prolonged silence; ping the other party. Allow rider to manually mark "ride completed" if driver is unreachable. Insurance / safety policies kick in.
Fraud detection
Trip patterns that suggest collusion (driver repeatedly accepts trips from a single rider with no actual movement) are flagged for review. Fraud detection runs offline; doesn't slow the dispatch hot path.
Trade-offs
Spatial index choice
H3 is the modern winner for ride-share. Geohash works for the simplest possible system. S2 if you're already in Google's stack. The wrong choice shows up as higher matching latency or cell-boundary bugs.
Match latency vs match quality
Per-request matching is fast but locally-greedy. Batch matching produces better global solutions but adds 1-2 seconds latency. Most systems run per-request for premium products and small-batch for shared rides.
Driver location update frequency
4 seconds is the production sweet spot. 1 second drains driver phone batteries and overwhelms the location pipeline. 30 seconds makes ETAs inaccurate and matching stale.
ETA accuracy vs compute cost
Pure shortest-path × live traffic is fast and ~85% accurate. ML refinement adds 10-15% accuracy at significant compute cost. Worth it for the user trust win.
Surge: market-clearing vs user backlash
Aggressive surge maximizes short-term revenue and supply rebalancing. It also generates negative press and regulatory attention. Balance the algorithm with user-facing caps and transparent communication.
Multi-product support
A single dispatch backend supporting UberX, Pool, Eats, Reserve adds significant complexity (different objectives, different time horizons). Many platforms run separate dispatchers per product line for simpler service ownership.
Common follow-up questions
Be ready for at least three of these. The first one is almost always asked.
- How would you handle a city-scale outage (e.g., spatial index for SF goes down)?
- What changes for food delivery (Uber Eats) vs ride-share?
- How do you support scheduled rides (book for tomorrow at 8am)?
- What's your strategy for hot spots (airport pickup queues)?
- How would you implement a 'driver heatmap' that shows where to drive for more rides?
- How do you detect and prevent driver fraud (fake trips, account-sharing)?
- How does the system handle cross-border rides or roaming drivers?
- What changes for a fleet of self-driving vehicles (no driver acceptance step)?