Design a Rate Limiter (API Throttling)
Five algorithms, three coordination strategies, one fail-open vs fail-closed decision. The bounded design problem that surfaces in every backend interview loop.
The problem
Design a rate limiter: a service that decides whether an incoming request from a given client (user, API key, IP) should be allowed or rejected based on a configured rate. Common rules: 100 requests per minute per user, 10 requests per second per API key with a 50-burst allowance, or per-tenant tiered limits.
Rate limiters show up in two interview flavors. The applied flavor: implement a single-process rate limiter (bug-squash style). The system design flavor: design a distributed rate limiter that protects a fleet of services across regions. This walkthrough is the system-design flavor. Both share the algorithmic core.
Clarifying questions
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
- What's the granularity - per user, per API key, per IP, per tenant, per (user, endpoint) tuple?
- Is this protecting one service or a fleet behind a shared gateway?
- What are the rate rules - fixed RPM, RPS with burst, tiered by plan, dynamic per-endpoint?
- What happens on rate-limit hit - 429 response, queue and delay, fail-open?
- How critical is exact accuracy - is allowing 102 requests in a 100/min window catastrophic, or acceptable?
- What's the failure model - if the rate-limit store is unreachable, do we fail open (allow all) or fail closed (deny all)?
- Are limits enforced at the edge (CDN/gateway) or per-service?
- Multi-region - does a user's 100/min apply globally or per-region?
Requirements
Functional requirements
- Decide allow vs deny for each incoming request based on the configured rate
- Return current usage state in response headers (X-RateLimit-Remaining, Retry-After)
- Support multiple rules per request (per-user AND per-IP AND per-endpoint)
- Per-tenant tiered limits (free 100/min, pro 1K/min, enterprise 10K/min)
- Dynamic rule updates without restart
Non-functional requirements
- Scale
- 1M rps aggregate across all rules. 100M unique rate-limit keys (user_ids × API keys × IP buckets) at peak. 10K rules in active configuration.
- Latency
Decision p99 < 5ms. The rate limiter sits on the user-facing path - it's pure overhead, so the latency budget is brutal.
- Availability
- 99.999%. The rate limiter must be more available than the service it protects, otherwise it becomes the bottleneck. Practically: fail-open mode + stateless decision in degraded paths.
- Consistency
- Approximate is fine. Allowing 105 requests in a 100/min window during a network partition is acceptable; blocking legitimate traffic during a partition is not.
Capacity estimation
Memory
- 100M active keys × ~100 bytes each (counter + timestamp) = 10 GB. Fits in memory across a small Redis cluster.
- For sliding-window log algorithm: ~1KB per active key (last N timestamps). 100 GB. Not memory-bound but starting to matter.
Throughput
- 1M rps decisions. Each decision = 1 read + 1 write (atomic INCR / sliding window add) on the limit store.
- Net store ops: 2M ops/sec. Distributed across 20-50 Redis nodes at 100K ops/sec each.
Network
- Decision RPC: tiny (a few hundred bytes). 1M rps × 200B = 200 MB/s aggregate. Fits in a single 10 Gbps NIC at the limit-store tier.
Latency budget breakdown
- Network round-trip in-region: ~0.5ms.
- Limit store op: ~0.5-1ms.
- Decision logic: <0.1ms.
- Total: ~2-3ms once the gateway-to-decision-service hop is included. Leaves headroom for the 5ms target.
High-level architecture
Three deployment patterns, each with sharply different trade-offs.
Pattern 1: Edge / gateway (the dominant pattern)
Rate limiter lives in the API gateway (Envoy, Kong, NGINX, AWS API Gateway). Every request hits the gateway, gateway makes the decision, request flows through or 429s. Centralized state in Redis or DynamoDB. Best for: most use cases. The gateway already terminates TLS and routes; rate limiting fits naturally.
Pattern 2: Per-service sidecar
Each service has a sidecar enforcing rate limits. State is local + occasionally synced. Best for: internal service-to-service rate limiting where the gateway is bypassed.
Pattern 3: Library in-process
Each service binary embeds a rate-limit library. State is local, eventually synced via gossip or a central coordinator. Best for: highest-performance use cases where even gateway latency is too much (HFT, ad serving).
Recommendation: edge gateway pattern. Mention the others as context.
API Gateway
Receives every request, calls rate-limit decision service, forwards or 429s. Sets X-RateLimit-* response headers.
Rate-limit decision service
Stateless. Looks up applicable rules for the request, calls limit store with key + algorithm, returns allow/deny.
Rule registry
Stores active rules (per-tenant limits, per-endpoint overrides). Hot-reloaded from a config service. Cached locally with short TTL.
Limit store (Redis Cluster)
Atomic INCR / sliding window add. Sharded by rate-limit key (user_id, API key, etc.). Source of truth for counters.
Tier classifier
Maps incoming request to (user_id, tier, applicable rules). Lookup against the user/tenant DB, cached aggressively.
Async telemetry pipeline
Emits rate-limit decisions to an analytics topic (allowed, denied, near-limit). Drives dashboards and alerts.
Deep dives
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
1. Algorithm choice: five options, three real winners
Token bucket (the right default)
Each key has a bucket of N tokens, refilled at R tokens/second. Each request consumes 1 token. If the bucket is empty, deny. Pro: handles bursts naturally (full bucket = burst capacity). Pro: simple to implement. Pro: smooth refill. Used by AWS API Gateway, Stripe, Cloudflare.
Implementation in Redis:
- Store: (last_refill_timestamp, current_tokens) per key.
- On request: compute tokens = min(capacity, current + (now - last_refill) × R). If tokens >= 1, decrement, allow. Else deny.
- Atomic via a Lua script (a minimal sketch follows this list).
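A minimal sketch of that script using redis-py; the key layout, parameters, and helper names are illustrative assumptions, not any particular vendor's implementation:

import time
import redis

# Token-bucket check as one atomic Lua script (illustrative sketch).
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])        -- refill rate, tokens/second
local now = tonumber(ARGV[3])

local state = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(state[1]) or capacity
local last_refill = tonumber(state[2]) or now

-- Refill for elapsed time, capped at bucket capacity.
tokens = math.min(capacity, tokens + (now - last_refill) * rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 120)        -- GC keys idle long enough to refill fully
return allowed
"""

r = redis.Redis()
token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def allow(user_id: str, capacity: int = 20, rate: float = 100 / 60) -> bool:
    # e.g. 100 requests/minute sustained with a 20-token burst allowance
    return token_bucket(keys=[f"rl:{user_id}"], args=[capacity, rate, time.time()]) == 1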
Leaky bucket
Requests fill a bucket; bucket drains at constant rate. If the bucket overflows, deny. Functionally similar to token bucket but enforces a smooth output rate (good for protecting downstream systems from bursts). Used in network traffic shaping.
Fixed window counter
Count requests in 1-minute buckets. Reset to 0 each minute. Pro: trivial (single INCR). Con: 2x burst at window boundaries (100 at 0:59 + 100 at 1:00 = 200 in 1 second). Use only for coarse limits where this is acceptable.
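As a sketch (redis-py, 1-minute windows; the key layout is an illustrative assumption):

import time
import redis

r = redis.Redis()

def allow(user_id: str, limit: int = 100) -> bool:
    # One counter per (key, minute); a single atomic INCR per decision.
    window = int(time.time() // 60)
    key = f"rl:{user_id}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 120)  # expire stale windows so idle keys get GC'd
    return count <= limit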
Sliding window log
Store a timestamped log of every request per key. On new request, count entries in [now - window, now]. Pro: exact. Con: memory-heavy (store every request).
Sliding window counter (the practical exact-ish answer)
Combine the fixed-window counter with a weighted contribution from the previous window: count ≈ current_window + previous_window × (1 - elapsed_fraction), where elapsed_fraction is how far into the current window you are. Pro: O(1) memory. Pro: avoids the 2x burst of fixed window. Pro: ~99% accurate. Used by Cloudflare, Snowflake (in places).
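A minimal in-process sketch of that estimate; in production the two counters would live in the limit store, keyed by window number:

import time
from collections import defaultdict

WINDOW = 60   # seconds
LIMIT = 100

# (key, window_number) -> request count; sketch only, unbounded in memory.
counters = defaultdict(int)

def allow(key: str) -> bool:
    now = time.time()
    window = int(now // WINDOW)
    elapsed_fraction = (now % WINDOW) / WINDOW

    current = counters[(key, window)]
    previous = counters[(key, window - 1)]

    # Weight the previous window by how much of it still overlaps the
    # trailing 60 seconds; the current window always counts in full.
    estimated = current + previous * (1 - elapsed_fraction)
    if estimated >= LIMIT:
        return False

    counters[(key, window)] += 1
    return True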
Recommendation in interviews: token bucket as the default; sliding window counter when you need closer-to-exact accuracy without per-request log overhead.
2. Distributed coordination: how to keep counters consistent across nodes
Naive single Redis instance: works until the instance falls over. The hard problem is sharing state across multiple decision-service nodes globally.
Approach 1: Centralized store (the simple right answer)
All decision nodes hit a shared Redis Cluster. Sharded by rate-limit key. INCR is atomic per key. Each rate-limit key lives on exactly one Redis shard.
- Latency: 1-2ms per decision (extra round trip to Redis).
- Failure: shard failure → keys on that shard are unenforced. Mitigation: replication.
- Cost: 100M keys × ~100 bytes × 3x replication ≈ 30 GB. Modest for a 20-50 node cluster.
Approach 2: Local + async reconcile
Each decision node maintains its own counter. Periodically (every 100ms), nodes broadcast deltas to a central aggregator. Aggregator merges, redistributes. Works for relaxed accuracy.
- Latency: 0ms (purely local).
- Drift: counters can diverge by up to sync_interval × node_count × per-node rate. For 100 nodes at 10K rps each on a 100ms cycle, that's 100K requests of drift per cycle - too much for tight limits.
Approach 3: Predictive + central reconcile
Each node gets a per-node "budget" from the central aggregator (e.g., 1/N of the global limit). Spends locally. When the budget runs low, it requests more (see the sketch after this list). This is how AWS API Gateway scales.
- Latency: 0ms in steady state.
- Drift: bounded by per-node budget size.
- Best for high-RPS limits where the centralized round-trip is too slow.
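A sketch of the node-local side, where fetch_budget() is a hypothetical stand-in for the RPC to the central aggregator:

import threading

def fetch_budget(key: str) -> int:
    # Hypothetical stand-in for the aggregator RPC: grants another slice
    # of the global limit (0 when the global limit is exhausted).
    return 100  # placeholder grant for the sketch

class LocalBudget:
    def __init__(self, key: str, low_water: int = 10):
        self.key = key
        self.remaining = 0
        self.low_water = low_water
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            if self.remaining <= self.low_water:
                self.remaining += fetch_budget(self.key)  # refill before dry
            if self.remaining >= 1:
                self.remaining -= 1
                return True
            return False

In a real node the refill would happen asynchronously, off the request path, so steady-state decisions stay purely local.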
For interview: lead with centralized Redis (approach 1). Mention approach 3 as the "if we're at 10M+ rps and the round-trip is too expensive" answer.
Multi-region nuance: a global "100 rps per user" limit across regions requires either centralized state in one region (added latency) or accepting per-region limits that sum to the global cap. Most production systems accept per-region limits.
3. Fail-open vs fail-closed: the operational decision
When the limit store is unreachable, the decision service has two options.
Fail-open
Allow the request. The service degrades to no rate limiting but stays responsive.
- Pro: protects user experience during a limit-store outage.
- Pro: doesn't cascade a limit-store outage into service unavailability.
- Con: malicious or buggy clients can exploit the outage to flood the service.
Fail-closed
Deny the request. The service degrades to "rate limit cannot be verified, denying for safety."
- Pro: deterministic behavior.
- Con: a 5-second limit-store blip turns into a 5-second outage of every protected endpoint. Cascading failure.
The right answer: fail-open by default for protective rate limiting (DDoS-style), fail-closed for billing-critical rate limiting (where exceeding the limit costs the company money).
Hybrid: local fallback
Each decision node maintains a small in-memory cache of "last known state" per key. If the limit store is unreachable, consult the local cache for an approximate decision. This catches the worst-case (user was at 99/100 a second ago, probably still over) without the full fail-closed cost.
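A sketch of that fallback path, where check_store() is a hypothetical stand-in for the normal atomic check against Redis:

import time

FALLBACK_TTL = 5   # seconds the cached state stays trustworthy
last_known = {}    # key -> (remaining, observed_at)

def check_store(key: str) -> int:
    # Hypothetical stand-in: atomic check in the limit store, returning
    # requests remaining; raises ConnectionError when the store is down.
    raise ConnectionError

def decide(key: str) -> bool:
    try:
        remaining = check_store(key)
        last_known[key] = (remaining, time.time())
        return remaining > 0
    except ConnectionError:
        cached = last_known.get(key)
        if cached:
            remaining, observed_at = cached
            # Was at/over the limit moments ago - probably still is: deny.
            if time.time() - observed_at < FALLBACK_TTL and remaining <= 0:
                return False
        return True  # fail open when there's no evidence of abuse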
Operational signal: monitor the rate of "store unreachable, falling back" events. A sudden spike means your limit store is degrading even if it hasn't fully failed.
4. Tiered limits and dynamic rule loading
Real APIs have plans (free, pro, enterprise) and per-endpoint limits (POST /charges has a different limit than GET /balance).
Rule structure
rule {
  name: "free_tier_global"
  match: { plan: "free" }
  limit: { rate: 100, period: "1m", burst: 20 }
}

rule {
  name: "free_tier_charges"
  match: { plan: "free", endpoint: "/charges" }
  limit: { rate: 10, period: "1m", burst: 5 }
}
A request matches multiple rules (free tier global + free tier charges); each rule's counter is consumed; if any rule denies, the request is denied. Return the strictest applicable limit's headers.
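A sketch of that evaluation loop, with Rule and consume() as illustrative stand-ins for the rule registry and limit store:

from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    match: dict          # e.g. {"plan": "free", "endpoint": "/charges"}
    rate: int
    period: str

def consume(rule: Rule, request: dict) -> tuple[bool, int]:
    # Hypothetical stand-in: atomically consume this rule's counter in the
    # limit store; returns (allowed, requests remaining).
    return True, rule.rate

def matches(rule: Rule, request: dict) -> bool:
    return all(request.get(k) == v for k, v in rule.match.items())

def decide(request: dict, rules: list[Rule]) -> tuple[bool, float]:
    allowed, strictest_remaining = True, float("inf")
    for rule in (r for r in rules if matches(r, request)):
        ok, remaining = consume(rule, request)  # every matching counter is consumed
        allowed = allowed and ok
        strictest_remaining = min(strictest_remaining, remaining)
    # strictest_remaining drives the X-RateLimit-Remaining header.
    return allowed, strictest_remaining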
Hot-loading rules
Rules live in a config service (Consul, etcd, or a dedicated config DB). Decision service polls every 30s for updated rules. Local cache + ETag avoids re-fetching unchanged rules.
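A sketch of the polling loop, assuming the config service speaks HTTP with ETags (CONFIG_URL is a hypothetical endpoint):

import time
import requests

CONFIG_URL = "https://config.internal/rate-limit-rules"  # hypothetical

def poll_rules():
    etag, rules = None, []
    while True:
        headers = {"If-None-Match": etag} if etag else {}
        resp = requests.get(CONFIG_URL, headers=headers, timeout=5)
        if resp.status_code == 200:          # rules changed: swap them in
            rules = resp.json()
            etag = resp.headers.get("ETag")
        # 304 Not Modified: keep the current rule set
        time.sleep(30)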
A/B testing limits
Want to test a tighter limit on 1% of users? Add a rule with a "user_id_hash mod 100 < 1" predicate. The decision service evaluates predicates in priority order; first match wins.
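A sketch of the stable bucketing predicate - hashlib rather than Python's per-process-salted hash(), so the cohort is consistent across nodes and restarts:

import hashlib

def in_experiment(user_id: str, percent: int = 1) -> bool:
    digest = hashlib.sha256(user_id.encode()).digest()
    # Stable 0-99 bucket per user: the same user lands in the same bucket
    # on every node, so the experiment cohort is consistent fleet-wide.
    return int.from_bytes(digest[:8], "big") % 100 < percent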
Per-endpoint overrides
Most endpoints share the global limit; expensive endpoints (search, export) get tighter limits. Define overrides as "additional" rules, not "replacements" - the global limit still applies, plus the endpoint-specific cap.
Interview signal: candidates who structure rules as a small DSL (not hard-coded if-else) score higher because it shows operational maturity.
5. Headers, retry-after, and client behavior
Rate limiters interact with clients - response design matters.
Standard headers
- X-RateLimit-Limit: the configured rate (e.g., 100).
- X-RateLimit-Remaining: requests remaining in the current window.
- X-RateLimit-Reset: epoch timestamp when the window resets.
- Retry-After: seconds the client should wait before retrying. Set on 429 responses.
Status codes
- 429 Too Many Requests: standard. Include Retry-After.
- 503 Service Unavailable: only for fail-closed scenarios where the limiter itself is down.
Client-side behavior
Well-behaved clients respect Retry-After and back off. Misbehaving clients ignore it and hammer. The server must protect itself regardless.
Exponential backoff with jitter
Recommended client retry: wait Retry-After × (1 + random(0, 1)). Without jitter, multiple clients retry in lockstep and re-create the spike.
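A client-side sketch of that policy, where send() is a hypothetical callable returning the response status and Retry-After value:

import random
import time

def call_with_retries(send, max_attempts: int = 5) -> int:
    # send() is a hypothetical stand-in returning (status, retry_after_seconds).
    status = 429
    for _ in range(max_attempts):
        status, retry_after = send()
        if status != 429:
            return status
        # Retry-After × (1 + random(0, 1)): jitter breaks retry lockstep.
        time.sleep(retry_after * (1 + random.random()))
    return status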
Webhooks and async clients
For webhooks (server-to-server), the receiver can return 429 to signal the sender to slow down. The sender must implement retry-with-backoff per the spec. Stripe's webhook delivery system implements this.
The interview gotcha: candidates often forget that the rate limiter still consumes a slot for the request that gets rate limited. Decide consciously - some limiters count denied requests against the limit (more aggressive); some don't (more forgiving).
Trade-offs
Algorithm complexity vs accuracy
Token bucket is simple and handles bursts naturally - the right default. Sliding window counter is more accurate but more complex. Sliding window log is exact but memory-heavy. Pick the simplest algorithm that meets your accuracy needs.
Centralized vs distributed counters
Centralized (single Redis cluster) is simple but adds 1-2ms per decision. Distributed (per-node budgets) is fast but introduces drift. At 10M+ rps the round-trip cost forces distributed; below that, centralized is the right answer.
Fail-open vs fail-closed
Fail-open during limit-store outages preserves availability at the cost of brief enforcement gaps. Fail-closed preserves enforcement at the cost of cascading availability impact. Default to fail-open with a local-cache fallback for known-hot keys.
Strict vs approximate accuracy
Some rules can tolerate ~5% over-enforcement (DDoS protection, abuse limits). Others can't (billing, compliance). Match the algorithm and store consistency to the rule's accuracy requirement, not to "we always need exactness."
Edge vs per-service
Edge rate limiting catches everything at the front door, simpler to operate, easier to share state. Per-service rate limiting catches internal traffic and offers finer control. Most systems do edge + a thin per-service layer for internal calls.
Common follow-up questions
Be ready for at least three of these. The first one is almost always asked.
- How would you handle a 100x traffic spike (DDoS) - does the rate limiter survive?
- What changes if the same user's traffic must be coordinated across 5 regions?
- How do you support a 'burst plus sustained' limit (1000 burst, 100/sec sustained)?
- What's your monitoring story - what alerts on rate-limit dysfunction?
- How would you migrate from fixed window to sliding window without dropping requests?
- How do you prevent a single tenant from exhausting the limit-store cluster's capacity?
- What happens when rules are misconfigured - what's the safety net?
- How would you implement 'dynamic limits' that auto-adjust based on system load?
Practice in interview format
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
Start an AI mock interview →