We use cookies for site analytics. Accept to help us understand how the site is used. See our Privacy Policy for details.
Five algorithms, three sharding strategies, one fail-open vs fail-closed decision. The bounded design that surfaces in every backend interview loop.
Design a rate limiter: a service that decides whether an incoming request from a given client (user, API key, IP) should be allowed or rejected based on a configured rate. Common rules: 100 requests per minute per user, 10 requests per second per API key with a 50-burst allowance, or per-tenant tiered limits.
Rate limiters show up in two interview flavors. The applied flavor: implement a single-process rate limiter (bug-squash style). The system design flavor: design a distributed rate limiter that protects a fleet of services across regions. This walkthrough is the system-design flavor. Both share the algorithmic core.
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
Memory
Throughput
Network
Latency budget breakdown
Three deployment patterns, each with sharply different trade-offs.
Pattern 1: Edge / gateway (the dominant pattern)
Rate limiter lives in the API gateway (Envoy, Kong, NGINX, AWS API Gateway). Every request hits the gateway, gateway makes the decision, request flows through or 429s. Centralized state in Redis or DynamoDB. Best for: most use cases. The gateway already terminates TLS and routes; rate limiting fits naturally.
Pattern 2: Per-service sidecar
Each service has a sidecar enforcing rate limits. State is local + occasionally synced. Best for: internal service-to-service rate limiting where the gateway is bypassed.
Pattern 3: Library in-process
Each service binary embeds a rate-limit library. State is local, eventually synced via gossip or a central coordinator. Best for: highest-performance use cases where even gateway latency is too much (HFT, ad serving).
Recommendation: edge gateway pattern. Mention the others as context.
Receives every request, calls rate-limit decision service, forwards or 429s. Sets X-RateLimit-* response headers.
Stateless. Looks up applicable rules for the request, calls limit store with key + algorithm, returns allow/deny.
Stores active rules (per-tenant limits, per-endpoint overrides). Hot-reloaded from a config service. Cached locally with short TTL.
Atomic INCR / sliding window add. Sharded by rate-limit key (user_id, API key, etc.). Source of truth for counters.
Maps incoming request to (user_id, tier, applicable rules). Lookup against the user/tenant DB, cached aggressively.
Emits rate-limit decisions to an analytics topic (allowed, denied, near-limit). Drives dashboards and alerts.
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
Token bucket (the right default)
Each key has a bucket of N tokens, refilled at R tokens/second. Each request consumes 1 token. If the bucket is empty, deny. Pro: handles bursts naturally (full bucket = burst capacity). Pro: simple to implement. Pro: smooth refill. Used by AWS API Gateway, Stripe, Cloudflare.
Implementation in Redis:
Leaky bucket
Requests fill a bucket; bucket drains at constant rate. If the bucket overflows, deny. Functionally similar to token bucket but enforces a smooth output rate (good for protecting downstream systems from bursts). Used in network traffic shaping.
Fixed window counter
Count requests in 1-minute buckets. Reset to 0 each minute. Pro: trivial (single INCR). Con: 2x burst at window boundaries (100 at 0:59 + 100 at 1:00 = 200 in 1 second). Use only for coarse limits where this is acceptable.
Sliding window log
Store a timestamped log of every request per key. On new request, count entries in [now - window, now]. Pro: exact. Con: memory-heavy (store every request).
Sliding window counter (the practical exact-ish answer)
Combine fixed-window counter with weighted contribution from the previous window. count = current_window × elapsed_in_window + previous_window × (1 - elapsed_in_window). Pro: O(1) memory. Pro: avoids the 2x burst of fixed window. Pro: ~99% accurate. Used by Cloudflare, Snowflake (in places).
Recommendation in interviews: token bucket as the default; sliding window counter when you need closer-to-exact accuracy without per-request log overhead.
Naive single Redis instance: works until the instance falls over. The hard problem is sharing state across multiple decision-service nodes globally.
Approach 1: Centralized store (the simple right answer)
All decision nodes hit a shared Redis Cluster. Sharded by rate-limit key. INCR is atomic per key. Each rate-limit key lives on exactly one Redis shard.
Approach 2: Local + async reconcile
Each decision node maintains its own counter. Periodically (every 100ms), nodes broadcast deltas to a central aggregator. Aggregator merges, redistributes. Works for relaxed accuracy.
Approach 3: Predictive + central reconcile
Each node gets a per-node "budget" from the central aggregator (e.g., 1/N of the global limit). Spends locally. When budget runs low, requests more. This is how AWS API Gateway scales.
For interview: lead with centralized Redis (approach 1). Mention approach 3 as the "if we're at 10M+ rps and the round-trip is too expensive" answer.
Multi-region nuance: a global "100 rps per user" limit across regions requires either centralized state in one region (added latency) or accepting per-region limits that sum to the global cap. Most production systems accept per-region limits.
When the limit store is unreachable, the decision service has two options.
Fail-open
Allow the request. The service degrades to no rate limiting but stays responsive.
Fail-closed
Deny the request. The service degrades to "rate limit cannot be verified, denying for safety."
The right answer: fail-open by default for protective rate limiting (DDoS-style), fail-closed for billing-critical rate limiting (where exceeding the limit costs the company money).
Hybrid: local fallback
Each decision node maintains a small in-memory cache of "last known state" per key. If the limit store is unreachable, consult the local cache for an approximate decision. This catches the worst-case (user was at 99/100 a second ago, probably still over) without the full fail-closed cost.
Operational signal: monitor the rate of "store unreachable, falling back" events. A sudden spike means your limit store is degrading even if it hasn't fully failed.
Real APIs have plans (free, pro, enterprise) and per-endpoint limits (POST /charges has a different limit than GET /balance).
Rule structure
rule {
name: "free_tier_global"
match: { plan: "free" }
limit: { rate: 100, period: "1m", burst: 20 }
}
rule {
name: "free_tier_charges"
match: { plan: "free", endpoint: "/charges" }
limit: { rate: 10, period: "1m", burst: 5 }
}
A request matches multiple rules (free tier global + free tier charges); each rule's counter is consumed; if any rule denies, the request is denied. Return the strictest applicable limit's headers.
Hot-loading rules
Rules live in a config service (Consul, etcd, or a dedicated config DB). Decision service polls every 30s for updated rules. Local cache + ETag avoids re-fetching unchanged rules.
A/B testing limits
Want to test a tighter limit on 1% of users? Add a rule with a "user_id_hash mod 100 < 1" predicate. The decision service evaluates predicates in priority order; first match wins.
Per-endpoint overrides
Most endpoints share the global limit; expensive endpoints (search, export) get tighter limits. Define overrides as "additional" rules, not "replacements" - the global limit still applies, plus the endpoint-specific cap.
Interview signal: candidates who structure rules as a small DSL (not hard-coded if-else) score higher because it shows operational maturity.
Rate limiters interact with clients - response design matters.
Standard headers
X-RateLimit-Limit: the configured rate (e.g., 100).X-RateLimit-Remaining: requests remaining in the current window.X-RateLimit-Reset: epoch timestamp when the window resets.Retry-After: seconds the client should wait before retrying. Set on 429 responses.Status codes
Client-side behavior
Well-behaved clients respect Retry-After and back off. Misbehaving clients ignore it and hammer. The server must protect itself regardless.
Exponential backoff with jitter
Recommended client retry: wait Retry-After × (1 + random(0, 1)). Without jitter, multiple clients retry in lockstep and re-create the spike.
Webhooks and async clients
For webhooks (server-to-server), the receiver can return 429 to signal the sender to slow down. The sender must implement retry-with-backoff per the spec. Stripe's webhook delivery system implements this.
The interview gotcha: candidates often forget that the rate limiter still consumes a slot for the request that gets rate limited. Decide consciously - some limiters count denied requests against the limit (more aggressive); some don't (more forgiving).
Algorithm complexity vs accuracy
Token bucket is simple and handles bursts naturally - the right default. Sliding window counter is more accurate but more complex. Sliding window log is exact but memory-heavy. Pick the simplest algorithm that meets your accuracy needs.
Centralized vs distributed counters
Centralized (single Redis cluster) is simple but adds 1-2ms per decision. Distributed (per-node budgets) is fast but introduces drift. At 10M+ rps the round-trip cost forces distributed; below that, centralized is the right answer.
Fail-open vs fail-closed
Fail-open during limit-store outages preserves availability at the cost of brief enforcement gaps. Fail-closed preserves enforcement at the cost of cascading availability impact. Default to fail-open with a local-cache fallback for known-hot keys.
Strict vs approximate accuracy
Some rules can tolerate ~5% over-enforcement (DDoS protection, abuse limits). Others can't (billing, compliance). Match the algorithm and store consistency to the rule's accuracy requirement, not to "we always need exactness."
Edge vs per-service
Edge rate limiting catches everything at the front door, simpler to operate, easier to share state. Per-service rate limiting catches internal traffic and offers finer control. Most systems do edge + a thin per-service layer for internal calls.
Be ready for at least three of these. The first one is almost always asked.
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
Start an AI mock interview →