SRE interviews are different from generic backend or DevOps interviews. The role exists because someone needs to be the adult in the room when a database melts at 2am. Hiring managers screen for that disposition - calm under pressure, deep systems knowledge, and the willingness to say "let's go look at the data" instead of guessing.
This guide covers the 50 questions that show up most often in real SRE loops in 2026. They're grouped by the four sections most companies use: reliability fundamentals, distributed systems and failure modes, on-call and incident response, and capacity, performance, and tooling.
Reliability Fundamentals (Q1-12)
1. What's the difference between SLI, SLO, and SLA?
SLI is the measurement (a number). SLO is the target you commit to internally (a goal). SLA is the contract you sign with customers, usually with money on the line if you miss it. SLOs should be stricter than SLAs - the gap is your safety margin.
2. What's an error budget?
The amount of unreliability allowed in a period before you've broken your SLO. If your SLO is 99.9% uptime, you have ~43 min of allowed downtime per 30-day month. Burn it, you stop shipping risky changes and focus on reliability.
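The ~43 minute figure is just arithmetic, and it's worth being able to reproduce it on a whiteboard. A minimal sketch, assuming a 30-day window and an availability-style SLO expressed as a fraction (the function name is made up for illustration):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


# Each extra nine cuts the budget by a factor of ten.
for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} min per 30 days")
```

For request-based SLOs the same idea applies, with "allowed failed requests" in place of minutes.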
3. Why is 100% reliability a bad target?
The cost of each additional nine grows exponentially, and beyond a point users can't tell the difference (their network and devices fail more often than your service). 100% also leaves no room for shipping changes - you'd be frozen.
4. How do you choose what to measure for SLIs?
Start from the user's perspective: what does "this is working" mean to them? Then translate that into measurable signals on your side - request success rate, latency at p95/p99, data freshness, correctness checks.
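To make that concrete, here is a rough sketch of two of those signals computed from raw request data. It assumes only 5xx responses count against the service (4xx is treated as the client's problem) and uses the nearest-rank percentile method; both are choices for illustration, not the only valid ones.

```python
import math


def availability_sli(status_codes):
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not status_codes:
        raise ValueError("no requests in window")
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


def latency_percentile(latencies_ms, pct):
    """Nearest-rank percentile: smallest sample with pct% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


codes = [200, 200, 200, 503, 404]
latencies = [12, 15, 11, 480, 14]
print(availability_sli(codes), latency_percentile(latencies, 95))
```

In practice these numbers usually come from a metrics system rather than raw lists, but the definitions are the part interviewers probe.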
5. What's the difference between availability and reliability?
Availability: is the service up? Reliability: does it work correctly when it's up? A service can be 100% available and still completely broken if it's returning wrong answers.
6. What's a "burn rate" alert?
An alert that fires when you're consuming error budget faster than the SLO window allows. A high burn rate (e.g., 14x the allowed rate) means you'll exhaust the budget quickly - alert immediately. A low burn rate sustained over hours also matters - alert on the slow leak, just with different urgency.
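Burn rate is the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes you already have error ratios for a long and a short lookback window; requiring both to exceed the threshold is one common way to avoid paging on a spike that has already passed. The 14x threshold is taken from the example above, and the function names are illustrative.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    allowed_error_ratio = 1.0 - slo
    return error_ratio / allowed_error_ratio


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float, threshold: float = 14.0) -> bool:
    """Page only when both the long and the short window burn above the threshold."""
    return (burn_rate(long_window_errors, slo) >= threshold
            and burn_rate(short_window_errors, slo) >= threshold)


# 2% errors against a 99.9% SLO burns budget roughly 20x faster than allowed.
print(burn_rate(0.02, 0.999))
```

A slow-leak alert would reuse `should_page` with longer windows and a much lower threshold, routed to a ticket instead of a page.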
7. How do you decide if something is "user-impacting"?
The user noticed (or would notice if they tried). Internal errors that don't surface to users aren't user-impacting, even if they look scary in dashboards. Don't burn error budget on stuff users can't see.
8. What's the SRE role's relationship with developers?
Partnership, not gatekeeping. SREs codify reliability requirements (SLOs, deployment guardrails, observability standards) so developers can move fast within them. Staffing ratios vary a lot by company; 1 SRE per 5-10 developers is a rough rule of thumb, not a standard.
9. What's "toil" and why does it matter?
Manual, repetitive, automatable work that scales linearly with system size. Google's SRE book recommends capping toil at 50% of SRE time so engineering work to reduce future toil doesn't get squeezed out.
10. How do you know when to invest in reliability vs. features?
Error budget. If you're burning it fast, slow feature work and invest in reliability. If you're under budget, ship more aggressively. The numbers, not the loudest stakeholder, drive the decision.
11. Explain the "blameless postmortem" principle.
Investigate what the system did wrong, not who. Humans make mistakes when systems make those mistakes possible. Postmortems should produce systemic fixes, not individual lectures.
12. What's the difference between proactive and reactive reliability work?
Reactive: responding to outages and pages. Proactive: chaos engineering, capacity planning, automated remediation, dependency hardening. Mature teams spend more time proactive than reactive.
Distributed Systems and Failure Modes (Q13-25)
13. What is the CAP theorem and how does it actually apply in practice?
During a network partition, you have to choose between consistency and availability; when there's no partition, you can have both. Most real systems are AP (availability + partition tolerance) with eventual consistency, but specific subsystems (financial ledgers, locks) demand CP. CAP isn't either/or globally - it's per operation.
14. What's a thundering herd and how do you mitigate it?
Many clients suddenly hit a service at once (e.g., after a cache expiry). Mitigations: jitter on retries, randomized cache TTLs, request coalescing, exponential backoff, circuit breakers.
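Randomized TTLs are the cheapest of those mitigations to show in code. A minimal sketch, assuming a base TTL in seconds and a +/-10% spread (both values are placeholders, and `jittered_ttl` is a hypothetical helper, not a cache library API):

```python
import random


def jittered_ttl(base_seconds: float, spread: float = 0.1, rng=random) -> float:
    """Return a TTL within +/- spread of the base so keys don't expire in lockstep."""
    return base_seconds * (1.0 + rng.uniform(-spread, spread))


# Keys written in the same second now expire across a roughly 60-second band.
ttls = [jittered_ttl(300) for _ in range(5)]
print([round(t, 1) for t in ttls])
```

Request coalescing goes one step further: when many callers miss on the same key, only one of them recomputes the value and the rest wait for its result.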
15. What's a cascading failure?
A small failure in one service triggers retries that overload another, which fails, which triggers more retries upstream, etc. The system collapses inward. Prevent with circuit breakers, deadlines on requests, and load shedding.
16. How do you handle retries safely?
Retries with no jitter cause thundering herds. Retries without idempotency cause data corruption. Retries without bounded attempts cause infinite amplification. Use exponential backoff with full jitter, idempotency keys, and a max attempt cap.
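Here is one way the backoff and attempt-cap parts of that answer can look in code. It's a sketch, not a library: the base delay, cap, and attempt count are placeholder values, and it retries on every exception, where real code would retry only on errors known to be transient. Idempotency is the caller's job and isn't shown here.

```python
import random
import time


def backoff_delays(base=0.1, cap=10.0, max_attempts=5, rng=random):
    """Yield sleep times: full jitter over an exponentially growing ceiling."""
    for attempt in range(max_attempts - 1):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=5, sleep=time.sleep, **backoff):
    """Run operation, retrying on failure up to max_attempts total tries."""
    delays = backoff_delays(max_attempts=max_attempts, **backoff)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # bounded: give up instead of amplifying load forever
            sleep(next(delays))
```

"Full jitter" means the delay is drawn uniformly between zero and the current ceiling, which spreads retrying clients out far better than adding a small random offset to a fixed schedule.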
17. What's a circuit breaker?
A pattern that stops sending requests to a downstream service that's failing, so you don't pile on. After a cooldown, it sends a probe request - if successful, traffic resumes. Prevents cascading failures.
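A stripped-down version of the pattern, to show the state transitions. This sketch assumes a consecutive-failure threshold and a fixed cooldown (both placeholder values), is not thread-safe, and lets any call after the cooldown act as the probe; production breakers usually track failure rates and limit concurrent probes.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to send a request downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("circuit open: request not sent")
            # Cooldown elapsed: let this call through as the probe.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The important property is that while the circuit is open, the downstream service sees no traffic from this caller at all.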
18. Explain "load shedding."
When a service is at capacity, drop low-priority requests early so high-priority ones still succeed. Better to fail 10% of requests cleanly than to fail all of them by overload.
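One simple way to express this is an admission check with a different utilization cutoff per priority tier. The tier names and cutoffs below are made up for illustration; the point is that the least important traffic gets rejected first, before the service is actually saturated.

```python
# Hypothetical tiers: the utilization at which each one starts being rejected.
SHED_AT = {"critical": 1.00, "normal": 0.90, "batch": 0.75}


def admit(priority: str, in_flight: int, capacity: int) -> bool:
    """Decide at the front door whether to accept a request of this priority."""
    utilization = in_flight / capacity
    return utilization < SHED_AT[priority]


# At 80% utilization, batch work is rejected while user-facing requests continue.
print(admit("batch", 80, 100), admit("normal", 80, 100), admit("critical", 80, 100))
```

Rejecting early with a cheap error is what keeps the failure clean: a shed request costs almost nothing, while a request that times out deep in the stack has already consumed the capacity you were trying to protect.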
19. What's the difference between consistency models (strong, eventual, causal)?
Strong: all readers see the latest write immediately. Eventual: readers eventually see the write, with no time guarantee. Causal: if A causally precedes B, all readers see them in that order. Choose based on user expectations, not engineering preference.
20. How does leader election work?
Nodes run a consensus protocol (Raft, Paxos) to agree on a single leader. Only the leader accepts writes. If the leader fails, the remaining nodes elect a new one. Coordination stores like etcd and Consul build on Raft, ZooKeeper uses its own protocol (Zab), and many distributed databases embed the same idea.
21. What's the split-brain problem?
A network partition where two parts of a cluster each think they're the only surviving group, both elect leaders, and both accept writes. Recovery is painful and may lose data. Prevent with quorum (majority rule).
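Majority quorum works because two disjoint groups can't both hold more than half the cluster. A quick sketch of the arithmetic (function names are illustrative):

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest group that is a strict majority of the cluster."""
    return cluster_size // 2 + 1


def can_elect_leader(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition side may elect a leader only if it holds a majority."""
    return reachable_nodes >= quorum_size(cluster_size)


# A 5-node cluster split 3/2: only the 3-node side can elect a leader.
print(can_elect_leader(3, 5), can_elect_leader(2, 5))
# A 4-node cluster split 2/2: neither side can, so no split brain (and no writes).
print(can_elect_leader(2, 4), can_elect_leader(2, 4))
```

The 2/2 case is why clusters are usually sized with an odd number of nodes: a fourth node adds cost without letting you survive any additional failure.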
22. How do you handle clock skew across distributed nodes?
Don't trust wall clocks for ordering. Use logical clocks (Lamport timestamps), vector clocks, or hybrid logical clocks. For absolute time, use NTP and accept bounded uncertainty.
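A Lamport clock is small enough to write out in an interview. The sketch below follows the standard rules: increment on every local event, attach the counter to outgoing messages, and on receipt jump to one past the larger of the local and received values.

```python
class LamportClock:
    """Logical clock: orders events by causality, not by wall time."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event: advance the counter."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, remote_time: int) -> int:
        """Receiving moves us past everything the sender had seen."""
        self.time = max(self.time, remote_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()
a.tick()
stamp = a.send()          # a's third event
print(b.receive(stamp))   # b jumps ahead of a's send, despite having done nothing yet
```

Lamport timestamps guarantee that a cause gets a smaller timestamp than its effect, but not the reverse; vector clocks exist for when you need to tell concurrent events apart.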
23. What's idempotency and why is it critical for distributed systems?
The same operation produces the same result regardless of how many times it runs. Without idempotency, retries are dangerous - they might charge a card twice or send duplicate emails. Use idempotency keys for all mutating operations.
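The card-charging case, sketched with an idempotency key. `PaymentService` is a hypothetical in-memory stand-in: a real implementation would keep the key-to-result mapping in a durable store and make the check-and-record step atomic.

```python
class PaymentService:
    """Toy service that remembers the result of each idempotency key."""

    def __init__(self):
        self._results = {}      # idempotency key -> receipt already issued
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with a known key returns the original receipt, no second charge.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = receipt
        return receipt


svc = PaymentService()
first = svc.charge("order-1234", 4999)
retry = svc.charge("order-1234", 4999)   # e.g. the client timed out and retried
print(first == retry, svc.charges_made)
```

The client generates the key once per logical operation and reuses it on every retry; generating a fresh key per attempt defeats the purpose.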
24. What's eventual consistency and what's a real-world example?
Writes propagate to all replicas eventually, but readers might briefly see stale data. Example: DNS - update a record and it can take minutes to hours to reach everyone, because resolvers keep serving their cached copy until its TTL expires.
25. How would you debug a service that's "slow but not down"?
Start with traces of slow requests, not metrics. Look for queueing (request bottlenecks), look for noisy neighbors (resource contention), check garbage collection pauses, profile CPU. Slow-but-not-down is almost always saturation somewhere - find it.
On-Call and Incident Response (Q26-37)
26. Walk me through how you'd respond to a Sev-1 page.
Acknowledge within minutes. Confirm the alert is real (look at user-facing signals, not just internal metrics). Establish a comms channel (Slack, war room). Assign roles: incident commander, comms lead, ops lead. Mitigate first, root cause later. Update stakeholders on a regular cadence.
27. What's the difference between mitigation and remediation?
Mitigation: stop the bleeding (revert, restart, failover, drain traffic). Remediation: fix the root cause permanently. Always mitigate first - users don't care about the elegant fix while they're broken.
28. What's an incident commander and why does the role exist?
The single person making coordination decisions during an incident. Without an IC, the team fragments - everyone debugging independently, no shared status, contradictory comms. The IC doesn't have to be the most senior engineer; they have to keep the response organized.
29. What goes in a good postmortem?
Timeline (with specific times), impact (users affected, duration, error budget burned), root cause (the actual cause, not the proximate one), action items (with owners and dates), and lessons learned beyond this specific incident.
30. What's "page fatigue"?
On-call engineers ignoring or dismissing alerts because the noise-to-signal ratio is too high. It's a leading indicator of a serious miss. Track alerts per shift; if it's above 2-3 per night, something is broken.
31. How do you tune noisy alerts without missing real issues?
Audit fired alerts - which ones produced action vs. were dismissed? Loosen the thresholds on the dismissed ones, or delete them. Add hysteresis (require N consecutive failures before firing). Move informational alerts to dashboards, and keep paging for things that require human action.
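Hysteresis is easy to hand-wave, so here is a small sketch of it: the alert fires only after N consecutive failed checks and clears only after M consecutive passes. The class name and default counts are assumptions for illustration; most alerting systems offer an equivalent setting.

```python
class ConsecutiveFailureAlert:
    """Fire after fire_after straight failures; clear after clear_after straight passes."""

    def __init__(self, fire_after: int = 3, clear_after: int = 2):
        self.fire_after = fire_after
        self.clear_after = clear_after
        self.firing = False
        self._failures = 0
        self._passes = 0

    def observe(self, check_passed: bool) -> bool:
        """Record one check result and return whether the alert is firing."""
        if check_passed:
            self._passes += 1
            self._failures = 0
            if self.firing and self._passes >= self.clear_after:
                self.firing = False
        else:
            self._failures += 1
            self._passes = 0
            if not self.firing and self._failures >= self.fire_after:
                self.firing = True
        return self.firing


alert = ConsecutiveFailureAlert()
checks = [False, False, True, False, False, False, True, True]
print([alert.observe(ok) for ok in checks])
```

The two isolated failures at the start never page anyone, and a single passing check doesn't clear an alert that is really still in progress.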
32. How do you onboard someone to on-call?
Shadow shifts before primary. Run a tabletop exercise on a past incident. Verify they can access all systems they'll need (auth issues at 3am are brutal). Pair their first 1-2 primary shifts with an experienced backup.
33. What's the "follow the sun" model and when is it worth it?
Distribute on-call across geographies so each rotation is daytime hours locally. Worth it when you have 24/7 critical traffic and engineers in 2-3 regions. Not worth it for small teams - the coordination overhead exceeds the night-shift relief.
34. What's "operational readiness" and why does every new service need it?
A checklist that a service must satisfy before going to production: SLOs defined, dashboards exist, alerts route to the right team, runbooks for known failures, capacity headroom verified, security review complete. Skip ops-readiness, you import problems into prod.
35. How do you handle an incident where multiple things broke simultaneously?
Triage by user impact. Which broken thing is causing the most pain? Mitigate that first. Don't try to debug everything in parallel - pick a queue order, communicate it, work it.
36. How do you handle incidents that span multiple teams?
Single incident commander, even across teams. Each team has a sub-lead reporting to the IC. Comms go through one channel, not five. Without this, you get the "everyone's working on it, nobody's coordinating" failure mode.
37. When should you NOT roll back?
When the rollback itself is risky (database migration that's already partially applied, traffic patterns the old version can't handle). In those cases, roll forward with a fix.
Capacity, Performance, and Tooling (Q38-50)
38. How do you do capacity planning?
Start with current usage and growth rate. Add safety margin (typically 30-50% headroom). Account for known events (Black Friday, product launches). Stress-test to know your real ceiling, not the theoretical one. Re-plan quarterly.
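The planning arithmetic, as a sketch. It assumes compound monthly growth, treats headroom as a multiplier on projected peak (30% headroom means provisioning 1.3x), and models a known event as a simple traffic multiplier. All three are simplifications, and the numbers in the example are made up.

```python
def required_capacity(current_peak: float, monthly_growth: float, months_ahead: int,
                      headroom: float = 0.3, event_multiplier: float = 1.0) -> float:
    """Capacity to provision, in the same unit as current_peak (e.g. requests/sec)."""
    projected_peak = current_peak * (1 + monthly_growth) ** months_ahead
    return projected_peak * event_multiplier * (1 + headroom)


# 1,000 rps today, 10% monthly growth, planning one quarter ahead.
print(round(required_capacity(1000, 0.10, 3)))
# Same plan, but the quarter includes an event expected to double peak traffic.
print(round(required_capacity(1000, 0.10, 3, event_multiplier=2.0)))
```

Compare the result against the ceiling you measured in a stress test, not the one on the spec sheet.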
39. What's the difference between vertical and horizontal scaling?
Vertical: bigger machine. Horizontal: more machines. Vertical hits hardware limits and creates a single point of failure. Horizontal scales further but requires the workload to be distributable.
40. What's "graceful degradation"?
When a dependency fails, the service serves a reduced experience instead of failing entirely. Example: search returns popular results from cache when the personalization service is down.
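The search example might look something like this. `personalize` and `popular_cache` are hypothetical stand-ins for the personalization client and a precomputed cache; catching every exception is a shortcut for the sketch, and real code would catch timeouts and connection errors specifically.

```python
def search(query: str, personalize, popular_cache: dict) -> dict:
    """Return personalized results, falling back to cached popular ones."""
    try:
        return {"results": personalize(query), "degraded": False}
    except Exception:
        # Dependency is down: serve a reduced experience instead of an error page.
        return {"results": popular_cache.get(query, []), "degraded": True}


def broken_personalizer(query):
    raise TimeoutError("personalization service unavailable")


cache = {"shoes": ["bestseller-1", "bestseller-2"]}
print(search("shoes", broken_personalizer, cache))
```

Returning a `degraded` flag lets you count how often the fallback is served, which is exactly the kind of signal that belongs on a dashboard.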
41. How do you load test a system without breaking production?
Shadow traffic (replay real prod traffic to a staging copy). Production load tests during off-peak windows with a clear blast radius limit. Pre-production environments that mirror prod scale (rare but valuable). Synthetic load with realistic distributions.
42. What's a "kill switch" or "feature flag" for capacity reasons?
A mechanism to disable expensive code paths without redeploying. When traffic spikes beyond capacity, flip the switch to disable non-essential features (recommendation engines, decorations, analytics) and free up capacity for core flows.
43. What's the difference between request-level and connection-level rate limiting?
Request-level: count requests per user/key. Connection-level: cap concurrent open connections. Request limits stop abuse, connection limits stop slow clients from exhausting your worker pool.
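A side-by-side sketch of the two. The request limiter here is a token bucket, which is one common choice among several; the connection limiter is just a counter with a ceiling. Neither class is thread-safe, and the rates and limits are placeholders.

```python
import time


class TokenBucket:
    """Request-level limit: a steady rate with a bounded burst, one bucket per user/key."""

    def __init__(self, rate_per_second: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill for the time elapsed, never beyond the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ConnectionLimiter:
    """Connection-level limit: cap how many connections are open at once."""

    def __init__(self, max_open: int):
        self.max_open = max_open
        self.open_count = 0

    def try_open(self) -> bool:
        if self.open_count >= self.max_open:
            return False
        self.open_count += 1
        return True

    def close(self) -> None:
        self.open_count = max(0, self.open_count - 1)
```

Notice that a slow client can stay well under its request limit while holding a connection open for minutes, which is why the second limiter exists.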
44. How do you think about cost vs. reliability?
Reliability has a real cost (extra capacity, redundancy, on-call hours). Costs should map to SLOs - higher SLOs justify more spend. If your SLO is 99.5%, you don't need multi-region active-active; if it's 99.99%, you probably do.
45. What's an "error budget policy"?
A pre-agreed rule for what happens when error budget is exhausted: feature freeze, postmortem priority, deploy gate. Decisions made in advance avoid arguments during stressful moments.
46. How do you build trust with developers as an SRE?
Show up for incidents. Make their work easier (better tooling, better dashboards, faster builds). Don't gatekeep - give them the tools to self-serve operational tasks. Be the person who makes shipping smoother, not the person who blocks it.
47. What's "toil reduction" in practice?
Identify the most-frequent manual task (look at on-call logs, ticket queues). Automate it or eliminate it. Track the time saved. Repeat. Done well, toil reduction quietly compounds - what took 4 hours/week now takes 0.
48. What's a "playbook" and why does it matter?
A pre-written response procedure for a known incident type. Reduces time-to-mitigate, reduces decisions made under stress, and lets less-experienced engineers handle situations that previously required senior intervention.
49. How do you handle technical debt that's slowing reliability work?
Quantify the cost (time spent on workarounds, on-call burden, missed SLOs). Tie it to a business outcome leadership cares about. Get a budget. Pay it down systematically, not all at once.
50. What's the most common mistake new SREs make?
Trying to fix everything immediately. Real SRE impact compounds over months and years - tightening alerting, improving observability, raising the floor on reliability. New SREs who burn out trying to fix every paper cut don't last. Pick the highest-leverage thing each quarter and ship it.
Key Takeaways
- SRE interviews favor candidates who think in terms of user impact, not technical purity
- Real on-call experience shows in answers - hand-wavy responses signal you haven't actually been paged
- The hardest questions are operational, not theoretical - don't skip the "how would you respond?" prep
- Postmortems, SLOs, and error budgets are table stakes - know them cold
- The best preparation is operating something real long enough to learn what breaks
The questions above are starting points - the real depth comes from the failure modes you've personally seen.