DevOps interviews in 2026 look almost nothing like they did three years ago. Companies stopped asking "what is a Dockerfile?" and started asking "how would you cut our build pipeline from 22 minutes to under 5?" The bar moved up because the tooling moved down - everyone has CI/CD now, so what hiring managers want is judgment about how systems actually run in production.
This guide covers the 50 questions you should expect across the four buckets that show up in nearly every DevOps loop: pipelines and delivery, infrastructure and configuration, observability and reliability, and security, cost, and incident response. Answers are short on purpose. Use them to identify the gaps in your own reasoning, not as scripts to memorize.
CI/CD and Build Pipelines (Q1-12)
1. What's the difference between continuous delivery and continuous deployment?
Continuous delivery means every commit produces a deployable artifact, but a human approves the release. Continuous deployment removes the human gate - every passing commit ships to production automatically.
2. How do you keep build times under 5 minutes for a large repo?
Cache aggressively (dependency caches, layer caches, test result caches), parallelize independent steps, run only the test suite affected by the changed files (test impact analysis), and move slow integration tests to a separate post-merge pipeline.
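Test impact analysis is the least familiar item on that list, so here is a minimal sketch of the idea. The directory-to-suite mapping and the suite names are hypothetical; real tools usually derive this mapping from coverage data or the build graph rather than a hand-written table.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from source directories to the test suites that cover them.
COVERAGE_MAP = {
    "services/billing": {"tests/billing"},
    "services/auth": {"tests/auth", "tests/billing"},  # billing depends on auth
    "libs/common": {"tests/auth", "tests/billing", "tests/search"},
}
ALL_SUITES = set().union(*COVERAGE_MAP.values())


def affected_tests(changed_files):
    """Return the sorted test suites to run for a list of changed file paths."""
    selected = set()
    for changed in changed_files:
        parents = PurePosixPath(changed).parents
        matched = [
            suites
            for source_dir, suites in COVERAGE_MAP.items()
            if PurePosixPath(source_dir) in parents
        ]
        if not matched:
            # Unknown file (build config, tooling, ...): be conservative, run everything.
            return sorted(ALL_SUITES)
        for suites in matched:
            selected |= suites
    return sorted(selected)


print(affected_tests(["services/auth/login.py"]))
```

Falling back to the full suite for unmapped files is a design choice: it trades some speed for the guarantee that a change never skips tests just because the map is incomplete.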
3. What is a "fast feedback loop" and why does it matter for DevOps culture?
The shorter the time between writing code and seeing the result of that code in a real environment, the faster engineers learn. Slow pipelines silently degrade code quality because engineers stop iterating and start batching.
4. How do you prevent flaky tests from blocking releases?
Track flake rate per test, auto-quarantine tests that fail more than a set threshold (X% of runs), and require a fix before they re-enter the main pipeline. Never allow "just rerun it" as a permanent strategy.
5. What is trunk-based development?
All engineers commit to a single shared branch (main/trunk), with feature flags hiding incomplete work. Branches live for hours, not weeks. Merges are continuous and conflicts are rare.
6. When would you choose GitHub Actions over Jenkins (or vice versa)?
GitHub Actions wins when your repos are already on GitHub and you want hosted runners with minimal ops overhead. Jenkins wins for complex multi-repo workflows, tight integration with on-prem systems, and teams that need full control of the build environment.
7. How do you secure secrets in a CI/CD pipeline?
Use the platform's native secrets store (Actions secrets, GitLab CI variables, Jenkins credentials), never echo secrets in build logs, scope secrets per environment, rotate them on a schedule, and use short-lived OIDC tokens for cloud access where supported.
8. Explain blue/green deployment.
Run two identical production environments. Deploy new code to the inactive one, smoke-test it, then flip the load balancer. Rollback is instant - flip back.
9. What's a canary deployment and when is it better than blue/green?
A canary routes a small percentage of traffic to the new version, monitors for errors, and gradually increases. It's better when failures are slow to surface or hard to detect with synthetic tests - the canary catches issues real users find.
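The control loop behind a canary can be sketched in a few lines. Everything here is an assumption for illustration: the traffic steps, the 1% error ceiling, and the `error_rate_at` callback, which stands in for whatever your metrics system reports while the canary serves a given share of traffic.

```python
def run_canary(error_rate_at, steps=(1, 5, 25, 50, 100), max_error_rate=0.01):
    """Raise canary traffic step by step and abort on the first unhealthy step.

    error_rate_at: hypothetical callback returning the observed error rate
    (0.0 to 1.0) while the canary serves the given traffic percentage.
    Returns (traffic_percent, promoted).
    """
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            return 0, False  # roll back: all traffic returns to the old version
    return 100, True


healthy = run_canary(lambda percent: 0.002)
# A bug that only appears under load: errors spike once traffic reaches 25%.
load_bug = run_canary(lambda percent: 0.002 if percent < 25 else 0.08)
print(healthy, load_bug)
```

The second case is the kind of failure a blue/green smoke test would likely miss, because it only surfaces under real traffic.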
10. What is a deployment freeze and when should you have one?
A period where production deploys are paused. Common during major launches, holiday peak traffic, on-call gaps, or after a serious incident while the postmortem completes.
11. How do you handle database schema changes in CI/CD?
Decouple schema migrations from code deploys. Make migrations backward-compatible, run them before the code that depends on them ships, and never combine breaking schema changes with code in the same release.
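One common way to get backward-compatible migrations is the expand/contract pattern, which the answer above implies but does not name. The sketch below uses an in-memory SQLite database; the `users` table and `display_name` column are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Release 1 (expand): add a nullable column. Old code never mentions it,
# so it keeps working while the migration is live.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# The old code path still runs unchanged against the new schema.
conn.execute("INSERT INTO users (name) VALUES ('Grace Hopper')")

# Backfill existing rows before any code depends on the new column.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")

# Release 2: new code reads the new column.
rows = conn.execute("SELECT display_name FROM users ORDER BY id").fetchall()
print(rows)

# Release 3 (contract): drop the old column only after no deployed code reads it.
```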
12. What's the role of feature flags in modern CI/CD?
They separate "deploy" from "release." Code can ship to production turned off, then be enabled gradually for specific user segments. They're how mature teams ship continuously without breaking production.
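Gradual enablement is usually done by hashing the user into a stable bucket. This is a sketch of that idea, not any particular flag vendor's algorithm; the flag name and the 100-bucket scheme are assumptions.

```python
import hashlib


def is_enabled(flag, user_id, rollout_percent):
    """Deterministically decide whether a user sees a flag at a rollout percentage."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent


# The same user gets the same answer on every call, and raising the
# percentage only ever adds users; nobody flips back off mid-rollout.
print(is_enabled("new-checkout", "user-42", 10))
```

Hashing the flag name together with the user ID keeps different flags from always selecting the same early users.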
Infrastructure as Code and Configuration (Q13-25)
13. Why is Terraform state important and where should you store it?
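State is how Terraform tracks what it has created. Lose state and Terraform no longer knows which resources it owns, so you have to re-import them before it can manage them again. Store it in a remote backend with locking (S3, Terraform Cloud, GCS) - never in git, which has no locking and would expose the secrets that state can hold in plaintext. Older S3 setups lock through a DynamoDB table; recent Terraform releases also support native S3 lock files.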
14. What's the difference between Terraform and Pulumi?
Terraform uses HCL, a domain-specific language. Pulumi uses real programming languages (TypeScript, Python, Go). Pulumi gives you better abstraction primitives at the cost of more rope to hang yourself.
15. When would you use Helm over raw Kubernetes manifests?
When you need templating, versioned releases, or you're deploying the same app across many environments with different configs. For a single internal service, raw manifests with Kustomize are often simpler.
16. What's the difference between immutable and mutable infrastructure?
Immutable: you replace servers instead of modifying them. Mutable: you SSH in and change things. Immutable is the modern standard because it eliminates configuration drift.
17. Explain the "phoenix server" pattern.
A server that can be destroyed and recreated automatically without losing functionality. The opposite of a snowflake server (one-of-a-kind, hand-tuned, scary to touch).
18. How do you handle Terraform in a team setting without conflicts?
Use remote state with locking, break your codebase into small modules with clear ownership, run plans in CI on every PR, and apply only from a controlled CI environment, never from a developer's laptop.
19. What's a Kubernetes Operator?
A controller that extends Kubernetes with custom logic for managing complex applications. Operators encode the operational knowledge ("how to safely upgrade Postgres") that a human SRE would otherwise have to apply manually.
20. How does Kubernetes handle pod scheduling?
The scheduler watches for unscheduled pods and finds a node based on resource requests, taints/tolerations, node affinity, and topology constraints. You can influence scheduling, but you don't control it directly.
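The filter-then-score shape of scheduling can be shown with a toy model. This is a heavy simplification under assumed data structures: the real kube-scheduler runs many filter and score plugins, and taints have keys, values, and effects rather than plain strings.

```python
def pick_node(pod, nodes):
    """Return the name of the chosen node, or None if the pod would stay Pending."""
    # Filter: drop nodes that lack resources or carry taints the pod does not tolerate.
    feasible = [
        node
        for node in nodes
        if node["free_cpu_m"] >= pod["cpu_m"]
        and node["free_mem_mi"] >= pod["mem_mi"]
        and set(node.get("taints", ())) <= set(pod.get("tolerations", ()))
    ]
    if not feasible:
        return None
    # Score: prefer the node left with the most free CPU (a simple spreading heuristic).
    return max(feasible, key=lambda node: node["free_cpu_m"] - pod["cpu_m"])["name"]


nodes = [
    {"name": "node-a", "free_cpu_m": 500, "free_mem_mi": 1024},
    {"name": "node-b", "free_cpu_m": 2000, "free_mem_mi": 4096},
    {"name": "node-gpu", "free_cpu_m": 8000, "free_mem_mi": 16384, "taints": ["gpu-only"]},
]
print(pick_node({"cpu_m": 250, "mem_mi": 512}, nodes))
```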
21. What's the difference between a StatefulSet and a Deployment?
Deployments treat pods as interchangeable cattle. StatefulSets give pods stable identities and persistent storage - use them for databases, queues, or anything where pod identity matters.
22. How do you approach secret management at scale?
External secret store (AWS Secrets Manager, HashiCorp Vault, Doppler), short-lived dynamic credentials where possible, rotation policies enforced automatically, audit logs for every secret access.
23. What's GitOps and what problem does it solve?
Your git repository is the single source of truth for infrastructure state. A controller (ArgoCD, Flux) reconciles the cluster to match git. It solves the "who changed what and when" problem and makes rollback a git revert.
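At its core, reconciliation is a diff between desired and live state. The sketch below plans the actions a controller would take; the dictionaries stand in for manifests in git and objects in the cluster, and none of this reflects the actual internals of ArgoCD or Flux.

```python
def plan_reconcile(desired, live):
    """Return the sorted (action, name) pairs needed to make live match desired."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))  # drift: someone changed the cluster
    # Anything live that git no longer declares gets pruned.
    actions.extend(("delete", name) for name in live if name not in desired)
    return sorted(actions)


desired = {
    "api": {"image": "api:1.4", "replicas": 3},
    "worker": {"image": "worker:2.0", "replicas": 2},
}
live = {
    "api": {"image": "api:1.4", "replicas": 5},  # scaled by hand, not in git
    "cron": {"image": "cron:0.9", "replicas": 1},
}
print(plan_reconcile(desired, live))
```

Real controllers often make the delete step opt-in, since pruning on a bad commit can remove running workloads.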
24. When does configuration management (Ansible, Chef) still make sense in 2026?
Hybrid environments, edge devices, on-prem hardware, anything not running in containers. Pure cloud-native shops have largely moved past configuration management for application servers.
25. How do you manage IAM at scale across many AWS accounts?
AWS Organizations + IAM Identity Center for human access, IAM roles with trust policies for cross-account service access, infrastructure-as-code for all role definitions, and automated drift detection.
Observability and Reliability (Q26-37)
26. What are the three pillars of observability?
Metrics (numerical time-series), logs (structured event records), and traces (request flow across services). The fourth pillar increasingly cited in 2026 is profiles (continuous CPU/memory profiling).
27. What's the difference between monitoring and observability?
Monitoring asks pre-defined questions ("is CPU above 80%?"). Observability lets you ask new questions of your system without shipping new code. Observability is what you need when you don't know what's broken yet.
28. How do you set good SLOs?
Pick metrics users actually care about (latency, availability, correctness), set targets slightly stricter than the point where users start to notice problems, and use error budgets to balance reliability work against feature work. SLOs aren't internal bragging rights - they're a contract.
29. What's an error budget?
The acceptable amount of unreliability per period. If your SLO is 99.9% availability, you have about 43 minutes of downtime per 30-day month before you blow the budget. Burn the budget and you pause feature work to focus on reliability.
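The arithmetic behind that figure is worth being able to do on a whiteboard. A small sketch, assuming a 30-day period and availability measured in wall-clock minutes:

```python
def error_budget_minutes(slo, period_days=30):
    """Minutes of allowed downtime for an availability SLO over the period."""
    return (1 - slo) * period_days * 24 * 60


def budget_remaining(slo, downtime_minutes, period_days=30):
    """Budget left after the downtime so far; negative means the budget is blown."""
    return error_budget_minutes(slo, period_days) - downtime_minutes


print(round(error_budget_minutes(0.999), 1))   # three nines
print(round(error_budget_minutes(0.9999), 2))  # four nines
```

Each extra nine divides the budget by ten, which is why a four-nines target leaves only a few minutes per month.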
30. Explain the difference between latency, throughput, and saturation.
Latency: how long one request takes. Throughput: how many requests per second. Saturation: how close the system is to its limits. They're related but distinct - high throughput with low saturation is fine, low throughput with high saturation is a bottleneck.
31. What's the USE method?
For each resource, check Utilization, Saturation, and Errors. It's a quick triage method - within 60 seconds of an alert, you should know which resource is the problem.
32. What's the RED method?
For each service, watch Rate, Errors, and Duration. It's a complement to USE - USE for resources, RED for services.
33. How do you debug high tail latency?
Trace a slow request end-to-end, look for queueing, check garbage collection patterns, profile the slowest endpoints, look for resource contention (locks, connection pools), and check whether tail latency correlates with traffic spikes or cron jobs.
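Before debugging tail latency it helps to be precise about what a percentile is and why averages hide the problem. This sketch uses the nearest-rank definition on made-up latency samples; monitoring systems typically estimate percentiles from histograms instead.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones, in milliseconds.
latencies_ms = [100] * 98 + [3000] * 2

print(mean(latencies_ms))            # looks healthy
print(percentile(latencies_ms, 50))  # looks healthy
print(percentile(latencies_ms, 99))  # shows the slow tail
```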
34. What's a chaos engineering experiment?
Intentionally inject failures (kill pods, drop network packets, throttle CPU) to verify your system handles them. Run experiments in staging first, then introduce them to production gradually and with safeguards.
35. Why are dashboards alone insufficient?
Dashboards answer questions you already know to ask. For unknown unknowns, you need ad-hoc query tools (Honeycomb, ClickHouse, Loki) where engineers can slice and dice raw data.
36. What's the difference between a Sev-1 and a Sev-3 incident?
Sev-1: customer-impacting, all-hands, immediate response. Sev-3: minor or no customer impact, fix during business hours. Definitions vary by company - get familiar with your team's exact thresholds before being on-call.
37. What does a good postmortem look like?
Blameless tone, clear timeline, root cause (not just proximate cause), action items with owners and dates, and lessons that apply beyond this specific incident. The point isn't to file paperwork - it's to make the system stronger.
Security, Cost, and Incident Response (Q38-50)
38. What's the principle of least privilege?
Every user, role, and service gets the minimum permissions required to do its job - nothing more. The default should be deny; access is granted intentionally, not inherited.
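"Default deny" can be made concrete with a tiny policy evaluator. The policy format below is hypothetical and only loosely inspired by cloud IAM documents; it is not any provider's actual evaluation logic.

```python
from fnmatch import fnmatchcase


def is_allowed(policies, action, resource):
    """Default deny: allow only on an explicit match, and let any explicit deny win."""
    allowed = False
    for policy in policies:
        if fnmatchcase(action, policy["action"]) and fnmatchcase(resource, policy["resource"]):
            if policy["effect"] == "deny":
                return False
            allowed = True
    return allowed


# Hypothetical role for a report generator: read one bucket, nothing else.
report_role = [
    {"effect": "allow", "action": "storage:Get*", "resource": "reports/*"},
    {"effect": "deny", "action": "storage:*", "resource": "reports/payroll/*"},
]
print(is_allowed(report_role, "storage:GetObject", "reports/2026/q1.csv"))
```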
39. How do you handle secrets in container images?
Don't bake them in. Pass them at runtime via environment variables from a secrets manager, mount them as files from a CSI driver, or use workload identity to fetch them at startup.
40. What's image scanning and when should it happen?
Scan container images for known vulnerabilities (CVEs) at build time and again before deploy. Modern setups also scan at runtime to catch newly disclosed CVEs in already-deployed images.
41. What's an SBOM and why does it matter?
Software Bill of Materials - a manifest of every dependency in a build artifact. When the next Log4j-style vulnerability hits, an SBOM lets you instantly answer "are we exposed?" instead of grepping through repos for hours.
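To show what "instantly answer" looks like, here is a sketch that searches a CycloneDX-style JSON SBOM for a package. The sample document is invented, and the lookup only checks top-level components; a real query would also walk nested components and run across every stored SBOM.

```python
import json

# Invented sample in the CycloneDX JSON layout (top-level "components" list).
SBOM_JSON = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "log4j-core", "version": "2.14.1"},
    {"type": "library", "name": "jackson-databind", "version": "2.15.2"}
  ]
}
"""


def versions_of(sbom_json, package_name):
    """Return every version of package_name listed among the top-level components."""
    components = json.loads(sbom_json).get("components", [])
    return [c.get("version") for c in components if c.get("name") == package_name]


print(versions_of(SBOM_JSON, "log4j-core"))
```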
42. How do you reduce cloud costs without breaking things?
Right-size compute (most workloads are over-provisioned by 30-50%), commit to reserved capacity where usage is predictable, kill orphaned resources, set cost alerts per team, and tag resources so you can attribute spend.
43. What's a runbook and why does every on-call rotation need them?
A step-by-step procedure for handling a specific incident. They turn 3am pages from "what do I do?" into "follow the runbook." Every recurring alert should have one.
44. How do you onboard a new engineer to on-call?
Shadow shifts before primary, paired with an experienced engineer. A practice incident in a non-prod environment. Clear escalation paths. Explicit permission to wake people up - new engineers under-escalate.
45. What's the difference between MTTR and MTTD?
MTTD: Mean Time To Detect (alert fires). MTTR: Mean Time To Resolve (impact ends). Improving MTTD is about better monitoring; improving MTTR is about better tooling, runbooks, and team reflexes.
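Both metrics fall out of three timestamps per incident. The incident records below are made up, and the sketch assumes both clocks start when impact begins; some teams start the MTTR clock at detection instead, so state your definition when you quote a number.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average minutes between two timestamps across all incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


# Hypothetical incident log.
incidents = [
    {
        "started": datetime(2026, 1, 5, 10, 0),
        "detected": datetime(2026, 1, 5, 10, 5),
        "resolved": datetime(2026, 1, 5, 10, 45),
    },
    {
        "started": datetime(2026, 1, 9, 14, 0),
        "detected": datetime(2026, 1, 9, 14, 15),
        "resolved": datetime(2026, 1, 9, 15, 15),
    },
]

mttd = mean_minutes(incidents, "started", "detected")
mttr = mean_minutes(incidents, "started", "resolved")
print(mttd, mttr)
```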
46. How do you handle a security incident?
Contain (stop the bleeding), eradicate (remove the cause), recover (restore service), and document (postmortem and disclosure). The first hour is usually about containment - don't worry about clean attribution yet.
47. What's "shift-left" security?
Move security checks earlier in the development cycle - SAST in the IDE, dependency scanning at PR time, IaC scanning before merge - so vulnerabilities get caught before code reaches production.
48. How do you manage on-call burnout?
Track page volume per engineer per week, fix or auto-resolve recurring noisy alerts, ensure 6+ engineers in the rotation so each engineer's shift comes around infrequently, and pay for on-call (it's labor).
49. What's a "platform engineering" team and how is it different from a traditional ops team?
A platform team builds internal products that make application engineers self-sufficient (deploy tools, observability dashboards, IaC modules). Traditional ops teams operate systems on behalf of others. Platform thinking treats internal engineers as customers.
50. Walk me through how you'd debug an alert that says "API latency p99 above 2s."
Confirm it's real (look at the actual metric, not just the alert). Check recent deploys - is this a regression? Look at downstream services - is it cascading? Check resource saturation on the affected service. If still unclear, sample slow requests and trace them. Communicate status the whole time.
Key Takeaways
- The bar shifted from "do you know the tools?" to "can you reason about production behavior?"
- Pipeline questions favor candidates who think about feedback loops, not those who memorize YAML syntax
- Observability questions reward people who've actually been on-call - vague answers signal you haven't
- Security and cost are no longer separate tracks - DevOps engineers in 2026 are expected to think about both
- The single best preparation: spend time genuinely operating a non-trivial system, even a side project
The questions on the list above are starting points - the real depth comes from building and operating real systems.