gitGood.dev

Design an Observability Platform (Metrics, Logs, Traces)

Time-series DBs (Prometheus, M3, VictoriaMetrics), trace sampling, exemplars, OpenTelemetry, alerting, and the cardinality explosion that turns a $10K/month platform into a $1M/month outage.

The problem

Design the observability platform for a 10,000-engineer company. Around a thousand services, hundreds of billions of metric data points per day, ~100 TB of logs per day, and traces spanning 50+ hops per user request. Engineers must diagnose incidents in minutes, not hours. The platform itself must cost less than 5% of total infra spend - which is a hard constraint at this scale, not a nice-to-have.

This is the question that separates "I configured Datadog" from "I have run an observability fleet". Strong candidates separate the three signals (metrics, logs, traces), explain cardinality math, design the sampling story for traces, and own the cost model. Excellent candidates discuss exemplars (the bridge that links a metric anomaly to the trace that caused it) and explain why pure tail-based sampling has scaled poorly historically.

Clarifying questions

Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.

→What's the scale - thousands of services, hundreds of thousands? Million-host fleet or smaller?
→Are we building all three signals (metrics, logs, traces) or starting with one?
→Retention - 7 days, 30 days, 1 year per signal? Compliance requirements?
→Engineers expected to write their own queries, or are we curating dashboards?
→Multi-tenant (each team isolated) or shared?
→Self-hosted on-prem, managed cloud (Datadog/New Relic), or open source self-managed (Grafana stack)?
→Real-time alerting (sub-minute) or eventual (5-15 min lag acceptable)?
→Does the SRE org own the platform, or is it a paid-for product to internal teams?

Requirements

Functional requirements

·Ingest metrics (numeric time series, labels), logs (structured + unstructured), and traces (request spans across services)
·Query each signal type with appropriate languages (PromQL, LogQL, TraceQL)
·Cross-link signals: a metric anomaly should jump to the relevant logs and traces
·Alerting on metric expressions with deduplication and routing
·Dashboards: composable, version-controlled, sharable
·Per-team multi-tenancy with quotas, isolation, and cost attribution
·Sampling strategies for traces (head-based, tail-based, rule-based)

Non-functional requirements

Scale: 10M metric data points/sec ingest. 100 TB/day logs. 1B traces/day, ~100B spans/day before sampling (about 100 spans per trace). 1000 unique services. 10K engineers as users. 100K alerting rules.
Latency: Metric ingest → query visibility: <30s. Log ingest → search: <1min. Trace assembly → query: <2min. Alert firing latency: <60s for critical. Dashboard query p99: <5s.
Availability: 99.9% for ingest (data loss is unacceptable during incidents - that's exactly when you need it most). 99% for query (degraded query is recoverable; lost ingest is not).
Consistency: Eventual. A metric written now might be queryable in 5s. A log might be searchable in 30s. Traces require all spans before assembly - up to minute-scale lag is acceptable.

Capacity estimation

Metrics

10M data points/sec × 8 bytes/point (uncompressed value) = 80 MB/s ingest ≈ 7 TB/day raw.
After downsampling and compression: ~1.5 TB/day effective storage.
30-day retention = 45 TB hot storage. Sharded across ~20 ingest nodes.

Logs

100 TB/day ingest. Most logs are read 0 times.
Hot tier (7 days, indexed): 700 TB. Sharded across ~50 storage nodes with SSD.
Warm tier (30 days, indexed-on-demand): 3 PB on object storage.
Cold tier (1 year, raw archive): 36 PB on Glacier-class. Recoverable in hours for compliance.

Traces

1B traces/day × 100 spans avg = 100B spans/day.
After sampling (a ~1% baseline plus tail-sampling rules that retain error and slow traces): ~5B kept spans/day.
5B × 1KB/span = 5 TB/day. 30-day retention = 150 TB.

Cardinality (the silent killer)

A metric with 5 labels, each with 10 values = 100K distinct time series.
Add a label like user_id (10M values) and the worst case becomes 100K × 10M = 1 trillion time series for one metric.
Each unique time series costs ~1KB of memory in the ingest layer regardless of write rate. Even 1B series = 1 TB of memory = melted Prometheus; 1 trillion would need ~1 PB.

Compute

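Ingest: ~30-50 nodes per signal type (50 metrics ingest, 50 logs ingest, 30 trace ingest).
Query: separate fleet, scaled to dashboard + alert load. ~30 query nodes, sharded by tenant.
Storage: mostly object-storage-backed. From the estimates above, roughly 4 PB across hot and warm tiers (45 TB metrics + 700 TB and 3 PB logs + 150 TB traces), plus the ~36 PB cold log archive.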

Cost order-of-magnitude

Self-hosted Grafana stack on commodity hardware: ~$500K-$2M/year for the scale above.
Managed Datadog at the same scale: $5M-$15M/year (per-host + per-million-spans + per-GB-logs all add up).
The 10x gap is why every company at scale eventually self-hosts metrics or moves logs to a cheaper tier.

High-level architecture

The platform is three pipelines (metrics, logs, traces) that share infrastructure (ingestion gateway, storage tiering, query plane, observability of itself) but differ sharply in their data shapes and query patterns.

The defining decisions: (1) where to enforce cardinality limits (at ingest or via aggregation), (2) head vs tail vs rule-based trace sampling (each has sharp trade-offs), (3) whether logs and traces share storage (combining can save 30% but locks you in), (4) the alerting fan-out architecture (per-rule evaluation vs centralized engine), and (5) the multi-tenant isolation model (separate stacks per team vs shared with quotas).

The defining operational property: the platform must work during incidents. A platform that becomes degraded when its dependencies are degraded is useless precisely when needed. This forces independence (separate region, separate VPC, separate everything) - and that independence drives cost.

OpenTelemetry collector / agent

Universal ingest agent on every host. Receives metrics, logs, traces from instrumented apps. Performs local aggregation, sampling, and forwarding. The single ingest point standardizes the protocol and pushes work to the edge.

Metrics ingest gateway

Receives metric writes (Prometheus remote_write, InfluxDB line protocol, OTLP). Validates labels (cardinality cap), compresses, batches, writes to TSDB. Stateless; sits in front of a sharded TSDB.

Time-series database (Prometheus / M3 / VictoriaMetrics / Mimir)

Sharded ingest of high-cardinality time series. Stores compressed blocks on local SSD; offloads older blocks to object storage. Serves PromQL queries. The hot path of the metrics signal.

Logs ingest pipeline

Receives log lines (JSON or text). Parses, enriches with metadata (service, host, env). Routes to hot/warm storage. Optional indexed search (Elasticsearch / Loki / VictoriaLogs).

Trace ingest + assembler

Receives spans (OTLP). Buffers them by trace ID until the trace is complete (or timeout). Applies tail-based sampling rules (keep all error traces, slow traces, rare traces). Writes kept traces to storage.

Storage tiering

Hot (SSD, indexed, queries in sub-second), warm (object storage, indexed-on-demand, queries in seconds), cold (object archive, queries in minutes). Lifecycle policies move data automatically.

Query engine

Per-signal query API. PromQL for metrics, LogQL for logs, TraceQL for traces. Federates across shards; pushes filters down; caches frequent queries.

Alerting engine

Evaluates alert rules on a schedule (typically every 30s for critical). Deduplicates, groups, routes to PagerDuty / Slack. Supports thresholds, anomaly detection, multi-condition rules.

Dashboard layer (Grafana)

User-facing visualization. Queries the query engine; renders charts, tables, traces. Version-controlled dashboards as code.

Self-monitoring stack

A separate, smaller observability deployment that monitors the main observability platform. Required so the team has signals when the main platform is unhealthy. Lives in a different region/VPC.

Deep dives

The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.

1. Cardinality explosion: the failure mode that kills naive deployments

Cardinality is the count of unique time series. It dominates cost and is the single most common cause of platform outages.

The math
A metric is identified by its name + label set. Each unique combination of labels is a separate time series.

http_requests_total{method, status, endpoint, user_id, region}

If method has 5 values, status has 10, endpoint has 100, user_id has 10M, and region has 5:
5 × 10 × 100 × 10M × 5 = 250 billion time series.

Even if each series is small (~1KB resident memory), 250B series = 250 TB. The TSDB OOMs.
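The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the label counts come from the example, the ~1KB per series is the article's working figure, and the function names are made up for illustration. The product is a worst-case bound; a real TSDB only stores the combinations that actually occur.

```python
from math import prod

# Label cardinalities from the http_requests_total example above.
LABEL_VALUES = {
    "method": 5,
    "status": 10,
    "endpoint": 100,
    "user_id": 10_000_000,
    "region": 5,
}

BYTES_PER_SERIES = 1_000  # assumed ~1KB resident memory per series


def worst_case_series(labels: dict) -> int:
    """Upper bound on distinct series: product of per-label value counts."""
    return prod(labels.values())


def memory_tb(series: int, bytes_per_series: int = BYTES_PER_SERIES) -> float:
    """Resident memory in decimal terabytes."""
    return series * bytes_per_series / 1e12


with_user_id = worst_case_series(LABEL_VALUES)
without_user_id = worst_case_series(
    {name: n for name, n in LABEL_VALUES.items() if name != "user_id"}
)

print(f"with user_id:    {with_user_id:,} series, {memory_tb(with_user_id):,.0f} TB")
print(f"without user_id: {without_user_id:,} series")
```

Dropping the single unbounded label takes the bound from 250 billion series to 25,000, which is why the mitigation work concentrates on label hygiene rather than on bigger hardware.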

The most common offender: unbounded labels

user_id, request_id, session_id, transaction_id - high-cardinality identifiers attached to metrics.
URLs with path parameters (/users/123, /users/124, ...) - each becomes a unique label value.
Stack traces in error metrics.
Email addresses, IP addresses in development metrics.

These almost always sneak in via well-meaning instrumentation: "let's tag this metric with user_id so we can break it down by user". The result is a platform-killing label.

Detection

Track time-series-count per metric. Alert when a metric exceeds 100K series.
Track time-series-growth-rate. A metric whose series count doubled in an hour is a fire drill.
Per-tenant series count + budget enforcement.

Mitigation

Cap at ingest: reject writes that exceed per-metric cardinality budgets. Painful but effective.
Drop-the-label: when ingest detects an exploding label, drop that label automatically. Loses the dimension but preserves the metric.
Aggregate: instead of per-user counters, emit per-cohort counters (10 user-segment buckets, not 10M user_id values).
Move to logs/traces: if the dimension matters per-user, store it as logs or traces (which scale by volume, not by series count) and use exemplars to link.
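A rough sketch of how "cap at ingest" and "drop-the-label" could combine in a gateway. The class name, the per-label threshold, and the in-memory sets are all assumptions for illustration; a production gateway would use approximate counters (for example HyperLogLog) and shared state across replicas.

```python
from collections import defaultdict


class CardinalityGuard:
    """Hypothetical ingest-side guard: drops exploding labels, caps series."""

    def __init__(self, max_series: int, max_values_per_label: int):
        self.max_series = max_series
        self.max_values = max_values_per_label
        self.label_values = defaultdict(set)  # (metric, label) -> values seen
        self.dropped = set()                  # (metric, label) marked exploding
        self.series = defaultdict(set)        # metric -> admitted label sets

    def admit(self, metric: str, labels: dict):
        """Return the label set to store, or None if the write is rejected."""
        kept = {}
        for name, value in labels.items():
            key = (metric, name)
            if key in self.dropped:
                continue  # label was already marked as exploding
            seen = self.label_values[key]
            seen.add(value)
            if len(seen) > self.max_values:
                self.dropped.add(key)  # drop-the-label from now on
                continue
            kept[name] = value

        series_key = frozenset(kept.items())
        known = self.series[metric]
        if series_key not in known and len(known) >= self.max_series:
            return None  # cap at ingest: no budget left for a new series
        known.add(series_key)
        return kept
```

With max_values_per_label=3, the fourth distinct user_id causes the label to be stripped, and every later write to that metric lands on the aggregated series instead of creating a new one.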

Educational layer
The platform must teach engineers cardinality discipline. Pre-deploy linting on metric names + labels. Onboarding docs. Per-team office hours. Detected explosions get a Slack ping with the offending metric and the suggested fix.

The "we'll fix it later" trap
Once a high-cardinality metric is in production, removing the label feels like a breaking change ("but my dashboard uses it"). It accumulates. The platform team's job is to enforce limits at ingest, not at policy-discussion time.

The right mental model
Metrics are aggregations. If you find yourself wanting per-event detail, you want logs or traces, not metrics. Treat metrics as "the questions I ask without specifying the entity"; logs/traces as "the questions I ask about specific entities".

2. Trace sampling: head, tail, and rule-based

Tracing every request at full detail is unaffordable at scale. Sampling is mandatory; the strategy determines which incidents you can debug.

Head-based sampling
Decision is made at the start of the trace, propagated to all child spans. Most commonly: keep N% of traces uniformly.

Pros: cheap (drop spans at the originating service, never collected). Decision is consistent across a trace (no half-traces).
Cons: you keep N% of normal traces and N% of error traces. Errors are rare; sampling them at 1% means most errors have no trace.

Use when: you mostly care about typical request shape, not rare incidents.
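One common way to make the head decision consistent is to derive it from the trace ID, so any service that hashes the same ID reaches the same answer. The sketch below assumes SHA-256 as the hash and uses hypothetical function names; real SDK samplers differ in detail. The second function illustrates the "most errors have no trace" problem numerically.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic keep/drop decision derived from the trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def expected_traced_errors(requests: int, error_rate: float, sample_rate: float) -> float:
    """Expected number of failing requests that still have a trace."""
    return requests * error_rate * sample_rate


# 1M requests with a 0.1% error rate yields about 1,000 errors;
# uniform 1% head sampling leaves roughly 10 of them with a trace.
print(expected_traced_errors(1_000_000, 0.001, 0.01))
```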

Rule-based head sampling
Sample at higher rates for specific cases: 100% of traces with a debug header, 100% from a specific test account, 50% from a specific endpoint.

Pros: targeted depth without exploding cost.
Cons: requires per-rule maintenance; rules drift.

Use when: you have known-important request types worth full visibility.

Tail-based sampling
Wait until the trace completes; decide based on outcome. Keep 100% of error traces, 100% of traces > p99 latency, X% of normal.

Pros: every error has a trace. Slow traces get analyzed. Headroom isn't wasted on uneventful requests.
Cons: must buffer all spans for ~30s before sampling decision. Memory and CPU cost. Hard to do across services without a centralized assembler.

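Use when: you have the infrastructure to buffer + decide. Production tooling supports this, usually in the collection tier rather than in the storage backend: vendors such as Honeycomb (Refinery) and Lightstep, and the OpenTelemetry Collector's tail sampling processor in front of backends like Tempo or Jaeger.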
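A minimal in-memory sketch of the buffer-then-decide loop described above. Everything here is illustrative: the class and field names are invented, a fixed slow_ms threshold stands in for "> p99 latency", and the longest span is assumed to approximate trace duration. A real assembler also has to route all spans of a trace to the same instance.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


class TailSampler:
    """Buffers spans per trace, then decides once the wait window elapses."""

    def __init__(self, wait_s=30.0, slow_ms=1000.0, baseline_rate=0.01,
                 rng=random.random):
        self.wait_s = wait_s
        self.slow_ms = slow_ms
        self.baseline_rate = baseline_rate
        self.rng = rng
        self.buffer = {}      # trace_id -> list of spans
        self.first_seen = {}  # trace_id -> arrival time of first span

    def add(self, span: Span, now: float) -> None:
        self.buffer.setdefault(span.trace_id, []).append(span)
        self.first_seen.setdefault(span.trace_id, now)

    def _keep(self, spans) -> bool:
        if any(s.is_error for s in spans):
            return True   # keep 100% of error traces
        if max(s.duration_ms for s in spans) >= self.slow_ms:
            return True   # keep 100% of slow traces
        return self.rng() < self.baseline_rate  # X% of normal traces

    def flush(self, now: float) -> dict:
        """Return kept traces among those whose wait window has elapsed."""
        ready = [t for t, ts in self.first_seen.items() if now - ts >= self.wait_s]
        kept = {}
        for trace_id in ready:
            spans = self.buffer.pop(trace_id)
            del self.first_seen[trace_id]
            if self._keep(spans):
                kept[trace_id] = spans
        return kept
```

The memory cost named in the cons is visible directly: every span for every in-flight trace sits in the buffer for the full wait window, kept or not.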

Hybrid: head + tail
Head-sample at 100% (or close to it) into a temporary buffer. Tail-sample to decide what to persist long-term. The "100x ingest, 1x storage" model.

Many production systems use this: cheap to collect spans (5-10 GB/sec at scale isn't a lot of network), expensive to store them long-term, so the buffer + decide pattern wins.

Per-service sampling (the "bait and switch")
A service can sample at 1% locally but propagate the decision to children. If the parent decides "keep this trace", every downstream service keeps it too. Without this, you get partial traces (parent kept, child dropped) which are often worse than no trace.

W3C Trace Context propagation, which OpenTelemetry uses by default, carries this decision in the sampled flag of the traceparent header, but every service must respect the flag.
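For concreteness, here is a small sketch of reading and propagating the sampling decision from a W3C traceparent header (format: version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags, with bit 0x01 meaning "sampled"). The helper names are hypothetical; in practice the OTel SDK's propagator does this for you.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)
SAMPLED_FLAG = 0x01


def is_sampled(traceparent: str) -> bool:
    """True if the upstream service marked this trace as sampled."""
    match = TRACEPARENT.match(traceparent.strip())
    if match is None:
        return False  # malformed header: treat as not sampled
    return bool(int(match.group(4), 16) & SAMPLED_FLAG)


def child_traceparent(parent: str, new_span_id: str) -> str:
    """Build the header for a downstream call, preserving trace ID and flags."""
    match = TRACEPARENT.match(parent.strip())
    if match is None:
        raise ValueError("malformed traceparent header")
    version, trace_id, _parent_id, flags = match.groups()
    return f"{version}-{trace_id}-{new_span_id}-{flags}"


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(is_sampled(header))
print(child_traceparent(header, "b7ad6b7169203331"))
```

A service that re-rolls its own sampling decision instead of copying the flags field is exactly what produces the partial traces described above.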

Exemplars: the bridge between sampled trace and unsampled metric
Even with 1% trace sampling, every metric is computed from 100% of requests. The bridge: when you increment a metric, occasionally attach an exemplar (a sampled trace ID + the value).

When an engineer sees a latency spike on a metric, the exemplar lets them jump to a trace that contributed to the spike - even though they didn't predict which traces would matter.

This pattern (Prometheus exemplars + Tempo) has reshaped how teams debug latency regressions. It's the practical answer to "we sample 1% of traces but we want every alert to surface a relevant trace".
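The mechanism is simple enough to sketch: every request updates the histogram, and only requests that happen to be traced leave a trace ID behind on the bucket they landed in. The class below is an illustrative toy, not the Prometheus client API; it keeps per-bucket (non-cumulative) counts and one most-recent exemplar per bucket, both of which are simplifying assumptions.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Exemplar:
    trace_id: str
    value: float


class LatencyHistogram:
    """Toy histogram that remembers one exemplar per bucket."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                # inclusive upper bounds
        self.counts = [0] * (len(self.bounds) + 1)  # last bucket is +Inf
        self.exemplars = {}                         # bucket index -> Exemplar

    def observe(self, value: float, trace_id=None) -> None:
        # bisect_left puts a value equal to a bound into that bound's bucket.
        idx = bisect.bisect_left(self.bounds, value)
        self.counts[idx] += 1  # counted for 100% of requests
        if trace_id is not None:
            # Only sampled requests carry a trace ID; remember the latest one.
            self.exemplars[idx] = Exemplar(trace_id, value)


hist = LatencyHistogram([0.1, 0.5, 1.0])
hist.observe(0.05)                   # untraced request
hist.observe(0.3)                    # untraced request
hist.observe(2.4, trace_id="abc123")  # traced, lands in the +Inf bucket
print(hist.counts, hist.exemplars)
```

When the slow bucket spikes on a dashboard, the exemplar stored on that bucket is the link to open in the trace viewer.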

Cost tuning

Default head-sample rate: 1-10% depending on service.
Tail rules: keep all errors, all >p99 traces, all sampled-by-rule traces.
Exemplar rate: 1% per metric bucket, capped at N exemplars per minute per metric.

Engineers should be able to escalate sampling for a specific service / endpoint during incident or performance work, then drop back. The platform makes this a self-serve knob.

3. Time-series databases: Prometheus, M3, VictoriaMetrics, Mimir, Cortex

The TSDB is the engine of the metrics signal. Choosing wrong locks the platform into the wrong cost curve.

Prometheus (single-node)
The reference implementation. PromQL is the de-facto query language. Single-binary, simple to operate, beloved by SREs.

Pros: best PromQL ergonomics. Strong community. Great pull-based scrape model.
Cons: a single node bounds capacity (practical ceiling around ~10M active series). No built-in durable long-term storage, replication, or clustering; HA means running duplicate instances.

Use when: <100 services, single team, <30-day retention. Outgrown by most companies past Series B.

M3 (Uber)
Sharded, replicated TSDB built on M3DB. Scales to billions of series. Used by Uber, Chick-fil-A, others.

Pros: proven at scale. Tiered storage. Multi-DC replication.
Cons: operationally heavy (Cassandra-class complexity). Smaller community than alternatives.

Use when: massive scale + dedicated platform team with distributed-systems chops.

VictoriaMetrics
Drop-in Prometheus replacement with much better resource usage. Both open source and commercial editions.

Pros: 5-10x lower memory than Prometheus at the same load. Single-node and clustered modes. Strong write throughput.
Cons: Smaller ecosystem than Prometheus + Thanos.

Use when: cost-driven, want simple ops, mid-scale (10s of millions of series).

Mimir / Cortex (Grafana ecosystem)
Horizontally scalable Prometheus-compatible TSDB. Mimir is Grafana Labs' fork of Cortex and the more actively developed of the two.

Pros: cloud-native architecture (object storage + ingester + query). Multi-tenant native. Active development.
Cons: many moving parts (ingester, distributor, store-gateway, compactor, querier). Object-storage-backed query latency.

Use when: large multi-tenant platform, heavy use of Grafana stack, comfortable with Kubernetes.

Thanos
Adds long-term storage + global query to vanilla Prometheus. Sidecar pattern.

Pros: incremental adoption from existing Prometheus.
Cons: query performance degrades at very high cardinality. Operational complexity grows.

Use when: existing Prometheus footprint, need to extend retention without re-platforming.

Cloud-native options

Amazon Managed Service for Prometheus: hosted, Cortex-based, pay-per-sample. Convenient, expensive at scale.
Google Cloud Managed Service for Prometheus: a comparable managed offering. Tight integration with GKE.
Datadog / New Relic: full SaaS, includes other signals. Easy to start, expensive to keep.

The decision framework

Below 10M series: Prometheus + remote storage.
10M-100M series: VictoriaMetrics or Mimir.
100M+ series: M3, Mimir cluster, or managed offering.
Cost-driven: VictoriaMetrics wins on $/series.
Ops-burden-minimized: cloud-managed wins.

The migration cost
Switching TSDBs is months of work + dual-write transition + dashboard re-validation. Choose for 3-year scale, not 6-month scale.

4. Logs at scale: indexed vs grep-on-object-storage

Logs are the highest-volume signal. At 100 TB/day, the cost difference between architectures is millions of dollars per year.

The traditional model: Elasticsearch
Index every field at write time. Subsecond query on any field. The Splunk / ELK model.

Pros: fast ad-hoc query. Full-text search.
Cons: index size = 1-3x the raw data. Memory-hungry (heap + page cache). Operational complexity: shard rebalancing, hot/cold tiers, frequent OOMs.

Cost: ~$1-3 per GB ingested at scale. 100 TB/day = $100K-$300K/day = $36M-$110M/year. Untenable.

The new model: Grep-on-object-storage (Loki, VictoriaLogs)
Index only a small set of labels (metadata), not the log body. Store logs compressed in object storage. Query by filtering on that metadata + brute-force grep on the selected chunks.

Pros: ~10-100x cheaper storage (object vs SSD + index). Simpler ops.
Cons: queries can be slow (10s-minutes for broad searches). Requires good metadata to narrow the search space.

The trade: cheap to store, slower to search. Works well when most logs are written and never read; the rare query pays for the slowness.

The hybrid: Vector + S3 + Athena / Trino
Pipeline logs through a router (Vector, Fluent Bit) to S3 in Parquet format. Query with Athena or Trino.

Pros: cheap storage. SQL queries.
Cons: query latency 10s of seconds. Schema migrations are work.

Indexing selectively
Index high-value fields (request_id, user_id, error_code) for fast lookup. Leave the body unindexed. Query: "find request_id=ABC" is fast (uses index) → "show me the body" is fast (the index points to the location).

ClickHouse is increasingly popular for this: columnar storage with selective indexes, fast on both filter + aggregate.

Sampling logs
Often controversial. Two perspectives:

"Logs are the source of truth; never sample." - misses cost reality.
"Sample like you sample traces." - what most cost-disciplined teams do.

Pragmatic: drop ~90% of debug-level logs (you rarely need them); keep 100% of warn/error. Keep all logs from canary deploys. Sample less aggressively for the most recent N days; sample more aggressively for older data.
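A sketch of what that policy might look like in a log router. The field names (level, canary, request_id) and the 100% keep rate for info are assumptions, not something the policy above specifies. Hashing the request ID rather than rolling a die per line keeps or drops all debug lines of one request together.

```python
import hashlib

# Keep rates per level. debug=0.10 and warn/error=1.0 follow the text;
# info=1.0 is an assumption made for this sketch.
KEEP_RATE = {"debug": 0.10, "info": 1.0, "warn": 1.0, "error": 1.0}


def keep_log(record: dict) -> bool:
    """Decide whether a structured log record is forwarded to storage."""
    if record.get("canary"):
        return True  # keep everything from canary deploys
    rate = KEEP_RATE.get(record.get("level", "info"), 1.0)
    if rate >= 1.0:
        return True
    # Hash a stable key so one request's lines share the same fate.
    key = str(record.get("request_id", record.get("message", "")))
    digest = hashlib.sha256(key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


print(keep_log({"level": "error", "message": "payment failed"}))
print(keep_log({"level": "debug", "canary": True, "message": "cache miss"}))
```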

Structured vs unstructured
JSON logs win. Unstructured logs require parsing at query time, lose information, and are slower to filter.

The hardest battle: convincing teams to change their logging libraries. Tools (logfmt, structlog, Zerolog, slog) make it easy enough that there's no excuse. Platform team enforces via CI lint.

Retention tiers

Hot (7-30 days, indexed): SSD-backed. Fast query.
Warm (30-90 days, object storage with selective indexes): seconds-to-minutes query.
Cold (1+ year, Glacier-class): hours to retrieve. For compliance only.

Lifecycle policy moves data automatically. Engineers must understand which tier they're querying - the platform UI shows it explicitly.

5. OpenTelemetry: why it matters and the migration path

OpenTelemetry (OTel) is the CNCF project that has become the de facto standard for instrumentation. Most observability platforms already accept OTLP, and the trend is toward OTel-native everywhere; resisting is a rear-guard action.

What OTel provides

A vendor-neutral instrumentation API (metrics, logs, traces).
A vendor-neutral wire protocol (OTLP).
A vendor-neutral collector (the OTel Collector) that receives, transforms, and exports to any backend.

The "before" world
Each observability vendor had its own SDK. Switching from Datadog to New Relic meant re-instrumenting every service. Vendor lock-in was structural.

The "after" world
Instrument once with OTel. Send to whatever backend you want. Switch backends with a config change.

The collector pattern
The OTel Collector is a stateless agent (or gateway) that:

Accepts OTLP from instrumented apps.
Applies transformations: drop labels, sample, batch, enrich.
Exports to any backend (Prometheus remote_write, Tempo, Loki, Datadog API, etc.).

This decouples the application from the backend. Migrate from one TSDB to another? Update the collector exporter; no app changes.

Auto-instrumentation
For Java, .NET, Python, Node.js, OTel offers auto-instrumentation - attach a Java agent / module loader, get HTTP / DB / messaging traces with zero code changes. Coverage is increasingly broad.

Migration strategy

Deploy OTel Collector alongside existing agents (Datadog agent, Prometheus scrape, etc.). Send same data to both.
Convert services to OTel SDK incrementally. Old services keep their existing instrumentation; new services start with OTel.
After a year of dual-write, retire the old agent.

The migration is ~12-24 months for a large org. Front-loading the cost gives back vendor flexibility for the next decade.

OTel for logs
The last of the three signals to stabilize in OTel. The spec is stable but tooling is less mature than for metrics/traces. Most teams continue with their existing log shipper (Vector, Fluent Bit) and integrate via OTel later.

The collector as a control plane
A mature OTel Collector deployment becomes the platform's control plane:

Cardinality limits enforced at the collector.
Sampling rules configured centrally.
Routing rules per signal type.
PII redaction.

This consolidates policy enforcement that previously lived in dozens of agent configs.

Performance considerations
As rough rules of thumb: OTel SDKs add 1-5% CPU overhead at typical instrumentation volume, and a collector hop adds a small delay (on the order of 0.5-1%) to the telemetry path. Both are small relative to the operational benefit.

The interview signal
Candidates who default to "we'd use Datadog" without considering OTel rate lower than candidates who explain when each makes sense. OTel + self-hosted backend is the standard playbook for cost-driven orgs at scale; managed (Datadog / Honeycomb / Lightstep) is faster to ramp.

6. Alerting: dedup, routing, and the on-call's pager

An alerting system that pages too often is ignored. An alerting system that misses incidents is useless. The math of alert design is harder than it looks.

The basic loop
Evaluate rules on a schedule (every 30s for critical, every 5m for non-critical). Each rule produces a series of "firing/not-firing" decisions. Transitions to firing trigger notifications.

Deduplication
A flapping condition (firing → resolved → firing → resolved) shouldn't page on every transition. Standard pattern:

Sustain: the condition must hold for N intervals before alerting (e.g., 3 minutes).
Hysteresis: the resolved threshold is lower than the firing threshold.
Group by: similar alerts batched into one notification.

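In the Prometheus ecosystem these are split across components: the alert rule's "for" clause provides sustain, hysteresis is expressed in the rule itself (or approximated with keep_firing_for), and Alertmanager handles grouping via group_by. Most engines copy the model.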
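Sustain and hysteresis together amount to a small state machine per alert. The sketch below is illustrative only (the class and parameter names are invented, and it is not how Prometheus implements rule evaluation); it assumes a "higher is worse" metric such as error rate.

```python
from typing import Optional


class AlertState:
    """Fires after `sustain` consecutive breaches; resolves below a lower bar."""

    def __init__(self, fire_above: float, resolve_below: float, sustain: int):
        if resolve_below > fire_above:
            raise ValueError("resolve threshold must not exceed fire threshold")
        self.fire_above = fire_above
        self.resolve_below = resolve_below
        self.sustain = sustain
        self.firing = False
        self.breaches = 0  # consecutive evaluations above fire_above

    def evaluate(self, value: float) -> Optional[str]:
        """Feed one evaluation; return 'fire' or 'resolve' on a transition."""
        if not self.firing:
            self.breaches = self.breaches + 1 if value > self.fire_above else 0
            if self.breaches >= self.sustain:
                self.firing = True
                return "fire"
        elif value < self.resolve_below:
            # Hysteresis: stay firing until the value is clearly healthy.
            self.firing = False
            self.breaches = 0
            return "resolve"
        return None


alert = AlertState(fire_above=0.05, resolve_below=0.02, sustain=3)
samples = [0.06, 0.01, 0.06, 0.06, 0.06, 0.04, 0.03, 0.06, 0.01]
print([alert.evaluate(v) for v in samples])
```

In the sample run, the single early breach never pages, and the dip to 0.04 and 0.03 does not resolve the alert because it has not yet dropped below the lower threshold, so a flapping value produces one page instead of several.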

Routing
Alerts route to teams based on labels (service, team, severity). Critical alerts page; warnings go to a Slack channel; info goes to a dashboard.

The routing tree must be code-reviewed. Teams should not be able to silently route their pages to /dev/null.

Silences
During a planned maintenance, alerts for the affected service get silenced. Silences must:

Have a TTL (auto-expire).
Require justification.
Be auditable.

The "silenced for 30 days because it's noisy" pattern is the tell of a degrading alerting culture.

Alert types

Threshold: simple "metric > X". Brittle - thresholds drift.
Anomaly: ML-based or statistical (3 std-dev, EWMA). Smarter but harder to debug.
Multi-condition: "p99 > 1s AND error rate > 0.1%". Reduces false positives.
SLO burn rate: "we've burned 10% of our error budget in the last hour". Aligns with SRE practice; reduces false positives sharply.

SLO burn-rate alerting is the modern best practice. Ties alerts to user-perceived SLOs, not arbitrary thresholds.
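The burn-rate arithmetic is worth being able to do on a whiteboard. A minimal sketch, assuming a 30-day SLO period and using hypothetical function names: burn rate is the observed error ratio divided by the allowed error ratio, and the paging threshold follows from how much budget you are willing to lose in a window.

```python
PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = PERIOD_HOURS) -> float:
    """Fraction of the period's error budget burned during the window."""
    return rate * window_hours / period_hours


def threshold_for(budget_fraction: float, window_hours: float,
                  period_hours: float = PERIOD_HOURS) -> float:
    """Burn rate at which `budget_fraction` is consumed within the window."""
    return budget_fraction * period_hours / window_hours


def should_page(error_ratio: float, slo_target: float,
                budget_fraction: float = 0.10, window_hours: float = 1.0) -> bool:
    return burn_rate(error_ratio, slo_target) >= threshold_for(
        budget_fraction, window_hours
    )


# "10% of the error budget in the last hour" on a 30-day period
# corresponds to a burn rate of about 72.
print(threshold_for(0.10, 1.0))
print(should_page(error_ratio=0.10, slo_target=0.999))
```

For a 99.9% SLO, a burn rate of 72 means a sustained error ratio of about 7.2%, far above what a static "error rate > 0.1%" threshold would distinguish from a brief blip.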

Page volume targets

Critical pages should average <1 per on-call shift. >5/shift = burnout.
Warning channel: should be skimmed, not acted on. <50/day per service.

If alerts exceed these, the alerting story is broken. Prune ruthlessly.

The alert lifecycle
Each alert should have:

Runbook link (what to do when paged).
Owner (who owns this alert).
Last firing date (stale alerts that never fire are candidates for deletion).
Last edit date (alerts that haven't been touched in years drift from current behavior).

Auditing alerts quarterly is platform team work. Automation helps, but human review time is the bottleneck.

Self-monitoring of the alerting system itself
The alerting engine must have its own monitoring (separately deployed). When alerting goes silent for 5 minutes, that's the worst incident - not just service down, but no one knows.
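One common way to detect a silent alerting engine is a dead man's switch: the main engine sends a heartbeat on every evaluation cycle, and a separately deployed checker pages when the heartbeat stops. The sketch below is a hypothetical checker; the class name and the 300-second default (matching the 5 minutes above) are assumptions.

```python
class DeadMansSwitch:
    """Runs outside the main platform; trips when heartbeats stop arriving."""

    def __init__(self, started_at: float, max_silence_s: float = 300.0):
        self.max_silence_s = max_silence_s
        # Treat start-up as the first heartbeat so the switch does not trip
        # before the engine has had a chance to report in.
        self.last_heartbeat = started_at

    def heartbeat(self, now: float) -> None:
        """Called whenever the always-firing heartbeat alert is received."""
        self.last_heartbeat = now

    def is_alerting_dead(self, now: float) -> bool:
        """True if no heartbeat arrived within the allowed silence."""
        return now - self.last_heartbeat > self.max_silence_s


switch = DeadMansSwitch(started_at=0.0)
switch.heartbeat(now=60.0)
print(switch.is_alerting_dead(now=300.0))  # 240s of silence
print(switch.is_alerting_dead(now=361.0))  # 301s of silence
```

The checker's own page must go out through a path that does not depend on the main alerting engine, otherwise the switch fails together with the thing it watches.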

Trade-offs

Self-hosted vs managed (Datadog / Honeycomb / Lightstep)
Managed wins for fast time-to-market and below ~1000 hosts; self-hosted wins for cost (5-10x cheaper at scale) and customization. Most companies start managed and migrate signals one at a time as cost grows.

Indexed logs vs grep-on-object-storage
Indexed (Elasticsearch/Splunk) is fast to query, expensive to store. Grep (Loki, VictoriaLogs) is cheap to store, slower to query. At 100 TB/day, grep is the cost-driven default; pay extra for indexing only on high-value fields.

Head vs tail vs rule-based trace sampling
Head: cheap, dumb. Tail: smart, expensive. Rule-based: focused. Best is hybrid: head-sample to a buffer, tail-decide what to keep. Exemplars bridge sampled traces to unsampled metrics.

Cardinality cap at ingest vs aggregate-and-let-through
Cap is brutal but predictable. Aggregate (replace user_id with user_segment) loses dimensions but preserves the metric. Both are needed; the platform team enforces caps + provides aggregation as the "right" pattern.

TSDB choice (Prometheus / VictoriaMetrics / Mimir / M3)
Below 10M series: Prometheus + remote storage. 10M-100M: VictoriaMetrics or Mimir. 100M+: M3 or managed. Switching is months of work.

OpenTelemetry vs vendor SDK
OTel for vendor neutrality and long-term flexibility; vendor SDK for fastest onboarding. New instrumentation should default OTel; legacy migrations happen incrementally over 1-2 years.

SLO burn-rate alerts vs threshold alerts
SLO burn-rate is more accurate and reduces noise; threshold is simpler to author. Modern teams default to burn-rate for top-level alerts and thresholds for backstop conditions.

Centralized vs per-team alerting
Centralized standardizes runbooks, routing, naming - critical at scale. Per-team is faster to start and lets teams own their own destiny. Most large orgs land on centralized infrastructure with per-team rule ownership.

Observability of observability
Required: a separate, smaller observability stack monitors the main one. Otherwise the platform's own incidents are invisible. The cost is real; the alternative is unacceptable.

Common follow-up questions

Be ready for at least three of these. The first one is almost always asked.

?How would you cut your $5M/year observability bill in half?
?What's your strategy when a single metric blows past a billion time series?
?How do you sample traces in a way that catches every interesting incident?
?How would you migrate from Datadog to a self-hosted stack without losing dashboards?
?What's your story for incident-time queries when the platform itself is degraded?
?How do you handle a sudden 10x log-volume spike from a misconfigured service?
?How would you support compliance retention (1+ year) without bankrupting the team?
?What's your alerting strategy for a service with no SLOs defined?
