Design an Analytics Pipeline (Kafka / Spark / Warehouse)
Batch vs streaming, lambda vs kappa, the warehouse-vs-lakehouse decision, and dimension modeling that survives schema drift.
The problem
Design a pipeline that ingests every user-facing event the product produces, makes the freshest data available for real-time dashboards within seconds, makes the cleanest data available for analyst queries within hours, and never loses an event. Schemas evolve weekly; new event types appear; old ones get repurposed.
This is the systems analog of "design Netflix data infra." The interview signal is whether the candidate can defend lambda vs kappa, separate ingestion from transformation, and explain why the warehouse and the streaming store are not the same database. Strong candidates draw the timeline of a single event from emission to dashboard and identify where it can break.
Clarifying questions
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
- What's the event volume - 10K, 1M, or 100M events/sec at peak?
- What latency tiers - real-time (seconds), near-real-time (minutes), batch (hours)?
- What downstream consumers - dashboards, ML training, ad-hoc SQL, business reporting, third-party export?
- Schema strictness - schema-on-write (enforced at ingest) or schema-on-read (raw payloads, parsed at query time)?
- Privacy requirements - PII redaction, GDPR right-to-delete propagating through the pipeline?
- Multi-tenant - one platform serving many product teams with isolation?
- Cost ceiling - is the pipeline expected to run at $1M/year or $50M/year of cloud spend?
- Reprocessing - how often do we need to rerun a transform on historical data after a bug fix?
Requirements
Functional requirements
- Ingest events from web, mobile, backend services, third-party webhooks
- Real-time path: events queryable on a streaming dashboard within ~10 seconds
- Batch path: events available in the warehouse for SQL within ~1 hour
- Schema registry with versioning and backward-compatibility rules
- Reprocess historical data on demand (replay last 30 days through a fixed transform)
- Per-event lineage - track which dataset / dashboard a given source event flows into
- GDPR delete - propagate user_id deletes through warm and cold storage within SLA
Non-functional requirements
- Scale
- 10M events/sec at peak. 1 PB ingested/day. 50 PB warehouse. 5,000 active analysts running queries. 200 production dashboards refreshed every 1-60 seconds.
- Latency
- Real-time path p99 < 10s end-to-end (event emission → dashboard). Batch path p95 < 60 minutes (event → warehouse table). Analyst queries p95 < 30s on warm partitions.
- Availability
- 99.9% for ingestion (events must not be lost). 99% for batch (a delayed batch is recoverable). Real-time dashboards can degrade to last-known-good during outages.
- Consistency
- At-least-once ingestion with idempotency at the consumer (event_id dedup). Streaming aggregates are eventually consistent and may be revised when late events arrive. Warehouse tables are strongly consistent within a partition once batch finishes.
Capacity estimation
Storage
- Raw events: 10M/sec × 1KB × 86400s = 864 TB/day. Compressed: ~300 TB/day. One year of retention ≈ 110 PB compressed (~315 PB uncompressed).
- Warehouse (modeled, partitioned, columnar): ~30% of the compressed archive after column pruning = ~30 PB.
- Streaming state (Flink): ~1 TB hot working set per stage; 10 stages = 10 TB across the cluster.
Throughput
- Kafka ingestion: 10M events/sec across ~500 brokers (~20K events/sec per broker on average, leaving headroom for spikes and rebalancing).
- Stream processing: Flink at 10M events/sec across ~200 task managers, average per-event CPU ~10us.
- Batch jobs: Spark on Iceberg / Delta Lake; 1000 executors at peak; full daily refresh of core marts in ~3 hours.
Network
- Inter-AZ Kafka replication: 3x in-region traffic = 30 GB/s steady before compression. One of the biggest lines on the AWS bill.
- Cross-region replication for DR: 1x raw rate = 10 GB/s. Compressed and amortized.
Cost shape
- Storage: warehouse columnar at S3 prices = ~$25K/PB/month → ~$750K/month for warm.
- Compute: Spark + Flink + dashboard query layer ~$400K/month.
- Network: ~$200K/month inter-AZ + cross-region.
Cardinality budget
- Real-time dashboards: ~10K dimension combinations per dashboard. Beyond that, pre-aggregate.
High-level architecture
The pipeline has three layers: collection, processing, serving.
Collection: SDKs and backend services emit events to a thin ingest API; the API writes to Kafka. Kafka is the durable boundary - once an event is in Kafka, the pipeline guarantees delivery; before Kafka, it's the producer's problem.
Processing: two parallel paths from Kafka. The streaming path (Flink / Spark Structured Streaming) computes near-real-time aggregates and writes them to a fast serving store (Pinot / Druid / ClickHouse). The batch path lands raw events in object storage (S3) as Parquet, then Spark / dbt builds modeled tables in a warehouse (Snowflake / BigQuery / Iceberg lakehouse).
Serving: dashboards read from the streaming store for fresh metrics and from the warehouse for historical / modeled queries. A semantic layer routes queries to the right backend.
The lambda architecture (separate batch + streaming with the same logic implemented twice) is a maintenance nightmare. The kappa architecture (everything is streaming, batch is just a long replay) is conceptually clean but operationally hard. Production systems pick one with caveats.
Ingest API
Stateless HTTP / gRPC endpoint. Validates schema, enriches with server-side metadata (timestamp, ip, geo), writes to Kafka. The thinnest possible layer - business logic lives downstream.
Kafka (event log)
Durable, partitioned, replicated. The contract boundary between producers and the rest of the pipeline. Topic per event family or per source. Retention 7-30 days hot.
Schema registry
Holds versioned Avro / Protobuf schemas. Producers register; consumers fetch on read. Enforces backward compatibility on schema evolution.
Stream processor (Flink / Spark Streaming)
Real-time path. Reads from Kafka, computes windowed aggregates, joins, sessionizes. Writes to the serving store and / or back to Kafka for downstream consumers.
Serving store (Pinot / Druid / ClickHouse)
Optimized for sub-second analytical queries on recent data. Columnar, distributed. Powers real-time dashboards.
Object storage (S3)
Lands raw events as partitioned Parquet. Source of truth for batch reprocessing. Cheap at scale.
Lakehouse table format (Iceberg / Delta / Hudi)
Adds ACID, schema evolution, and time-travel on top of S3 Parquet. Lets multiple engines (Spark, Trino, Snowflake) read consistently.
Batch processor (Spark / dbt)
Builds modeled tables - facts, dimensions, marts. Runs on schedule (hourly / daily). Materializes denormalized views the warehouse layer can serve fast.
Warehouse query engine (Snowflake / BigQuery / Trino)
Serves analyst SQL. Scales compute on demand. Caches results. Per-query cost shows up in your bill - encourage modeled tables, not raw scans.
Semantic layer / metric store
Defines metrics once, in code. Routes dashboard queries to streaming store or warehouse based on freshness need. Prevents 10 dashboards from reimplementing the same metric inconsistently.
Lineage / observability
Tracks which raw events feed which datasets and dashboards. Detects schema drift, freshness violations, and broken upstream contracts before downstream users notice.
Deep dives
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
1. Lambda vs kappa: pick one and own the trade-off
Two architectural philosophies. Most companies start lambda, regret it, and migrate toward kappa.
Lambda architecture
Two parallel pipelines - batch and streaming - implementing the same business logic. Streaming gives fresh-but-approximate answers; batch gives delayed-but-correct answers. A serving layer reconciles.
Pros: each path uses the right tool. Streaming engines are simple. Batch engines handle reprocessing well.
Cons: the same logic is implemented twice (Java/Scala for Flink, SQL/Python for Spark). Drift is inevitable. Bug fixes go in twice. Two on-call rotations.
Kappa architecture
Streaming is the only paradigm. Reprocessing is just "rewind Kafka to the start of the window and replay through the same job". One codebase.
Pros: no duplication. Reprocess by replaying. Mental model is simple.
Cons: streaming engines historically struggled with massive backfills (months of replay). Flink and Spark Structured Streaming have closed most of this gap. State stores at petabyte scale are still hard.
The pragmatic answer
For new systems: kappa-leaning. Use Flink or Spark Structured Streaming as the only processor. Keep raw events in Kafka long enough for the realistic backfill window (30 days is usually enough). Use object storage as the long-term archive that you reload into Kafka for >30-day backfills.
For existing lambda systems: don't rewrite for ideology. Migrate one pipeline at a time when it's already due for rewrite. Use the semantic layer to hide which path served a query.
The "streaming is just batch with a small window" view
True at the math level, false at the engineering level. Batch jobs read whole partitions, can shuffle freely, can fail and restart. Streaming jobs hold state continuously, can't restart cheaply, must handle late data and watermarks. The engineering models are different even if the math is similar.
Watermarks and late events
Streaming aggregates have to decide "when is this minute's count finalized?" Watermarks are the answer: "we believe no events older than time T will arrive". Events arriving after the watermark are late - dropped, or routed to a side output for separate handling.
Late events break naive aggregations: a "users active last hour" count published at minute 60 must allow events from minute 59 to arrive at minute 65 and revise the count. The serving store must support upserts.
2. Warehouse vs lakehouse vs OLAP store
Three serving layers, three different jobs. Conflating them is the most common architectural mistake.
Cloud data warehouse (Snowflake, BigQuery, Redshift)
Optimized for analyst SQL on modeled data. Columnar, MPP. Strong concurrency for many users running queries. Compute and storage are decoupled - scale independently.
Strengths: full ANSI SQL (window functions, CTEs), great UX. Auto-scaling for spiky workloads. Strong governance.
Weaknesses: query latency is seconds-to-minutes - not for live dashboards. Data must be loaded; not great for raw event volumes at petabyte scale.
Lakehouse (Databricks, Iceberg / Delta / Hudi on S3)
Open table formats on object storage. ACID transactions. Schema evolution. Time travel. Multiple engines (Spark, Trino, Snowflake external tables, Athena) read the same data.
Strengths: cheap at petabyte scale ($25/TB/month). No vendor lock-in. ML training reads the same tables as analysts.
Weaknesses: more operational complexity. Concurrency limits when many engines write the same table.
Real-time OLAP store (Pinot, Druid, ClickHouse)
Optimized for sub-second queries on recent, high-cardinality data. Specialized for dashboards where freshness matters more than completeness.
Strengths: ~100ms p99 query latency. Streaming ingestion native. Multi-billion-row tables interactive.
Weaknesses: limited SQL. Schema is rigid. Storage cost higher than the warehouse.
The right choice depends on the query
Real-time dashboard ("active users last 5 minutes") → Pinot.
Analyst "weekly active users by country, last quarter" → warehouse.
ML training "all events last year for feature engineering" → lakehouse.
Reverse ETL ("send this segment to Salesforce") → warehouse.
Production architectures use all three
A unified semantic layer routes queries based on freshness and complexity. The warehouse and lakehouse share table formats (Iceberg) so the same data isn't copied twice. The OLAP store is the freshest tier but holds only days of data.
3. Dimension modeling and the schema-evolution problem
Raw events are ugly. Dimension modeling turns them into something analysts can query without specialized knowledge.
Star schema basics
Facts: numeric, additive measurements (revenue, click count, latency). Deep, relatively narrow tables - billions of rows of keys plus measures.
Dimensions: descriptive context (user, product, time, geography). Wide but shallow, and slowly changing.
Each fact references dimensions by surrogate key. Queries join facts to dimensions on those keys; hot paths get denormalized into marts. Storage is larger than 3NF, but query performance and analyst comprehension are massively better.
Slowly changing dimensions (SCD)
A user's country changes. How do we record it?
SCD Type 1: overwrite. Simplest. Loses history.
SCD Type 2: insert a new dimension row with effective_from / effective_to. Preserves history. Joins on facts use the dim row valid at the fact's timestamp.
SCD Type 6: hybrid of Types 1, 2, and 3 - versioned rows plus a current-value column overwritten in place.
For analytics, Type 2 is the right answer. Type 1 silently rewrites history; reports change retroactively when nobody expects it.
Schema evolution at scale
Events are emitted by hundreds of services across hundreds of teams. Schemas change constantly.
Rules of evolution:
- Adding a new field: backward-compatible (old consumers ignore). Allowed.
- Removing a field: breaks consumers. Soft-delete instead (mark deprecated for 90 days, then remove).
- Renaming a field: not allowed - add new, deprecate old.
- Changing a type: not allowed. Make a new field.
The schema registry enforces these rules at producer registration time. Producers can't push an incompatible schema.
Versioning strategy
Each event has a schema_version field. Consumers read with the schema version specified in the message. Old data remains queryable with old schemas; new data uses new.
Late-arriving dimensions
A fact event references a user that doesn't yet exist in the user dimension table (signup row hasn't synced yet). Two strategies:
- Drop the fact (loss).
- Insert a placeholder dim row (surrogate key plus natural key, attributes null); update it when the real row arrives.
Placeholder strategy is standard. The fact is preserved; the dim row is filled in eventually. SQL joins on surrogate keys keep working.
Naming conventions
fact_<event_type> (fact_orders, fact_pageviews). dim_<entity> (dim_users, dim_products). agg_<grain> (agg_daily_revenue). Strict naming pays off massively at scale.
One Big Table (OBT) vs star
Modern columnar engines (Snowflake, BigQuery) compress denormalized OBTs almost as well as stars and avoid joins. For warehouses that compute fast joins anyway, OBT is often the right answer at petabyte scale. Argue for it explicitly with cost numbers.
4. Reprocessing and backfill
A bug in the transform layer mis-attributed two weeks of revenue. Recompute everything correctly, without breaking live dashboards.
The reprocess scenarios
- Bug fix: rerun a transform on historical data with corrected logic.
- New derived dataset: build a new mart from existing raw events.
- Schema migration: rebuild a table after a structural change.
- Disaster recovery: warehouse data corrupted; rebuild from raw.
The backfill timeline
- Hours to days: rerun batch jobs against object storage. Spark scales out; cost is linear in data scanned.
- Weeks to months: rerun batch jobs against cold storage. Slower; data may need to be rehydrated to the hot tier first.
- Years: rare. Plan for it but don't optimize for it.
Idempotent jobs
Reprocess relies on idempotency: running the same job twice produces the same output. Achieved by:
- Deterministic logic (no current_time(), no rand()).
- Output partitioned by event_time, not processing_time. Replay overwrites the same partition.
- Upsert semantics on output tables (Iceberg/Delta merge_into).
Replay through Kafka (kappa-style)
Start a parallel job consuming Kafka from offset T0. The job writes to a parallel output table. Once it catches up to current, atomically swap - rename the table or repoint the view.
Cost: the replay job runs alongside production for hours/days. Plan capacity headroom.
Replay from object storage (lambda-style)
Run a batch job against the raw event archive in S3. Reads partitioned Parquet for the affected window. Writes to the corrected output table.
Cost: per-byte S3 read + Spark compute. At petabyte scale, a single full replay can cost $50K-$200K.
Coordinating with downstream consumers
Dashboards backed by the corrected table see updated numbers. Users notice the change. Communicate explicitly: "Revenue dashboard restated for Mar 1-14. Reason: bug in attribution logic. Old numbers are wrong."
This is a comms problem, not a tech problem - but the pipeline must support it.
Streaming reprocess
The kappa promise: reprocess is just replay. In practice:
- Streaming state must be rebuilt - many minutes to hours for stateful jobs (sessionization, deduplication).
- Output writes must be idempotent.
- Throttle replay to avoid overwhelming downstream consumers.
Modern stream processors (Flink savepoints, Spark Structured Streaming checkpoints) make this practical at TB scale, harder at PB.
5. Data quality, freshness, and lineage
An analytics pipeline that produces wrong numbers fast is worse than one that produces no numbers. Quality is part of the architecture, not a layer on top.
Data quality dimensions
- Freshness: how old is the latest data point?
- Completeness: did all expected events land?
- Validity: do values fit expected ranges?
- Uniqueness: are duplicates within tolerance?
- Consistency: do related datasets agree?
Each dimension needs automated checks at the right pipeline stage.
The check ladder
- Schema check at ingest (rejects malformed events).
- Volume check on Kafka topics (alarm if events/sec deviates from forecast).
- Freshness check on output tables (alarm if last write is older than SLA).
- Statistical checks on derived metrics (alarm on anomalies vs prior week).
- Cross-dataset checks (revenue in fact_orders sums to revenue in agg_daily_revenue).
Freshness SLAs
Each output table has a documented SLA (e.g., "fact_orders is current to within 30 minutes 99% of the time"). The lineage system tracks SLA per table; on-call alerts when violated.
Lineage
Track every transform: input tables → output tables. Tools: OpenLineage, Marquez, Datahub.
Use cases:
- Impact analysis: schema change to source X breaks which downstream tables?
- Root cause: dashboard Y is stale - which upstream pipeline failed?
- Compliance: GDPR delete on user_id - propagate to which tables?
Privacy and the GDPR delete problem
A user requests deletion. The user's row in the warehouse must be deleted (or anonymized). But the user's events are baked into 50 derived tables.
Options:
- Soft delete: mark user as deleted; transforms exclude them. Cheap, easy, common.
- Hard delete: rewrite affected partitions. Iceberg / Delta support this; expensive on petabyte tables.
- Crypto-shredding: encrypt PII per-user; on delete, throw away the key. Records remain but are unreadable.
Most companies use soft deletes for ongoing pipelines and crypto-shredding for the archive. Few rewrite partitions regularly.
SLAs as code
Freshness, completeness, and quality SLAs go in a spec file checked into the same repo as the transform code. CI runs them daily. Violations create tickets automatically. The on-call engineer doesn't get to decide what's "good enough" ad hoc.
6. Cost: where the money goes and how to cut it
An analytics pipeline at this scale spends $1-10M/month on cloud bills. The biggest line items have outsized leverage.
The cost breakdown (typical)
- Storage: 30% (warehouse + S3 + serving store)
- Compute: 40% (Spark, Flink, query engines)
- Network: 20% (inter-AZ, cross-region, egress)
- Managed-service margins: 10% (Snowflake / BigQuery markup, Databricks platform fee)
Storage levers
- Partitioning: query only the partitions you need. Bad partitioning (no key, or by ingestion time) makes every query a full scan.
- Compression: Parquet + ZSTD vs Snappy vs LZ4. ZSTD wins ~20% over Snappy with modest CPU cost.
- Tiering: warm (last 90 days) on standard storage; cold on Glacier/Infrequent Access. 50% storage cost cut for old data.
- Retention policy: don't keep raw events forever. Most events stop earning their keep after ~90 days; only a small tail justifies a full year.
Compute levers
- Right-size jobs: over-provisioning can be rational - 100 executors where 10 would do at 5x the runtime costs roughly 2x the executor-hours but finishes faster, shortening the tail and cutting spot-interruption risk.
- Spot / preemptible instances for batch: 70% discount.
- Caching: Snowflake / BigQuery cache query results - reuse them. Materialize hot queries as tables.
- Query-cost attribution: every query has an owner. Public dashboards of "top 100 most expensive queries this week" change behavior.
Network levers
- Single-AZ where possible: inter-AZ Kafka traffic dominates. Co-locate producers and brokers, and let consumers fetch from follower replicas in their own AZ (KIP-392).
- Compression on the wire: gzip / lz4 inter-region.
- Egress avoidance: pull, don't push, when working across clouds.
The chargeback model
Every team that produces or consumes data has a budget. Cost is allocated by source (events emitted), by transform (compute), and by query (BQ slot-seconds, Snowflake credits). Without chargeback, every team optimizes their own thing and the central platform cost grows unbounded.
The death spiral
Pattern: a new dashboard is built on raw events; a few hundred users use it; query costs explode; the platform team fire-fights; underlying issue (no modeled table) is never fixed. Solution: every dashboard must source from a modeled table, not raw. Enforced in the semantic layer.
Trade-offs
Lambda vs kappa
Lambda lets you use the right tool but costs you the same logic implemented twice. Kappa is conceptually clean but operationally hard at extreme scale. Default to kappa-leaning for new systems.
Schema-on-read vs schema-on-write
Schema-on-read (raw JSON, parsed at query time) is flexible but produces dirty queries and slow performance. Schema-on-write (Avro/Protobuf with a registry) is strict but gives strong contracts and fast queries. Production answer: schema-on-write at ingest, with raw archive for historical flexibility.
Streaming serving store vs warehouse for dashboards
Streaming store gives sub-second freshness and sub-second queries but limited SQL. Warehouse gives full SQL and modeled data but minute-level freshness. Use both, route via the semantic layer.
Pull vs push from sources
Pull (CDC, periodic dumps) is simpler but lags. Push (events emitted by services in real time) is fresher but couples application code to the pipeline. Most companies use both - push for high-value events, pull for system-of-record data (orders, users) via CDC.
Strict vs permissive ingestion
Strict (reject malformed events) keeps downstream clean but creates incident pressure on producers. Permissive (accept everything; clean later) creates backlog of bad data. Pick strict for high-value events; permissive (with quarantine) for exploratory.
Centralized platform vs per-team pipelines
Centralized platform gives consistency, shared cost optimization, governance. Per-team pipelines move faster initially but create chaos. Most companies start per-team and consolidate around year 3.
Tight vs loose coupling between batch and streaming
Tight (same code paths via Apache Beam etc.) reduces drift but constrains your tools. Loose (separate codebases) is flexible but drifts. Modern engines (Flink unified batch + streaming, Spark Structured Streaming) are pushing toward tight coupling.
Buy vs build
Snowflake + Fivetran + dbt + Looker is fast to stand up and expensive at scale. Open source (Iceberg + Trino + Airflow + Superset) is cheaper at scale and slower to build. The crossover point is usually 1-5 PB; below that, buy; above, build with managed components.
Common follow-up questions
Be ready for at least three of these. The first one is almost always asked.
- How would you cut a $5M/month bill by 30% without losing capabilities?
- How do you prevent one bad team's events from poisoning the whole pipeline?
- What changes if events must be GDPR-deletable end-to-end?
- How would you migrate from a lambda to a kappa architecture without downtime?
- What's your strategy for a critical dashboard going stale at 3am?
- How do you handle a 100x volume spike during a viral product launch?
- What's your story for backfilling 6 months of corrected revenue?
- How would you support real-time ML feature engineering on the same pipeline?
Related system design topics
Notifications
Hard. Fan-out at write vs read, at-least-once vs exactly-once, dead-letter queues, and the multi-channel delivery problem - one message, ten failure modes.
Search + Autocomplete
Hard. Inverted indexes, BM25 ranking, prefix tries, and the p99 < 100ms latency budget that drives every architectural choice.
Practice in interview format
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.