We use cookies for site analytics. Accept to help us understand how the site is used. See our Privacy Policy for details.
Batch vs streaming, lambda vs kappa, the warehouse-vs-lakehouse decision, and dimension modeling that survives schema drift.
Design a pipeline that ingests every user-facing event the product produces, makes the freshest data available for real-time dashboards within seconds, makes the cleanest data available for analyst queries within hours, and never loses an event. Schemas evolve weekly; new event types appear; old ones get repurposed.
This is the systems analog of "design Netflix data infra." The interview signal is whether the candidate can defend lambda vs kappa, separate ingestion from transformation, and explain why the warehouse and the streaming store are not the same database. Strong candidates draw the timeline of a single event from emission to dashboard and identify where it can break.
Asking these before diving into a solution is the difference between a "hire" and a "no signal" rating. Pick the questions whose answers would change your design.
Storage
Throughput
Network
Cost shape
Cardinality budget
The pipeline has three layers: collection, processing, serving.
Collection: SDKs and backend services emit events to a thin ingest API; the API writes to Kafka. Kafka is the durable boundary - once an event is in Kafka, the pipeline guarantees delivery; before Kafka, it's the producer's problem.
Processing: two parallel paths from Kafka. The streaming path (Flink / Spark Structured Streaming) computes near-real-time aggregates and writes them to a fast serving store (Pinot / Druid / ClickHouse). The batch path lands raw events in object storage (S3) as Parquet, then Spark / dbt builds modeled tables in a warehouse (Snowflake / BigQuery / Iceberg lakehouse).
Serving: dashboards read from the streaming store for fresh metrics and from the warehouse for historical / modeled queries. A semantic layer routes queries to the right backend.
The lambda architecture (separate batch + streaming with the same logic implemented twice) is a maintenance nightmare. The kappa architecture (everything is streaming, batch is just a long replay) is conceptually clean but operationally hard. Production systems pick one with caveats.
Stateless HTTP / gRPC endpoint. Validates schema, enriches with server-side metadata (timestamp, ip, geo), writes to Kafka. The thinnest possible layer - business logic lives downstream.
Durable, partitioned, replicated. The contract boundary between producers and the rest of the pipeline. Topic per event family or per source. Retention 7-30 days hot.
Holds versioned Avro / Protobuf schemas. Producers register; consumers fetch on read. Enforces backward compatibility on schema evolution.
Real-time path. Reads from Kafka, computes windowed aggregates, joins, sessionizes. Writes to the serving store and / or back to Kafka for downstream consumers.
Optimized for sub-second analytical queries on recent data. Columnar, distributed. Powers real-time dashboards.
Lands raw events as partitioned Parquet. Source of truth for batch reprocessing. Cheap at scale.
Adds ACID, schema evolution, and time-travel on top of S3 Parquet. Lets multiple engines (Spark, Trino, Snowflake) read consistently.
Builds modeled tables - facts, dimensions, marts. Runs on schedule (hourly / daily). Materializes denormalized views the warehouse layer can serve fast.
Serves analyst SQL. Scales compute on demand. Caches results. Per-query cost shows up in your bill - encourage modeled tables, not raw scans.
Defines metrics once, in code. Routes dashboard queries to streaming store or warehouse based on freshness need. Prevents 10 dashboards from reimplementing the same metric inconsistently.
Tracks which raw events feed which datasets and dashboards. Detects schema drift, freshness violations, and broken upstream contracts before downstream users notice.
The subsystems where the interview is actually decided. Skim if you're running short; own these if you want a strong signal.
Two architectural philosophies. Most companies start lambda, regret it, and migrate toward kappa.
Lambda architecture
Two parallel pipelines - batch and streaming - implementing the same business logic. Streaming gives fresh-but-approximate answers; batch gives delayed-but-correct answers. A serving layer reconciles.
Pros: each path uses the right tool. Streaming engines are simple. Batch engines handle reprocessing well.
Cons: the same logic is implemented twice (Java/Scala for Flink, SQL/Python for Spark). Drift is inevitable. Bug fixes go in twice. Two on-call rotations.
Kappa architecture
Streaming is the only paradigm. Reprocessing is just "rewind Kafka to the start of the window and replay through the same job". One codebase.
Pros: no duplication. Reprocess by replaying. Mental model is simple.
Cons: streaming engines historically struggled with massive backfills (months of replay). Flink and Spark Structured Streaming have closed most of this gap. State stores at petabyte scale are still hard.
The pragmatic answer
For new systems: kappa-leaning. Use Flink or Spark Structured Streaming as the only processor. Keep raw events in Kafka long enough for the realistic backfill window (30 days is usually enough). Use object storage as the long-term archive that you reload into Kafka for >30-day backfills.
For existing lambda systems: don't rewrite for ideology. Migrate one pipeline at a time when it's already due for rewrite. Use the semantic layer to hide which path served a query.
The "streaming is just batch with a small window" view
True at the math level, false at the engineering level. Batch jobs read whole partitions, can shuffle freely, can fail and restart. Streaming jobs hold state continuously, can't restart cheaply, must handle late data and watermarks. The engineering models are different even if the math is similar.
Watermarks and late events
Streaming aggregates have to decide "when is this minute's count finalized?" Watermarks are the answer: "we believe no events older than time T will arrive". Events arriving after the watermark are late - dropped, or routed to a side output for separate handling.
Late events break naive aggregations: a "users active last hour" count published at minute 60 must allow events from minute 59 to arrive at minute 65 and revise the count. The serving store must support upserts.
Three serving layers, three different jobs. Conflating them is the most common architectural mistake.
Cloud data warehouse (Snowflake, BigQuery, Redshift)
Optimized for analyst SQL on modeled data. Columnar, MPP. Strong concurrency for many users running queries. Compute and storage are decoupled - scale independently.
Strengths: ANSI SQL, semantic features (window functions, CTEs), great UX. Auto-scaling for spiky workloads. Strong governance.
Weaknesses: query latency is seconds-to-minutes - not for live dashboards. Data must be loaded; not great for raw event volumes at petabyte scale.
Lakehouse (Databricks, Iceberg / Delta / Hudi on S3)
Open table formats on object storage. ACID transactions. Schema evolution. Time travel. Multiple engines (Spark, Trino, Snowflake external tables, Athena) read the same data.
Strengths: cheap at petabyte scale ($25/TB/month). No vendor lock-in. ML training reads the same tables as analysts.
Weaknesses: more operational complexity. Concurrency limits when many engines write the same table.
Real-time OLAP store (Pinot, Druid, ClickHouse)
Optimized for sub-second queries on recent, high-cardinality data. Specialized for dashboards where freshness matters more than completeness.
Strengths: ~100ms p99 query latency. Streaming ingestion native. Multi-billion-row tables interactive.
Weaknesses: limited SQL. Schema is rigid. Storage cost higher than the warehouse.
The right choice depends on the query
Real-time dashboard ("active users last 5 minutes") → Pinot.
Analyst "weekly active users by country, last quarter" → warehouse.
ML training "all events last year for feature engineering" → lakehouse.
Reverse ETL ("send this segment to Salesforce") → warehouse.
Production architectures use all three
A unified semantic layer routes queries based on freshness and complexity. The warehouse and lakehouse share table formats (Iceberg) so the same data isn't copied twice. The OLAP store is the freshest tier but holds only days of data.
Raw events are uglify. Dimension modeling turns them into something analysts can query without specialized knowledge.
Star schema basics
Facts: numeric, additive measurements (revenue, click count, latency). Wide, deep tables.
Dimensions: descriptive context (user, product, time, geography). Narrower, slowly changing.
Each fact references dimensions by surrogate key. Joins are denormalized in the query layer. Storage is wider than 3NF but query performance and analyst comprehension are massively better.
Slowly changing dimensions (SCD)
A user's country changes. How do we record it?
SCD Type 1: overwrite. Simplest. Loses history.
SCD Type 2: insert a new dimension row with effective_from / effective_to. Preserves history. Joins on facts use the dim row valid at the fact's timestamp.
SCD Type 6: hybrid. Current value + history.
For analytics, Type 2 is the right answer. Type 1 silently rewrites history; reports change retroactively when nobody expects it.
Schema evolution at scale
Events are emitted by hundreds of services across hundreds of teams. Schemas change constantly.
Rules of evolution:
The schema registry enforces these rules at producer registration time. Producers can't push an incompatible schema.
Versioning strategy
Each event has a schema_version field. Consumers read with the schema version specified in the message. Old data remains queryable with old schemas; new data uses new.
Late-arriving dimensions
A fact event references a user that doesn't yet exist in the user dimension table (signup row hasn't synced yet). Two strategies:
Placeholder strategy is standard. The fact is preserved; the dim row is filled in eventually. SQL joins on surrogate keys keep working.
Naming conventions
fact_<event_type> (fact_orders, fact_pageviews). dim_<entity> (dim_users, dim_products). agg_<grain> (agg_daily_revenue). Strict naming pays off massively at scale.
One Big Table (OBT) vs star
Modern columnar engines (Snowflake, BigQuery) compress denormalized OBTs almost as well as stars and avoid joins. For warehouses that compute fast joins anyway, OBT is often the right answer at petabyte scale. Argue for it explicitly with cost numbers.
A bug in the transform layer mis-attributed two weeks of revenue. Recompute everything correctly, without breaking live dashboards.
The reprocess scenarios
The backfill timeline
Idempotent jobs
Reprocess relies on idempotency: running the same job twice produces the same output. Achieved by:
Replay through Kafka (kappa-style)
Start a parallel job consuming Kafka from offset T0. Job writes to a parallel output table. Once caught up to current, atomically swap the table - rename or change the view.
Cost: the replay job runs alongside production for hours/days. Plan capacity headroom.
Replay from object storage (lambda-style)
Run a batch job against the raw event archive in S3. Reads partitioned Parquet for the affected window. Writes to the corrected output table.
Cost: per-byte S3 read + Spark compute. At petabyte scale, a single full replay can cost $50K-$200K.
Coordinating with downstream consumers
Dashboards backed by the corrected table see updated numbers. Users notice the change. Communicate explicitly: "Revenue dashboard restated for Mar 1-14. Reason: bug in attribution logic. Old numbers are wrong."
This is a comms problem, not a tech problem - but the pipeline must support it.
Streaming reprocess
The kappa promise: reprocess is just replay. In practice:
Modern stream processors (Flink savepoints, Spark Structured Streaming checkpoints) make this practical at TB scale, harder at PB.
An analytics pipeline that produces wrong numbers fast is worse than one that produces no numbers. Quality is part of the architecture, not a layer on top.
Data quality dimensions
Each dimension needs automated checks at the right pipeline stage.
The check ladder
Freshness SLAs
Each output table has a documented SLA (e.g., "fact_orders is current to within 30 minutes 99% of the time"). The lineage system tracks SLA per table; on-call alerts when violated.
Lineage
Track every transform: input tables → output tables. Tools: OpenLineage, Marquez, Datahub.
Use cases:
Privacy and the GDPR delete problem
A user requests deletion. The user's row in the warehouse must be deleted (or anonymized). But the user's events are baked into 50 derived tables.
Options:
Most companies do (1) for ongoing pipelines and (3) for archive. Few do (2) regularly.
SLAs as code
Freshness, completeness, and quality SLAs go in a spec file checked into the same repo as the transform code. CI runs them daily. Violations create tickets automatically. The on-call engineer doesn't get to decide what's "good enough" ad hoc.
An analytics pipeline at this scale spends $1-10M/month on cloud bills. The biggest line items have outsized leverage.
The cost breakdown (typical)
Storage levers
Compute levers
Network levers
The chargeback model
Every team that produces or consumes data has a budget. Cost is allocated by source (events emitted), by transform (compute), and by query (BQ slot-seconds, Snowflake credits). Without chargeback, every team optimizes their own thing and the central platform cost grows unbounded.
The death spiral
Pattern: a new dashboard is built on raw events; a few hundred users use it; query costs explode; the platform team fire-fights; underlying issue (no modeled table) is never fixed. Solution: every dashboard must source from a modeled table, not raw. Enforced in the semantic layer.
Lambda vs kappa
Lambda lets you use the right tool but costs you the same logic implemented twice. Kappa is conceptually clean but operationally hard at extreme scale. Default to kappa-leaning for new systems.
Schema-on-read vs schema-on-write
Schema-on-read (raw JSON, parsed at query time) is flexible but produces dirty queries and slow performance. Schema-on-write (Avro/Protobuf with a registry) is strict but gives strong contracts and fast queries. Production answer: schema-on-write at ingest, with raw archive for historical flexibility.
Streaming serving store vs warehouse for dashboards
Streaming store gives sub-second freshness and sub-second queries but limited SQL. Warehouse gives full SQL and modeled data but minute-level freshness. Use both, route via the semantic layer.
Pull vs push from sources
Pull (CDC, periodic dumps) is simpler but lags. Push (events emitted by services in real time) is fresher but couples application code to the pipeline. Most companies use both - push for high-value events, pull for system-of-record data (orders, users) via CDC.
Strict vs permissive ingestion
Strict (reject malformed events) keeps downstream clean but creates incident pressure on producers. Permissive (accept everything; clean later) creates backlog of bad data. Pick strict for high-value events; permissive (with quarantine) for exploratory.
Centralized platform vs per-team pipelines
Centralized platform gives consistency, shared cost optimization, governance. Per-team pipelines move faster initially but create chaos. Most companies start per-team and consolidate around year 3.
Tight vs loose coupling between batch and streaming
Tight (same code paths via Apache Beam etc.) reduces drift but constrains your tools. Loose (separate codebases) is flexible but drifts. Modern engines (Flink unified batch + streaming, Spark Structured Streaming) are pushing toward tight coupling.
Buy vs build
Snowflake + Fivetran + dbt + Looker is fast to stand up and expensive at scale. Open source (Iceberg + Trino + Airflow + Superset) is cheaper at scale and slower to build. The crossover point is usually 1-5 PB; below that, buy; above, build with managed components.
Be ready for at least three of these. The first one is almost always asked.
Fan-out at write vs read, at-least-once vs exactly-once, dead-letter queues, and the multi-channel delivery problem - one message, ten failure modes.
Inverted indexes, BM25 ranking, prefix tries, and the p99 < 100ms latency budget that drives every architectural choice.
Reading is the floor. The interview signal is in walking through this live with someone probing follow-ups. Use the AI mock interview to practice talking through requirements, architecture, and trade-offs out loud.
Start an AI mock interview →