Top 50 Data Engineering Interview Questions 2026

The data engineering job has shifted in 2026. The old skill set (write a Spark job, schedule it on Airflow, dump it to a warehouse) is now table stakes. The new bar is "design a system that does not bankrupt the company at scale, that does not silently drop data, and that the analytics team can self-serve on top of."

This list reflects that. SQL is still the heaviest weighted category because it always is. Then pipelines, then storage and modeling, then streaming, then the system design questions that have become the differentiator at the senior level.

Part 1: SQL Deep Dive (1-12)

1. What is the difference between INNER, LEFT, RIGHT, and FULL OUTER JOIN?

INNER - rows where both sides match.
LEFT - all rows from left, NULL where right does not match.
RIGHT - mirror of LEFT.
FULL OUTER - all rows from both sides, NULL where one is missing.

CROSS JOIN is also worth knowing - the cartesian product, useful for date-spine generation.

2. What is a window function?

A function that operates on a "window" of rows around the current row, without collapsing them like a GROUP BY would. ROW_NUMBER(), RANK(), LAG(), LEAD(), SUM() OVER (PARTITION BY ...). The most-asked DE concept in 2026 because it cleanly answers "rolling sum," "first/last per group," "rank within group."

3. Write a query to find the second-highest salary in a table.

SELECT MAX(salary) AS second_highest
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

Or with window functions:

SELECT salary
FROM (
  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
  FROM employees
) ranked
WHERE rk = 2;

4. What is a CTE and when should you prefer it over a subquery?

Common Table Expression: WITH cte AS (...) SELECT ... FROM cte. Prefer over subqueries for readability when the same intermediate result is used multiple times or when query logic gets nested. Most modern engines optimize CTEs as well as subqueries.

5. What is the difference between RANK, DENSE_RANK, and ROW_NUMBER?

For ties:

ROW_NUMBER - 1, 2, 3, 4
RANK - 1, 1, 3, 4 (gaps after ties)
DENSE_RANK - 1, 1, 2, 3 (no gaps)

6. Write a query to find duplicate rows.

SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

To find the actual duplicate rows:

SELECT *
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at) AS rn
  FROM users
) t
WHERE rn > 1;

7. What does HAVING do that WHERE cannot?

WHERE filters rows before aggregation. HAVING filters groups after aggregation. You can use aggregate functions in HAVING but not in WHERE.

8. What is the difference between UNION and UNION ALL?

UNION removes duplicates. UNION ALL does not. UNION ALL is faster - use it unless you specifically need deduplication.

9. How do you delete duplicate rows while keeping one copy?

Most engines support:

DELETE FROM users
WHERE id NOT IN (
  SELECT MIN(id) FROM users GROUP BY email
);

Or with window functions in engines that support it:

DELETE FROM users
WHERE id IN (
  SELECT id FROM (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at) AS rn
    FROM users
  ) t WHERE rn > 1
);

10. Write a query to find users who placed an order in January but not in February.

SELECT DISTINCT user_id
FROM orders
WHERE order_date BETWEEN '2026-01-01' AND '2026-01-31'
  AND user_id NOT IN (
    SELECT DISTINCT user_id
    FROM orders
    WHERE order_date BETWEEN '2026-02-01' AND '2026-02-28'
  );

LEFT JOIN ... WHERE NULL and EXCEPT are also valid. Be ready to discuss trade-offs.

11. What is a correlated subquery and why can it be slow?

A subquery that references a column from the outer query. Often executes once per outer row, which can be O(n²). Modern optimizers convert many correlated subqueries into joins, but not always - rewrite to a JOIN when performance matters.

12. What is the difference between DELETE, TRUNCATE, and DROP?

DELETE removes rows, can be filtered, logs each deletion, fires triggers.
TRUNCATE removes all rows fast, no per-row logging, usually does not fire triggers.
DROP removes the table entirely.

Part 2: Data Modeling and Warehousing (13-22)

13. What is a star schema?

A central fact table surrounded by dimension tables. Facts hold measurable events (sales, clicks). Dimensions hold descriptive attributes (customer, product, time). Star schema is the most common analytics layout.

14. What is a snowflake schema?

A star schema where dimensions are normalized into multiple tables. Saves space, costs joins. Most modern warehouses (Snowflake, BigQuery, Redshift) prefer star or wide tables because joins are cheap and storage is not the bottleneck.

15. What is a slowly changing dimension (SCD)?

A dimension whose values change over time (customer's address). SCD types:

Type 1: overwrite (lose history).
Type 2: add a new row with effective date ranges (keep history).
Type 3: add a new column for the previous value.

Most analytics shops standardize on Type 2 with valid_from and valid_to columns.

16. What is dimensional modeling vs Data Vault?

Dimensional (Kimball) is fact + dimensions, optimized for queries. Data Vault separates business keys, descriptive attributes (sat), and relationships (link), optimized for source-of-truth integrity. Most teams use dimensional for analytics layer, optionally with a Data Vault layer underneath for raw integration.

17. What is a fact table grain?

The level of detail at which a fact is recorded. "One row per order" vs "one row per order line." Mixing grains in the same fact table is a common error and breaks aggregations.

18. What is denormalization and why do warehouses prefer it?

Combining tables that would be normalized in OLTP. Warehouses are columnar - selecting a few columns from a wide table is cheap, while joins can be expensive. Most modern marts are denormalized "wide" tables for query performance.

19. What is partitioning vs clustering in a warehouse?

Partitioning - data physically split by a column (typically date). Pruning at query time skips entire partitions.
Clustering / sort keys - data within a partition is sorted by another column for range scans.

In BigQuery and Snowflake, partitioning is on date and clustering on high-cardinality filter columns is the standard pattern.

20. How do columnar storage formats (Parquet, ORC) differ from row-oriented?

Columnar stores all values of one column together, then the next column. Compresses well (similar values), great for analytics queries that touch a few columns out of many. Row-oriented is better for OLTP where you fetch entire records.

21. What is a data lake vs data warehouse vs lakehouse?

Data lake - raw files in object storage, schema-on-read.
Data warehouse - structured, schema-on-write, optimized for queries.
Lakehouse - lake storage with warehouse semantics via table formats like Delta, Iceberg, or Hudi.

In 2026 the lakehouse pattern is dominant for new builds.

22. What is Apache Iceberg and why does it matter?

An open table format on top of Parquet (or other formats) that adds ACID transactions, schema evolution, time travel, and partition evolution. Lets you treat object storage as a queryable, transactional database. Snowflake, BigQuery, Athena, and Spark all support it. The 2026 default for lakehouse builds.

Part 3: Pipelines and Orchestration (23-32)

23. What is the difference between batch and streaming?

Batch processes finite data on a schedule. Streaming processes infinite data in continuous time. The line blurs with "micro-batch" approaches. Most real systems are mixed: batch for the heavy lifting, streaming for the latency-sensitive paths.

24. What is idempotency in a pipeline and why does it matter?

Running the pipeline twice produces the same result. Necessary because pipelines retry, restart, and replay. Implementation: use idempotency keys, upserts instead of inserts, deterministic processing logic.

25. What is at-least-once vs exactly-once delivery?

At-most-once - each message processed zero or one time. Loses data on failure.
At-least-once - each message processed one or more times. Requires idempotent consumers.
Exactly-once - each message processed exactly once. Hard, expensive, requires coordination between source and sink.

In 2026 most systems aim for at-least-once with idempotent processing, which gives you exactly-once semantics in practice.

26. What is Apache Airflow and what alternatives exist?

Airflow is a Python-based workflow orchestrator. DAGs of tasks, scheduling, retries, monitoring. Alternatives: Prefect (more Pythonic), Dagster (asset-oriented, increasingly popular in 2026), Argo Workflows (Kubernetes-native), Temporal (durable execution model).

27. What is a DAG?

Directed Acyclic Graph. Tasks with dependencies, no cycles. The model for batch workflows in Airflow, dbt, Dagster.

28. What is dbt and how does it fit into the modern data stack?

dbt is a SQL transformation layer that runs on top of your warehouse. Define models as SELECT statements with Jinja templating, dbt builds the DAG, runs them in dependency order, and tests. The default for transformation in 2026.

29. What is the difference between dbt models, sources, snapshots, and seeds?

Models - SELECT statements that build tables/views.
Sources - declarations of raw tables that feed your models.
Snapshots - SCD Type 2 tracking of source tables.
Seeds - small CSV files version-controlled with the project.

30. What is a CDC pipeline?

Change Data Capture. Reads the database transaction log and emits a stream of inserts/updates/deletes. Tools: Debezium, AWS DMS, Fivetran (managed). Foundation of most "near-real-time" data architectures in 2026.

31. How do you handle schema evolution in a pipeline?

Strategies:

Backward-compatible changes (add nullable columns) at the source.
Schema registry (Confluent, AWS Glue) to enforce contracts at write time.
Iceberg / Delta table formats that support add/rename/drop column.
Loud failures with explicit migrations rather than silent type coercion.

32. What is backfill and how do you do it safely?

Re-running a pipeline over historical data, often after a bug fix or schema change. Safety: idempotent operations, explicit date range parameters, separate environment or table when possible, monitoring of runtime/cost. Backfilling 2 years of data on production at 2 a.m. without anyone noticing is a real pattern; have an answer for "what guardrails would you add."

Part 4: Streaming and Real-Time (33-40)

33. What is Apache Kafka and why is it the de facto streaming platform?

A distributed log. Producers write to topics, consumers read at their own pace. Persistent, partitioned, replicated. Decouples producers and consumers, gives you replayability, scales horizontally. Alternatives: AWS Kinesis, GCP Pub/Sub, Redpanda (Kafka-compatible).

34. What is a Kafka partition and how do you choose a partition key?

A topic is split into partitions for parallelism. The partition key determines which partition a message lands in. Within a partition, order is preserved; across partitions, it is not. Choose a key that distributes evenly and that does not require cross-partition ordering for your consumers.

35. What are watermarks in stream processing?

A heuristic for "we have seen all data up to time T." Used to decide when to close a window and emit aggregates. Late-arriving data either gets dropped, side-output, or triggers a window update depending on configuration.

36. What is the difference between event time and processing time?

Event time - when the event actually occurred (e.g., the timestamp on a log line).
Processing time - when the event reaches the pipeline.

Always prefer event time for correctness. Processing-time windows look correct in normal conditions and silently miscount during outages or backfills.

37. What are tumbling, hopping, and session windows?

Tumbling - fixed-size, non-overlapping (every 5 minutes).
Hopping - fixed-size, overlapping (every 5 minutes, hopping every 1 minute).
Session - dynamic, based on inactivity gaps.

38. What is exactly-once in Kafka?

Idempotent producers + transactional writes mean a message is written to a topic exactly once even with retries. Pair with read-committed consumers for end-to-end exactly-once. The catch: you need transactional sinks too, otherwise it is "exactly-once into Kafka, at-least-once into the database."

39. What is Apache Flink and how does it compare to Spark Streaming?

Flink is a true streaming engine with sub-second latency. Spark Streaming is micro-batch. Flink is typically chosen when latency matters and you want stateful processing with exactly-once semantics. Spark is chosen when you already have a Spark stack.

40. What is the Lambda architecture, and why has it largely been replaced?

Lambda = batch layer + speed layer + serving layer. Replaced by Kappa (single streaming pipeline that can be reprocessed) and modern lakehouse approaches that unify batch and streaming. Maintenance of two parallel pipelines is the cost that killed Lambda.

Part 5: System Design and Cost (41-50)

41. Design a pipeline that ingests 100K events/sec from a mobile app.

Outline:

Ingest: Kafka or Kinesis with sufficient partitions for throughput and parallelism.
Buffer: at-least-once durability before any processing.
Stream processor: Flink or Kafka Streams for windowed aggregations.
Storage: write raw to object storage (S3) for replay; write aggregates to a serving store (Redis or DynamoDB) for low-latency reads; write enriched data to the warehouse for analytics.
Schema: schema registry, contract-tested.
Monitoring: lag, error rate, end-to-end latency.

Trade-offs to discuss: cost vs latency, exactly-once cost, cold-storage policies.

42. How do you reduce warehouse query cost?

Partition pruning (queries filter on partition column).
Materialized views for repeated aggregations.
Smaller virtual warehouses / scaled compute.
Schedule heavy queries during off-peak.
Move analyst queries onto a smaller compute pool with quotas.
Cache frequent queries.

In 2026, warehouse costs are the second-largest cloud line item after compute for most data orgs - have a clear point of view on this.

43. What is data quality and how do you measure it?

Dimensions: completeness, accuracy, consistency, timeliness, uniqueness, validity. Tools: dbt tests, Great Expectations, Soda, Monte Carlo. Implementation: tests in CI, runtime checks on critical tables, alerting tied to data SLOs.

44. What is a data contract?

A formal agreement between data producer and consumer about schema, semantics, freshness, and ownership. The 2026 fix for "the upstream team renamed a column and broke 14 dashboards." Tools: Apache Iceberg schema, Avro with a registry, dbt contracts, Atlan / Collibra catalog.

45. What is data lineage and why does it matter?

Tracking where data came from and where it flowed. Required for incident response, regulatory compliance, and impact analysis. dbt produces lineage natively. OpenLineage is the open standard.

46. What is the difference between OLTP and OLAP?

OLTP - Online Transaction Processing. Many small reads/writes, row-oriented, normalized (Postgres, MySQL).
OLAP - Online Analytical Processing. Few large scans, columnar, denormalized (Snowflake, BigQuery).

The boundary blurs with HTAP systems (Singlestore, TiDB).

47. How do you handle PII in a data warehouse?

Tagged columns at the catalog level. Row-level and column-level access controls. Dynamic data masking for non-privileged consumers. Tokenization or pseudonymization for analytics use. Purge policies tied to legal retention. In regulated industries, separate datasets per region.

48. What is a feature store?

A specialized data system for ML features. Two interfaces: low-latency online lookup for inference, batch interface for training. Ensures train/serve consistency. Tools: Feast, Tecton, AWS SageMaker Feature Store.

49. How do you run a pipeline migration without downtime?

Dual-write or dual-read pattern. Run new and old in parallel, compare outputs, cut over once parity is verified. Roll back is easy because the old pipeline is still running. Schedule old-pipeline shutdown only after a soak period.

50. What metrics would you track for a healthy data platform?

Service-level: pipeline success rate, end-to-end latency, freshness lag per table, query failure rate.
Cost: $ per pipeline run, $ per table refresh, $ per analyst-week.
Quality: tests passing rate, anomaly count, schema-change incidents.
Adoption: active users, queries per dataset, dashboard freshness.

If you can articulate a service-level agreement / objective story for your data platform, you are senior.

How to Drill These

The bar for data engineering interviews in 2026 is "you can write SQL in your sleep, you have hands-on with Spark or Kafka or both, you have shipped a real pipeline end-to-end, and you can talk about cost and quality with a straight face."

Practical reps:

Build a personal pipeline end-to-end: ingest a public API into S3, transform with dbt or Spark, serve via a small dashboard. Document trade-offs.
Read The Data Engineering Cookbook for a free curriculum.
Practice on real data sets - Kaggle, NYC Taxi, GitHub Archive.

If you want to drill these under interview pressure, gitGood has a Data Engineering question bank, system design walkthroughs, and a chat-based AI mock interview where an AI interviewer probes your answers with follow-ups.

#dataengineering #sql #spark #airflow #dbt #interviews #career