Data Engineering Interview Questions
Practice data engineering topics including Spark, Kafka, dbt, Airflow, modern warehouses (Snowflake / BigQuery / Redshift), CDC, lakehouse table formats (Iceberg / Delta), and schema evolution.
Frequently Asked Questions
What does a typical data engineering interview cover?
Modern data engineering interviews touch four layers: ingestion (CDC vs polling, Fivetran / Airbyte / Debezium), storage (warehouse vs lake vs lakehouse, Iceberg / Delta / Hudi), transformation (dbt models / tests / sources, batch vs streaming with Spark / Flink / Beam), and orchestration (Airflow / Dagster / Prefect). On top of that, expect SQL fluency, dimensional modeling (star schema, SCD types), and hands-on experience with at least one cloud warehouse (Snowflake, BigQuery, Redshift, or Databricks SQL).
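For the orchestration layer, interviewers often ask you to sketch a simple dependency chain. Here is a minimal Airflow 2.x sketch, assuming an ingestion script followed by a dbt run; the DAG name, schedule, script, and dbt selector are all illustrative placeholders, not a prescribed setup.

```python
# Minimal Airflow DAG: one ingestion task feeding one dbt run.
# Names, schedule, and commands are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_orders_pipeline",       # hypothetical pipeline name
    start_date=datetime(2026, 1, 1),
    schedule="@daily",                     # Airflow 2.4+ keyword; older versions use schedule_interval
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_orders",
        bash_command="python ingest_orders.py",        # placeholder ingestion script
    )
    transform = BashOperator(
        task_id="dbt_run_marts",
        bash_command="dbt run --select marts.orders",  # placeholder dbt selector
    )

    ingest >> transform   # transformation runs only after ingestion succeeds
```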
Do I need to know both Spark and dbt?
Most modern teams use both: Spark (or Flink) for large-scale and streaming transformations, and dbt for SQL-native warehouse modeling. Newer 'small data' teams often skip Spark entirely and run everything in Snowflake/BigQuery via dbt; older teams may run Spark-heavy lakes without dbt. Knowing both makes you flexible; if forced to pick one, dbt has the wider hiring footprint in 2026.
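To make the split concrete, here is a minimal sketch of the same daily-revenue rollup written as a PySpark job, with the dbt model a warehouse-native team would write instead shown as a comment. Table names, columns, and paths are hypothetical.

```python
# The same rollup two ways: a PySpark job over lake files, or (see the
# comment at the bottom) a dbt SQL model run inside the warehouse.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_revenue").getOrCreate()

orders = spark.read.parquet("s3://lake/raw/orders/")   # placeholder path

daily_revenue = (
    orders
    .where(F.col("status") == "completed")
    .groupBy(F.to_date("ordered_at").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.mode("overwrite").parquet("s3://lake/marts/daily_revenue/")

# Equivalent dbt model (e.g. models/marts/daily_revenue.sql):
#   select date(ordered_at) as order_date, sum(amount) as revenue
#   from {{ ref('stg_orders') }}
#   where status = 'completed'
#   group by 1
```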
How important is real-time / streaming in interviews?
It depends on the company. Most analytics-focused teams (BI dashboards, finance reporting, attribution) work in batch or micro-batch and rarely need true streaming. Streaming matters for ad tech, fraud detection, IoT, and real-time personalization; if you're interviewing there, expect Kafka deep dives (partitions, consumer groups, exactly-once semantics), event-time vs processing-time, watermarks, and stateful streaming with Flink or Kafka Streams.
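Event time and watermarks are easiest to show in code. Below is a minimal Spark Structured Streaming sketch that reads a Kafka topic and aggregates on event time with a late-data watermark; the broker address, topic, schema, and sink are assumptions, and the job needs the spark-sql-kafka connector on the classpath.

```python
# Event-time aggregation over a Kafka stream with a watermark.
# Broker, topic, schema, and sink are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("purchases_stream").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("amount", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "purchases")                     # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Watermark: tolerate events up to 10 minutes late by *event* time, then
# aggregate into 5-minute tumbling windows per user.
windowed = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "user_id")
    .agg(F.sum("amount").alias("spend"))
)

query = (
    windowed.writeStream
    .outputMode("update")
    .format("console")   # placeholder sink; real jobs write to a table or topic
    .start()
)
```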
What's the difference between a data engineer and an analytics engineer?
Data engineers own the platform: ingestion, infra, pipelines, schemas, and the warehouse itself. Analytics engineers (a dbt-coined role) own the transformation layer: turning raw tables into clean, modeled, tested business datasets that analysts use. Data engineering interviews test infra and pipeline skills more deeply; analytics engineering interviews focus on SQL, dimensional modeling, and dbt.
Why are Iceberg / Delta / Hudi a big deal?
These open table formats add ACID transactions, schema evolution, time travel, and hidden partitioning to Parquet files on object storage - the features warehouses had and lakes didn't. They enable the 'lakehouse' pattern: BI, ML, and streaming workloads all read the same files, with no separate warehouse copy to maintain. Iceberg's engine neutrality (Spark / Trino / Snowflake / BigQuery / DuckDB can all read it) is why it is the dominant format heading into 2026.
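A minimal sketch of what those features look like in practice, using Spark SQL against an Iceberg catalog. The catalog and table names and the snapshot id are assumptions, and the exact DDL/time-travel syntax varies slightly across Spark and Iceberg versions (VERSION AS OF needs Spark 3.3+).

```python
# Lakehouse table features via Spark SQL on an Iceberg table.
# Catalog/table names and the snapshot id are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg_demo")
    # Assumes an Iceberg catalog named "lake" is already configured on the cluster.
    .getOrCreate()
)

# Schema evolution: adding an optional column is a metadata-only change.
spark.sql("ALTER TABLE lake.sales.orders ADD COLUMN discount_pct DOUBLE")

# Time travel: read the table as of an earlier snapshot id.
previous = spark.sql(
    "SELECT * FROM lake.sales.orders VERSION AS OF 4348617293948272911"
)

# Hidden partitioning: filter on the raw column; Iceberg prunes the derived
# partitions (e.g. days(ordered_at)) without a partition column in the query.
recent = spark.sql(
    "SELECT count(*) FROM lake.sales.orders WHERE ordered_at >= DATE '2026-01-01'"
)
```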
How should I think about schema evolution and data contracts?
Adding a nullable or default-valued field is safe (forward and backward compatible). Renaming fields, narrowing types, or removing required fields breaks consumers. Data contracts formalize this: producers declare a versioned schema with an SLA and a breaking-change policy, CI verifies that producer changes stay compatible, and downstream teams depend on the contract, not the implementation. Combine this with a schema registry and dbt source freshness checks for end-to-end reliability.
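The compatible-change case fits in a few lines of Avro. This sketch uses the fastavro library: a record written with schema v1 is still readable under schema v2 because the new field carries a default. The schemas and field names are illustrative.

```python
# Backward-compatible schema evolution in miniature with Avro (fastavro).
# Schemas and field names are hypothetical.
import io
from fastavro import schemaless_writer, schemaless_reader

ORDER_V1 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# v2 adds a field with a default: old data stays readable (backward compatible)
# and old readers can ignore the new field (forward compatible).
ORDER_V2 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, ORDER_V1, {"order_id": "o-1", "amount": 42.0})
buf.seek(0)

# Resolve v1 bytes against the v2 reader schema; the missing field takes its default.
record = schemaless_reader(buf, ORDER_V1, ORDER_V2)
print(record)   # {'order_id': 'o-1', 'amount': 42.0, 'currency': 'USD'}
```

Renaming or removing a required field fails exactly this resolution step, which is what a schema-registry compatibility check in CI is meant to catch before deployment.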