
Data Engineering Interview Questions

Practice data engineering topics including Spark, Kafka, dbt, Airflow, modern warehouses (Snowflake / BigQuery / Redshift), CDC, lakehouse table formats (Iceberg / Delta), and schema evolution.

29 questions total: 4 Easy · 16 Medium · 9 Hard
Showing questions 1-20 of 29:
Spark RDDs vs DataFrames (Quiz, Medium)
Spark Broadcast Joins (Quiz, Medium)
Spark Partitioning (Quiz, Hard)
Kafka Partition Keys (Quiz, Medium)
Kafka Consumer Groups (Quiz, Easy)
Kafka Exactly-Once Semantics (Quiz, Hard)
dbt Materializations (Quiz, Medium)
dbt Generic Tests (Quiz, Easy)
dbt Sources vs Models (Quiz, Medium)
Airflow XComs (Quiz, Medium)
Airflow Sensor Modes (Quiz, Hard)
Airflow Task Idempotency (Quiz, Medium)
Snowflake Compute Model (Quiz, Medium)
BigQuery Partitioning + Clustering (Quiz, Hard)
Redshift Distribution Styles (Quiz, Hard)
Change Data Capture (Quiz, Medium)
Iceberg vs Delta Lake (Quiz, Hard)
Lakehouse Architecture (Quiz, Medium)
Schema Evolution (Quiz, Medium)
Batch vs Streaming (Quiz, Medium)

Frequently Asked Questions

What does a typical data engineering interview cover?

Modern data engineering interviews touch four layers: ingestion (CDC vs polling, Fivetran / Airbyte / Debezium), storage (warehouse vs lake vs lakehouse, Iceberg / Delta / Hudi), transformation (dbt models / tests / sources, batch vs streaming with Spark / Flink / Beam), and orchestration (Airflow / Dagster / Prefect). On top of that, expect SQL fluency, dimensional modeling (star schema, SCD types), and at least one cloud warehouse (Snowflake, BigQuery, Redshift, or Databricks SQL).
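
For the dimensional-modeling piece, a common exercise is an SCD Type 2 upsert. Here's a minimal PySpark sketch, assuming a Delta-backed dim_customer table (customer_id, email, is_current, valid_from, valid_to) and a stg_customers staging view; all names are illustrative:

```python
# SCD Type 2 sketch. Assumes Delta Lake, which supports MERGE INTO in Spark SQL,
# and the illustrative tables dim_customer / stg_customers described above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

# Step 1: close out the current row whenever a tracked attribute changed.
spark.sql("""
    MERGE INTO dim_customer AS d
    USING stg_customers AS s
      ON d.customer_id = s.customer_id AND d.is_current = true
    WHEN MATCHED AND d.email <> s.email THEN
      UPDATE SET d.is_current = false, d.valid_to = current_timestamp()
""")

# Step 2: insert a fresh current row for every customer with no open row left --
# that covers brand-new customers AND the ones whose row was just closed above.
spark.sql("""
    INSERT INTO dim_customer
    SELECT s.customer_id, s.email, true AS is_current,
           current_timestamp() AS valid_from, NULL AS valid_to
    FROM stg_customers s
    LEFT JOIN dim_customer d
      ON d.customer_id = s.customer_id AND d.is_current = true
    WHERE d.customer_id IS NULL
""")
```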

Do I need to know both Spark and dbt?

Most modern teams use both - Spark (or Flink) for big-data and streaming transformations, dbt for SQL-native warehouse modeling. Newer 'small data' teams often skip Spark entirely and run everything in Snowflake/BigQuery via dbt; older teams may run Spark-heavy lakes without dbt. Knowing both makes you flexible; if forced to pick one, dbt has a wider hiring footprint in 2026.
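
To see the contrast, here is the same daily-revenue rollup written both ways; the orders table and its columns are illustrative:

```python
# PySpark: programmatic, cluster-scale, and the same engine handles streaming.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

daily_revenue = (
    spark.table("orders")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)
daily_revenue.write.mode("overwrite").saveAsTable("daily_revenue")

# dbt: the equivalent logic lives in a SQL model file (e.g. models/daily_revenue.sql)
# and runs inside the warehouse, with tests and lineage managed by dbt:
#
#   select order_date, sum(amount) as revenue
#   from {{ ref('stg_orders') }}
#   group by order_date
```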

How important is real-time / streaming in interviews?

It depends on the company. Most analytics-focused teams (BI dashboards, finance reporting, attribution) work in batch / micro-batch and rarely need true streaming. Streaming matters for ad tech, fraud detection, IoT, and real-time personalization - if you're interviewing there, expect Kafka deep dives (partitions, consumer groups, exactly-once), event-time vs processing-time, watermarks, and stateful streaming with Flink or Kafka Streams.
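
For a concrete taste of event-time processing, here is a Spark Structured Streaming sketch that windows a Kafka topic by event time with a watermark; the broker address, topic name, and message schema are all assumptions:

```python
# Event-time windowed counts over a Kafka topic with a 10-minute watermark.
# The watermark bounds state: events arriving more than 10 minutes late are
# dropped rather than reopening already-closed windows.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "clicks")                        # assumed topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

counts = (
    events
    .withWatermark("event_time", "10 minutes")  # event time, not arrival time
    .groupBy(F.window("event_time", "5 minutes"), "user_id")
    .count()
)

query = counts.writeStream.outputMode("append").format("console").start()
```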

What's the difference between a data engineer and an analytics engineer?

Data engineers own the platform: ingestion, infra, pipelines, schemas, and the warehouse itself. Analytics engineers (a dbt-coined role) own the transformation layer: turning raw tables into clean, modeled, tested business datasets that analysts use. Data engineering interviews test infra and pipeline skills more deeply; analytics engineering interviews focus on SQL, dimensional modeling, and dbt.

Why are Iceberg / Delta / Hudi a big deal?

These open table formats add ACID transactions, schema evolution, time travel, and hidden partitioning to Parquet files on object storage - the things warehouses had and lakes didn't. They enable the 'lakehouse' pattern: BI, ML, and streaming all read the same files, with no separate warehouse copy. Iceberg's engine neutrality (Spark / Trino / Snowflake / BigQuery / DuckDB can all read it) is why it's become the dominant format in 2026.
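
A short sketch of what that looks like in practice with Iceberg from Spark SQL; the demo catalog, namespace, and snapshot id are assumptions, and the Iceberg Spark runtime must be on the classpath:

```python
from pyspark.sql import SparkSession

# Assumes a session already configured with an Iceberg catalog named `demo`
# (spark.sql.catalog.demo = org.apache.iceberg.spark.SparkCatalog, etc.).
spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# ACID table on object storage, with hidden partitioning by day of ts:
spark.sql("""
    CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Schema evolution is a metadata-only change; no data files get rewritten:
spark.sql("ALTER TABLE demo.db.events ADD COLUMN country STRING")

# Time travel (Spark 3.3+): read the table as of an earlier snapshot.
# The snapshot id below is just a placeholder for illustration.
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 4348151240443237323")
```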

How should I think about schema evolution and data contracts?

Adding a nullable / default-valued field is safe (forward + backward compatible). Renaming fields, narrowing types, or removing required fields breaks consumers. Data contracts formalize this: producers declare a versioned schema with an SLA and a breaking-change policy, CI verifies producer compatibility, and downstream teams depend on the contract, not the implementation. Combine this with a schema registry and dbt source freshness checks for end-to-end enforcement.
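
To make the compatibility rule concrete, here's a small sketch using the fastavro library (one choice among several; any Avro implementation resolves schemas the same way). Version 2 adds a nullable field with a default, so a v2 reader can still decode v1 data:

```python
import io
import fastavro

# v1: what the producer originally wrote.
v1 = fastavro.parse_schema({
    "type": "record", "name": "Order", "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

# v2: adds a nullable field WITH a default -> backward compatible.
# Renaming or removing order_id/amount instead would break old data.
v2 = fastavro.parse_schema({
    "type": "record", "name": "Order", "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "coupon", "type": ["null", "string"], "default": None},
    ],
})

# Write a record with v1, then read it back with v2 as the reader schema;
# Avro schema resolution fills the missing field from its default.
buf = io.BytesIO()
fastavro.writer(buf, v1, [{"order_id": "o-1", "amount": 9.99}])
buf.seek(0)
for rec in fastavro.reader(buf, reader_schema=v2):
    print(rec)  # {'order_id': 'o-1', 'amount': 9.99, 'coupon': None}
```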
