Practice data engineering topics including Spark, Kafka, dbt, Airflow, modern warehouses (Snowflake / BigQuery / Redshift), CDC, lakehouse table formats (Iceberg / Delta), and schema evolution.
Modern data engineering interviews touch four layers: ingestion (CDC vs polling, Fivetran / Airbyte / Debezium), storage (warehouse vs lake vs lakehouse, Iceberg / Delta / Hudi), transformation (dbt models / tests / sources, batch vs streaming with Spark / Flink / Beam), and orchestration (Airflow / Dagster / Prefect). On top of those, expect questions on SQL fluency, dimensional modeling (star schema, SCD types), and at least one cloud warehouse (Snowflake, BigQuery, Redshift, or Databricks SQL).
Most modern teams use both Spark and dbt: Spark (or Flink) for big-data and streaming transformations, dbt for SQL-native warehouse modeling. Newer 'small data' teams often skip Spark entirely and run everything in Snowflake/BigQuery via dbt; older teams may run Spark-heavy lakes without dbt. Knowing both makes you flexible; if forced to pick one, dbt has a wider hiring footprint in 2026.
How much streaming you need to know depends on the company. Most analytics-focused teams (BI dashboards, finance reporting, attribution) work in batch / micro-batch and rarely need true streaming. Streaming matters for ad tech, fraud detection, IoT, and real-time personalization - if you're interviewing there, expect Kafka deep dives (partitions, consumer groups, exactly-once), event-time vs processing-time, watermarks, and stateful streaming with Flink or Kafka Streams.
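To make event-time windows and watermarks concrete, here is a minimal plain-Python sketch. It is not the Flink or Kafka Streams API; the function name, the 10-unit tumbling window, and the "max event time seen minus allowed lateness" watermark rule are all assumptions chosen for illustration.

```python
def count_by_event_time(event_times, window_size=10, allowed_lateness=5):
    """Count events per tumbling event-time window.

    event_times is the list of event timestamps in ARRIVAL order
    (processing-time order), so it may be out of order in event time.
    Assumed watermark rule: max event time seen so far minus allowed_lateness.
    """
    open_windows = {}            # window start -> running count
    closed = []                  # (window start, final count), in closing order
    late = []                    # events whose window was already closed
    watermark = float("-inf")

    for t in event_times:
        start = (t // window_size) * window_size
        if start + window_size <= watermark:
            # The watermark already passed this window's end: too late.
            late.append(t)
            continue
        open_windows[start] = open_windows.get(start, 0) + 1
        watermark = max(watermark, t - allowed_lateness)
        # Emit every window whose end is at or below the watermark.
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                closed.append((s, open_windows.pop(s)))
    return closed, late, open_windows


# Event 8 arrives after 12 but is still accepted; event 3 arrives too late.
closed, late, still_open = count_by_event_time([1, 4, 12, 8, 17, 3, 26])
print(closed, late, still_open)
```

The trade-off to be able to explain in an interview: a larger allowed lateness drops fewer out-of-order events but delays every window's result.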
Data engineers own the platform: ingestion, infra, pipelines, schemas, and the warehouse itself. Analytics engineers (a dbt-coined role) own the transformation layer: turning raw tables into clean, modeled, tested business datasets that analysts use. Data engineering interviews test infra and pipeline skills more deeply; analytics engineering interviews focus on SQL, dimensional modeling, and dbt.
Open table formats such as Iceberg and Delta add ACID transactions, schema evolution, time travel, and (in Iceberg's case) hidden partitioning to Parquet files on object storage - the things warehouses had and lakes didn't. They enable the 'lakehouse' pattern: BI, ML, and streaming all read the same files, with no separate warehouse copy. Iceberg's engine-neutrality (Spark / Trino / Snowflake / BigQuery / DuckDB all read it) is the dominant force in 2026.
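The idea behind atomic commits and time travel can be sketched with a toy snapshot log. This is a conceptual illustration only, not the Iceberg or Delta API: the ToyTable class, its methods, and the file names are hypothetical, and real formats keep this metadata in files on object storage rather than in memory.

```python
class ToyTable:
    """Toy table: every commit appends an immutable snapshot of data files."""

    def __init__(self):
        self.snapshots = [frozenset()]   # snapshot 0 is the empty table

    @property
    def current_id(self):
        return len(self.snapshots) - 1

    def commit(self, base_id, add=(), remove=()):
        # Optimistic concurrency: the commit only succeeds if nobody else
        # committed since the writer read snapshot base_id.
        if base_id != self.current_id:
            raise RuntimeError("conflict: table changed since base snapshot")
        files = (self.snapshots[base_id] - frozenset(remove)) | frozenset(add)
        self.snapshots.append(files)     # the single atomic "pointer swap"
        return self.current_id

    def files(self, snapshot_id=None):
        # Time travel: read any earlier snapshot by id.
        if snapshot_id is None:
            snapshot_id = self.current_id
        return self.snapshots[snapshot_id]


table = ToyTable()
s1 = table.commit(0, add=["a.parquet"])
s2 = table.commit(s1, add=["b.parquet"], remove=["a.parquet"])  # rewrite
print(sorted(table.files()))     # current state
print(sorted(table.files(s1)))   # state as of snapshot s1
```

Readers never see a half-written table because they only ever resolve a complete snapshot, and old snapshots stay readable until they are expired.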
Adding a nullable / default-valued field is safe (forward + backward compatible). Renaming, narrowing types, or removing required fields breaks consumers. Data contracts formalize this: producers declare a versioned schema with an SLA and breaking-change policy, CI verifies that producer changes stay compatible, and downstream teams depend on the contract, not the implementation. Combine with a schema registry and dbt source freshness for end-to-end reliability.
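The compatibility rules above can be turned into a small check of the kind a CI job might run. This is a sketch under stated assumptions, not a schema registry's actual API: the Field class, the breaking_changes function, the type names, and the SAFE_WIDENINGS set are all hypothetical, and flagging a newly added required field without a default is an extra rule added here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    type: str
    nullable: bool = False
    has_default: bool = False


# Assumed set of type changes treated as safe widenings.
SAFE_WIDENINGS = {("int", "long"), ("int", "double"), ("float", "double")}


def breaking_changes(old, new):
    """Compare two schemas (dicts of name -> Field); return breaking changes."""
    problems = []
    for name, old_field in old.items():
        if name not in new:
            # A rename shows up here as a removal plus an addition.
            if not (old_field.nullable or old_field.has_default):
                problems.append(f"removed required field '{name}'")
            continue
        new_type = new[name].type
        if new_type != old_field.type and (old_field.type, new_type) not in SAFE_WIDENINGS:
            problems.append(
                f"incompatible type change on '{name}': {old_field.type} -> {new_type}"
            )
    for name, new_field in new.items():
        if name not in old and not (new_field.nullable or new_field.has_default):
            problems.append(f"added required field '{name}' without a default")
    return problems


v1 = {"id": Field("long"), "amount": Field("double"), "note": Field("string", nullable=True)}
v2_safe = {**v1, "currency": Field("string", has_default=True)}
v2_renamed = {"order_id": Field("long"), "amount": v1["amount"], "note": v1["note"]}
v2_narrowed = {**v1, "id": Field("int")}

print(breaking_changes(v1, v2_safe))
print(breaking_changes(v1, v2_renamed))
print(breaking_changes(v1, v2_narrowed))
```

A CI step could fail the build whenever the returned list is non-empty, unless the contract's version was bumped according to its breaking-change policy.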