Data Engineering Interview Questions

Key Insights

  • Data engineering interviews test depth across SQL, system design, and distributed computing—surface-level knowledge won’t cut it when interviewers dig into query optimization or failure handling strategies.
  • The best candidates demonstrate not just technical skills but also the ability to reason about trade-offs: normalized vs. denormalized schemas, batch vs. streaming, and cost vs. performance.
  • Behavioral questions matter more than many candidates expect; your ability to explain how you debugged a production pipeline failure or collaborated with stakeholders often determines hiring decisions.

SQL and Query Optimization

SQL remains the foundation of data engineering interviews. Expect questions that go beyond basic SELECT statements into complex joins, window functions, and performance analysis.

A common interview pattern involves deduplication with ranking:

-- Find the most recent order per customer, handling ties
WITH ranked_orders AS (
    SELECT 
        customer_id,
        order_id,
        order_date,
        total_amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id 
            ORDER BY order_date DESC, order_id DESC
        ) AS rn
    FROM orders
    WHERE order_status = 'completed'
)
SELECT 
    customer_id,
    order_id,
    order_date,
    total_amount
FROM ranked_orders
WHERE rn = 1;

Interviewers often follow up by asking about the difference between ROW_NUMBER(), RANK(), and DENSE_RANK(). Know these cold.
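
A quick way to internalize the difference is to run all three side by side. This sketch uses SQLite's window functions (available since 3.25) with toy data, not the article's schema:

```python
import sqlite3

# Toy data with a tie on order_date to expose the difference between
# the three ranking functions
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-02"), (4, "2024-01-03")],
)

rows = conn.execute("""
    SELECT order_id,
           ROW_NUMBER() OVER w AS row_num,   -- always unique
           RANK()       OVER w AS rnk,       -- ties share a rank, next rank skips
           DENSE_RANK() OVER w AS dense_rnk  -- ties share a rank, no gaps
    FROM orders
    WINDOW w AS (ORDER BY order_date)
""").fetchall()

for r in rows:
    print(r)
```

On the tied date, RANK leaves a gap (1, 2, 2, 4) while DENSE_RANK does not (1, 2, 2, 3); ROW_NUMBER stays unique, which is exactly why the deduplication query above uses it.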

Running totals and cumulative calculations appear frequently:

-- Calculate running total of daily revenue with 7-day moving average
SELECT 
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY order_date) AS cumulative_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM daily_sales;

When asked about query optimization, walk through your approach systematically. Start with EXPLAIN ANALYZE:

EXPLAIN ANALYZE
SELECT c.customer_name, COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_id, c.customer_name;

Look for sequential scans on large tables, high row estimates vs. actual rows (indicating stale statistics), and nested loop joins where hash joins would be more efficient. Discuss indexing strategies, but acknowledge that indexes have write-performance costs.
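
One follow-up worth anticipating on this query: the WHERE filter on o.order_date discards the NULL rows a LEFT JOIN produces, silently turning it into an inner join. A minimal SQLite repro, with toy tables standing in for the article's customers and orders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INT, customer_name TEXT);
    CREATE TABLE orders (order_id INT, customer_id INT, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, '2024-03-01');  -- Grace has no orders
""")

# WHERE on the right-hand table: Grace's NULL order_date fails the filter
where_rows = conn.execute("""
    SELECT c.customer_name, COUNT(o.order_id)
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01'
    GROUP BY c.customer_id, c.customer_name
""").fetchall()

# Predicate moved into ON: the join stays a true LEFT JOIN
on_rows = conn.execute("""
    SELECT c.customer_name, COUNT(o.order_id)
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
                       AND o.order_date >= '2024-01-01'
    GROUP BY c.customer_id, c.customer_name
""").fetchall()

print(where_rows)  # Grace is dropped entirely
print(on_rows)     # Grace kept with a count of 0
```

If the interviewer asks whether the LEFT JOIN in the EXPLAIN example is really a left join, this is the answer they are fishing for.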

Data Modeling and Schema Design

Dimensional modeling questions test your understanding of analytical database design. Be prepared to explain when you’d choose a star schema over a snowflake schema—star schemas are generally preferred for query simplicity and performance, while snowflake schemas reduce storage through normalization.

Here’s a star schema implementation you should be able to write from memory:

-- Dimension table with surrogate key
CREATE TABLE dim_customer (
    customer_sk SERIAL PRIMARY KEY,
    customer_id VARCHAR(50) NOT NULL,
    customer_name VARCHAR(255),
    email VARCHAR(255),
    segment VARCHAR(50),
    effective_date DATE NOT NULL,
    expiration_date DATE,
    is_current BOOLEAN DEFAULT TRUE
);

-- Fact table referencing dimensions
CREATE TABLE fact_orders (
    order_sk SERIAL PRIMARY KEY,
    customer_sk INT REFERENCES dim_customer(customer_sk),
    product_sk INT REFERENCES dim_product(product_sk),
    date_sk INT REFERENCES dim_date(date_sk),
    order_quantity INT,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(12,2)
);

Slowly Changing Dimensions (SCD) questions are interview favorites. Type 2 SCD preserves history:

-- SCD Type 2: Close existing record and insert new one
UPDATE dim_customer
SET expiration_date = CURRENT_DATE - 1,
    is_current = FALSE
WHERE customer_id = 'CUST-001' 
  AND is_current = TRUE;

INSERT INTO dim_customer (customer_id, customer_name, email, segment, effective_date, is_current)
VALUES ('CUST-001', 'John Smith', 'john.new@email.com', 'Premium', CURRENT_DATE, TRUE);

Know the trade-offs: Type 1 overwrites history (simple but loses data), Type 2 preserves history (complex queries but full auditability), Type 3 stores limited history in additional columns.
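
Type 3 is the variant candidates sketch least often. A minimal illustration with an in-memory SQLite table; the prior_segment column name is hypothetical, chosen to mirror the dim_customer example above:

```python
import sqlite3

# SCD Type 3 keeps limited history in extra columns: the current value
# plus the immediately prior one
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id TEXT PRIMARY KEY,
        segment TEXT,
        prior_segment TEXT
    )
""")
conn.execute("INSERT INTO dim_customer VALUES ('CUST-001', 'Standard', NULL)")

# On change, shift the current value into prior_segment, then overwrite
conn.execute("""
    UPDATE dim_customer
    SET prior_segment = segment,
        segment = 'Premium'
    WHERE customer_id = 'CUST-001'
""")

print(conn.execute("SELECT * FROM dim_customer").fetchall())
# Only one step of history survives: a second change would discard 'Standard'
```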

ETL/ELT Pipeline Design

Pipeline design questions assess your ability to build reliable, maintainable data systems. Idempotency is non-negotiable—running a pipeline twice should produce the same result.

import hashlib
from datetime import datetime
from typing import Optional
import logging

class ETLPipeline:
    def __init__(self, source_conn, target_conn):
        self.source = source_conn
        self.target = target_conn
        self.logger = logging.getLogger(__name__)
    
    def extract_with_checkpointing(self, table: str, checkpoint_col: str, 
                                    last_checkpoint: Optional[datetime] = None):
        """Incremental extraction with checkpoint tracking.

        table and checkpoint_col are interpolated directly into the SQL,
        so they must come from a trusted whitelist, never from user input.
        """
        query = f"""
            SELECT * FROM {table}
            WHERE {checkpoint_col} > %s
            ORDER BY {checkpoint_col}
        """
        checkpoint = last_checkpoint or datetime.min
        
        try:
            return self.source.execute(query, (checkpoint,))
        except Exception as e:
            self.logger.error(f"Extraction failed: {e}")
            raise
    
    def load_idempotent(self, records: list, target_table: str, key_columns: list):
        """Upsert pattern for idempotent loads."""
        for record in records:
            record_hash = self._compute_hash(record, key_columns)
            
            # Check whether the record exists and has changed; assumes the
            # target store persists the computed hash in a '_hash' column
            existing = self.target.get_by_keys(target_table, record, key_columns)
            
            if existing is None:
                self.target.insert(target_table, record)
            elif existing['_hash'] != record_hash:
                self.target.update(target_table, record, key_columns)
            # Skip unchanged records
    
    def _compute_hash(self, record: dict, key_columns: list) -> str:
        # MD5 is acceptable here: change detection, not a security boundary
        data = {k: v for k, v in record.items() if k not in key_columns}
        return hashlib.md5(str(sorted(data.items())).encode()).hexdigest()

For orchestration, understand DAG design principles:

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor
from datetime import datetime, timedelta

default_args = {
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'retry_exponential_backoff': True,
}

with DAG(
    'daily_sales_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False,
    max_active_runs=1,
) as dag:
    
    # By default the sensor matches on execution date, so this assumes
    # source_ingestion runs on the same daily schedule
    wait_for_source = ExternalTaskSensor(
        task_id='wait_for_source_data',
        external_dag_id='source_ingestion',
        timeout=3600,
        mode='reschedule',  # free the worker slot between pokes
    )
    
    extract = PythonOperator(
        task_id='extract_sales_data',
        python_callable=extract_sales,
    )
    
    transform = PythonOperator(
        task_id='transform_sales_data',
        python_callable=transform_sales,
    )
    
    wait_for_source >> extract >> transform

Big Data Processing Fundamentals

Distributed computing questions probe your understanding of how data moves across clusters. Partitioning strategy directly impacts performance:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month

spark = SparkSession.builder.appName("SalesProcessing").getOrCreate()

# Read with predicate pushdown
sales_df = spark.read.parquet("s3://data-lake/raw/sales/") \
    .filter(col("order_date") >= "2024-01-01")

# Repartition by the grouping column so each key is colocated before
# aggregation; note that a dominant category can itself cause skew,
# in which case salting the hot keys is the usual fix
aggregated = sales_df \
    .repartition(200, col("product_category")) \
    .groupBy("product_category", "region") \
    .agg({"total_amount": "sum", "order_id": "count"})

# Write with partitioning for efficient downstream queries
aggregated.write \
    .partitionBy("region") \
    .mode("overwrite") \
    .parquet("s3://data-lake/processed/sales_summary/")

Understand when shuffles occur (joins, groupBy, repartition) and how to minimize them. Know the difference between repartition() (full shuffle) and coalesce() (no shuffle, only reduces partitions).
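
Salting is a common answer to the skew follow-up. Here is a Spark-free sketch of the idea in plain Python, showing the two-stage aggregate: a partial sum on the salted key, then a combine step that strips the salt:

```python
import random
from collections import Counter

# Key salting: append a random salt to a hot key so its rows spread
# across partitions, then aggregate in two steps
rows = [("electronics", 1)] * 90 + [("books", 1)] * 10  # skewed key
NUM_SALTS = 4

# Stage 1: partial aggregation on the salted key
partial = Counter()
for key, amount in rows:
    salted = (key, random.randrange(NUM_SALTS))
    partial[salted] += amount

# Stage 2: strip the salt and combine the partials
final = Counter()
for (key, _salt), amount in partial.items():
    final[key] += amount

print(dict(final))  # {'electronics': 90, 'books': 10}
```

In Spark the same shape appears as a groupBy on the salted key followed by a second groupBy on the original key; the totals are unchanged, but no single partition has to hold every "electronics" row.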

For streaming, understand basic consumer patterns:

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'events',
    bootstrap_servers=['kafka:9092'],
    group_id='event-processor',
    auto_offset_reset='earliest',
    enable_auto_commit=False,
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

for message in consumer:
    try:
        process_event(message.value)
        consumer.commit()  # commit only after success: at-least-once delivery
    except Exception as e:
        # Dead letter queue pattern: park the failed message, then commit
        # so one bad record doesn't block the partition
        send_to_dlq(message, str(e))
        consumer.commit()

Data Quality and Testing

Data quality questions reveal whether you’ve operated production systems. Great Expectations is a widely used validation framework:

import great_expectations as gx

context = gx.get_context()

# Define expectations programmatically
validator = context.sources.pandas_default.read_csv("orders.csv")

validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_be_between("total_amount", min_value=0)
validator.expect_column_values_to_match_regex("email", r"^[\w\.-]+@[\w\.-]+\.\w+$")

results = validator.validate()

For dbt projects, tests are declarative:

# models/schema.yml
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_total
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

Cloud Data Platforms and Architecture

Know the trade-offs between major platforms. Snowflake offers simplicity and separation of compute/storage. BigQuery excels at ad-hoc queries with serverless pricing. Redshift integrates tightly with AWS services.

Infrastructure-as-code for a basic data lake:

# Terraform for AWS data lake foundation
resource "aws_s3_bucket" "data_lake" {
  bucket = "company-data-lake-${var.environment}"
}

resource "aws_s3_bucket_lifecycle_configuration" "tiering" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # Empty filter = apply to every object; recent AWS provider
    # versions expect a filter or prefix block on each rule
    filter {}

    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 365
      storage_class = "GLACIER"
    }
  }
}

resource "aws_glue_catalog_database" "analytics" {
  name = "analytics_${var.environment}"
}

Behavioral and Problem-Solving Scenarios

Technical skills get you to the final round; behavioral questions determine the offer. Use the STAR method (Situation, Task, Action, Result) but keep responses under two minutes.

Common scenarios include: “Tell me about a time a pipeline failed in production.” Describe the incident, your debugging process, the fix, and—critically—what you did to prevent recurrence.

“How do you handle conflicting requirements from stakeholders?” Show that you gather context, quantify trade-offs, and drive toward decisions rather than endless discussion.

“Describe a data quality issue you discovered and resolved.” Demonstrate systematic thinking: how you detected the issue, root cause analysis, immediate fix, and long-term prevention.

Prepare three to four stories covering failure handling, cross-team collaboration, ambiguous requirements, and technical leadership. Map each story to multiple question types—one good incident response story can answer questions about debugging, communication, and process improvement.
