Data Lake Architecture with Apache Spark

Key Insights

  • The medallion architecture (Bronze/Silver/Gold layers) provides a proven pattern for organizing data lakes that balances raw data preservation with query performance and data quality
  • Delta Lake or Apache Iceberg are non-negotiable for production data lakes—plain Parquet lacks the ACID transactions and schema evolution capabilities you’ll inevitably need
  • Spark’s unified batch and streaming APIs let you build pipelines that work identically whether processing historical data or real-time events, reducing architectural complexity significantly

Why Data Lakes Still Matter

Data warehouses are excellent for structured, well-defined analytical workloads. But they fall apart when you need to store raw event streams, unstructured documents, or data whose schema you don’t fully understand yet. Data lakes solve this by embracing schema-on-read: store everything first, figure out the structure later.

Apache Spark has become the de facto processing engine for data lakes because it handles both batch and streaming workloads, scales horizontally, and integrates with virtually every storage system. More importantly, Spark’s DataFrame API provides a consistent programming model whether you’re processing gigabytes or petabytes.

The modern data lake isn’t just a dumping ground for files. It’s a structured system with clear data zones, governance controls, and optimized query patterns. Let’s build one.

Data Lake Storage Layers

The medallion architecture organizes your data lake into three distinct layers, each serving a specific purpose.

Bronze layer contains raw, unprocessed data exactly as it arrived from source systems. Never transform data here—you want a complete audit trail. This layer uses append-only writes and retains data indefinitely or according to compliance requirements.

Silver layer holds cleaned, validated, and deduplicated data. You’ll apply schema enforcement, handle null values, and join related datasets. This is where most of your ETL logic lives.

Gold layer contains business-level aggregates and curated datasets optimized for specific use cases. Think summary tables, feature stores for ML, or pre-computed metrics for dashboards.

Here’s how to configure Spark for this architecture:

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder \
    .appName("DataLakePipeline") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.databricks.delta.schema.autoMerge.enabled", "true")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Define layer paths with consistent partitioning strategy
BRONZE_PATH = "s3://data-lake/bronze/events"
SILVER_PATH = "s3://data-lake/silver/events"
GOLD_PATH = "s3://data-lake/gold/daily_metrics"

# Write to bronze with date partitioning for efficient pruning
# (raw_events is a DataFrame previously read from the source system)
raw_events.write \
    .format("delta") \
    .mode("append") \
    .partitionBy("event_date") \
    .option("mergeSchema", "true") \
    .save(BRONZE_PATH)

Partition by columns you’ll frequently filter on—typically date fields. Over-partitioning creates too many small files; under-partitioning forces full scans. Aim for partition sizes between 100MB and 1GB.
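You can sanity-check a partitioning scheme with back-of-envelope arithmetic before writing anything. The helper below is purely illustrative (not part of any Spark API): it classifies the average partition size against the 100MB-1GB target.

```python
# Rough partition-sizing sanity check (illustrative helper, not a Spark API).
# Flags schemes whose average partition size falls outside the 100MB-1GB sweet spot.

def check_partition_sizing(total_bytes: int, n_partitions: int,
                           lo: int = 100 * 1024**2, hi: int = 1024**3) -> str:
    """Classify an average partition size against the 100MB-1GB target."""
    avg = total_bytes / n_partitions
    if avg < lo:
        return "over-partitioned"   # many small files, slow file listings
    if avg > hi:
        return "under-partitioned"  # full scans within huge partitions
    return "ok"

# 1TB of events split into 365 daily partitions: ~2.8GB each
print(check_partition_sizing(1024**4, 365))   # under-partitioned
# Same data split by day AND hour (8760 partitions): ~125MB each
print(check_partition_sizing(1024**4, 8760))  # ok
```

Run this kind of estimate whenever you consider adding a partition column: an extra column multiplies the partition count, and the math quickly shows whether that pushes you into small-file territory.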

Data Ingestion Patterns with Spark

Batch ingestion handles historical loads and scheduled refreshes. Streaming ingestion processes data continuously as it arrives. Spark’s Structured Streaming uses the same DataFrame API for both, which means your transformation logic stays identical.

The auto-loader pattern automatically detects and processes new files as they land in cloud storage. Note that the cloudFiles source used below is Databricks Auto Loader; on open-source Spark, a plain readStream with a file source pointed at the landing path achieves a similar effect:

from pyspark.sql.functions import current_timestamp, input_file_name

# Streaming ingestion with auto-loader
raw_stream = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "s3://data-lake/schemas/events") \
    .option("cloudFiles.inferColumnTypes", "true") \
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
    .load("s3://landing-zone/events/")

# Add ingestion metadata
enriched_stream = raw_stream \
    .withColumn("_ingested_at", current_timestamp()) \
    .withColumn("_source_file", input_file_name())

# Write to bronze layer with checkpointing
query = enriched_stream.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://data-lake/checkpoints/events") \
    .option("mergeSchema", "true") \
    .partitionBy("event_date") \
    .trigger(availableNow=True) \
    .start(BRONZE_PATH)  # Process all available files, then stop

query.awaitTermination()

The availableNow trigger is useful for scheduled batch jobs that should process all accumulated files then exit. For true real-time processing, use processingTime="10 seconds" instead.

Always include _ingested_at and _source_file metadata columns. You’ll thank yourself during debugging and data quality investigations.

Data Transformation and Processing

The silver layer is where transformation logic lives. Use Delta Lake’s merge operation for upserts—it handles the complexity of matching existing records and applying updates atomically.

from delta.tables import DeltaTable
from pyspark.sql.functions import col, when, coalesce, lit, sha2, concat_ws

# Read from bronze
bronze_df = spark.read.format("delta").load(BRONZE_PATH) \
    .filter(col("event_date") >= "2024-01-01")

# Apply transformations
silver_df = bronze_df \
    .dropDuplicates(["event_id"]) \
    .withColumn("user_id", 
        when(col("user_id").isNull(), "anonymous")
        .otherwise(col("user_id"))) \
    .withColumn("event_value", 
        coalesce(col("event_value"), lit(0.0))) \
    .withColumn("row_hash", 
        sha2(concat_ws("||", col("event_id"), col("user_id"), col("event_value")), 256))

# Merge into silver layer (upsert pattern)
if DeltaTable.isDeltaTable(spark, SILVER_PATH):
    silver_table = DeltaTable.forPath(spark, SILVER_PATH)
    
    silver_table.alias("target") \
        .merge(
            silver_df.alias("source"),
            "target.event_id = source.event_id"
        ) \
        .whenMatchedUpdateAll(
            condition="target.row_hash != source.row_hash"
        ) \
        .whenNotMatchedInsertAll() \
        .execute()
else:
    silver_df.write.format("delta").partitionBy("event_date").save(SILVER_PATH)

# Data quality check
quality_metrics = spark.sql(f"""
    SELECT 
        COUNT(*) as total_records,
        COUNT(DISTINCT event_id) as unique_events,
        SUM(CASE WHEN user_id = 'anonymous' THEN 1 ELSE 0 END) as anonymous_events,
        SUM(CASE WHEN event_value < 0 THEN 1 ELSE 0 END) as negative_values
    FROM delta.`{SILVER_PATH}`
    WHERE event_date = current_date()
""")

quality_metrics.show()

The row hash comparison prevents unnecessary updates when data hasn’t actually changed, reducing write amplification and improving performance.
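The hash itself is just SHA-256 over the concatenated business columns. A pure-Python sketch of the same computation (mirroring the concat_ws/sha2 calls above) shows why it detects changes:

```python
import hashlib

def row_hash(*cols) -> str:
    """SHA-256 over columns joined with '||', mirroring
    sha2(concat_ws('||', ...), 256) in the Spark pipeline above."""
    return hashlib.sha256("||".join(str(c) for c in cols).encode("utf-8")).hexdigest()

old = row_hash("evt-1", "user_123", 9.99)
unchanged = row_hash("evt-1", "user_123", 9.99)
changed = row_hash("evt-1", "user_123", 10.49)

assert old == unchanged  # identical rows hash identically: merge skips the update
assert old != changed    # any column change flips the hash: merge applies the update
```

One caveat: Spark's concat_ws silently drops null elements, so two rows that differ only in which column is null can hash identically. Coalesce nulls to sentinel values before hashing if that matters for your data.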

Query and Access Patterns

Gold layer queries should be fast. Apply these optimizations aggressively:

# Create optimized gold table with Z-ordering
spark.sql("CREATE DATABASE IF NOT EXISTS gold")
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS gold.daily_user_metrics
    USING DELTA
    LOCATION '{GOLD_PATH}'
    PARTITIONED BY (metric_date)
    AS SELECT
        date_trunc('day', event_timestamp) as metric_date,
        user_id,
        COUNT(*) as event_count,
        SUM(event_value) as total_value,
        COUNT(DISTINCT session_id) as session_count
    FROM delta.`{SILVER_PATH}`
    GROUP BY 1, 2
""")

# Optimize for common query patterns
spark.sql(f"""
    OPTIMIZE gold.daily_user_metrics
    ZORDER BY (user_id)
""")

# Query with partition pruning and predicate pushdown
optimized_query = spark.sql("""
    SELECT 
        user_id,
        SUM(event_count) as total_events,
        AVG(total_value) as avg_value
    FROM gold.daily_user_metrics
    WHERE metric_date BETWEEN '2024-01-01' AND '2024-03-31'
      AND user_id IN ('user_123', 'user_456', 'user_789')
    GROUP BY user_id
""")

# Verify pushdown is working
optimized_query.explain(True)

Z-ordering co-locates related data physically, dramatically improving query performance when filtering on those columns. Apply it to columns you frequently filter or join on, but not partition columns (they’re already optimized).
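Under the hood, OPTIMIZE ZORDER maps each row's column values onto a space-filling (Z-order/Morton) curve and sorts files along it, so rows that are close in several dimensions land in the same files. A toy two-dimensional Morton encoding illustrates the bit-interleaving idea; this is a deliberate simplification of what Delta actually does:

```python
def morton_2d(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y into a single Z-order key.
    Points near each other in (x, y) tend to get nearby keys,
    so sorting by the key clusters them together on disk."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Neighbouring points get adjacent keys; distant points do not
print(morton_2d(2, 3))  # 14
print(morton_2d(3, 3))  # 15
print(morton_2d(200, 200))
```

Because a filter on either dimension maps to a small number of contiguous key ranges, a query touching one user or one value range can skip most files entirely.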

Governance and Metadata Management

A data lake without a catalog is just a file system. Register your tables properly:

# Register tables in the catalog
spark.sql("CREATE DATABASE IF NOT EXISTS bronze")
spark.sql("CREATE DATABASE IF NOT EXISTS silver")
spark.sql("CREATE DATABASE IF NOT EXISTS gold")

spark.sql(f"""
    CREATE TABLE IF NOT EXISTS silver.events
    USING DELTA
    LOCATION '{SILVER_PATH}'
""")

# Add table properties for documentation
spark.sql("""
    ALTER TABLE silver.events 
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true',
        'owner' = 'data-engineering',
        'pii_columns' = 'user_id,email',
        'retention_days' = '365'
    )
""")

# Query table history for lineage
spark.sql("DESCRIBE HISTORY silver.events").show(truncate=False)

Operational Considerations

Cluster sizing: Start with fewer, larger executors rather than many small ones. Memory overhead per executor is fixed, so consolidation improves efficiency. A good starting point is 4-8 cores and 32GB memory per executor.
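The sizing arithmetic is worth making explicit. The sketch below assumes hypothetical 16-core/128GB worker nodes and a roughly 10% per-executor memory overhead (Spark's default overhead factor); adjust both for your hardware:

```python
def executors_per_node(node_cores: int, node_mem_gb: int,
                       exec_cores: int = 4, exec_mem_gb: int = 32,
                       overhead_frac: float = 0.10) -> int:
    """How many executors of a given shape fit on one worker node,
    accounting for per-executor memory overhead (~10% by default)."""
    mem_with_overhead = exec_mem_gb * (1 + overhead_frac)
    return min(node_cores // exec_cores, int(node_mem_gb // mem_with_overhead))

# Hypothetical 16-core / 128GB nodes with 4-core / 32GB executors:
print(executors_per_node(16, 128))  # 3 (memory-bound once overhead is counted)
```

Note how overhead makes the node memory-bound before it is core-bound: the fixed per-executor cost is exactly why fewer, larger executors come out ahead.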

File compaction: Small files kill performance. Schedule regular OPTIMIZE jobs, or enable auto-compaction. Target file sizes of 128MB-1GB.

Cost control: Use spot instances for batch workloads; savings of 60-90% over on-demand pricing are typical. Set up auto-termination for idle clusters. Monitor shuffle spill: it indicates memory pressure and wasted I/O.

Common pitfalls: Don’t use collect() on large datasets. Avoid UDFs when built-in functions exist—they break optimization. Always specify schemas explicitly for production pipelines rather than relying on inference.

The data lake architecture I’ve outlined here scales from startup to enterprise. Start with the medallion pattern, use Delta Lake from day one, and invest in proper cataloging early. The technical debt from skipping these foundations compounds quickly as data volumes grow.
