Medallion Architecture (Bronze/Silver/Gold) Explained
Key Insights
- Medallion architecture provides a structured approach to data quality, with each layer (Bronze → Silver → Gold) progressively refining raw data into analytics-ready assets
- The pattern shines in enterprise environments requiring audit trails, data lineage, and regulatory compliance—but adds unnecessary complexity for simple analytics use cases
- Success depends on treating each layer as a contract: Bronze guarantees completeness, Silver guarantees correctness, and Gold guarantees business relevance
What is Medallion Architecture?
Medallion architecture is a data lakehouse design pattern that organizes data into three distinct layers based on quality and transformation state. Popularized by Databricks, it’s become the de facto standard for teams building modern data platforms on technologies like Delta Lake, Apache Iceberg, and Apache Hudi.
The core idea is simple: data flows through progressive refinement stages. Raw data lands in Bronze, gets cleaned and validated in Silver, then gets aggregated into business-ready datasets in Gold. Each layer serves a specific purpose and maintains different guarantees about data quality.
This isn’t just organizational tidiness. The pattern solves real problems: debugging data issues becomes traceable, reprocessing historical data becomes feasible, and different teams can work at different layers without stepping on each other.
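Before diving into Spark, the three-stage flow can be shown as a toy, framework-free sketch, with plain Python lists and dicts standing in for tables (all names here are illustrative, not part of any real pipeline):

```python
from collections import defaultdict

# Toy records as they might land in Bronze: raw, untrusted, append-only
bronze = [
    {"event_id": "e1", "user_id": "u1", "event_type": "view"},
    {"event_id": "e1", "user_id": "u1", "event_type": "view"},   # duplicate
    {"event_id": "e2", "user_id": None, "event_type": "click"},  # bad record
    {"event_id": "e3", "user_id": "u1", "event_type": "purchase"},
]

# Silver: validate required fields and deduplicate by event_id
seen = set()
silver = []
for rec in bronze:
    if rec["user_id"] is None or rec["event_id"] in seen:
        continue
    seen.add(rec["event_id"])
    silver.append(rec)

# Gold: aggregate per user for direct consumption
gold = defaultdict(lambda: {"events": 0, "purchases": 0})
for rec in silver:
    summary = gold[rec["user_id"]]
    summary["events"] += 1
    summary["purchases"] += rec["event_type"] == "purchase"

print(dict(gold))  # {'u1': {'events': 2, 'purchases': 1}}
```

Each stage only reads the one before it, which is the whole trick: the duplicate and the bad record never reach Gold, but they are still sitting in Bronze if you need to investigate them.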
The Bronze Layer: Raw Data Ingestion
Bronze is your landing zone. Data arrives here in its original form—warts and all. You’re capturing exactly what the source system sent, with minimal transformation.
The key principles for Bronze:
- Append-only writes: Never update or delete. Every record that arrives gets persisted.
- Schema-on-read: Don’t enforce strict schemas. Store data as-is and handle schema evolution downstream.
- Metadata enrichment: Add ingestion timestamps, source identifiers, and batch IDs for lineage tracking.
Here’s a practical PySpark example for ingesting raw event data:
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit, input_file_name
from datetime import datetime

spark = SparkSession.builder \
    .appName("BronzeIngestion") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .getOrCreate()

def ingest_to_bronze(source_path: str, bronze_table: str, source_system: str):
    """
    Ingest raw data to bronze layer with metadata enrichment.
    Preserves original schema, adds audit columns.
    """
    batch_id = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Read raw data without schema enforcement
    raw_df = spark.read \
        .option("mode", "PERMISSIVE") \
        .option("columnNameOfCorruptRecord", "_corrupt_record") \
        .json(source_path)

    # Enrich with ingestion metadata
    enriched_df = raw_df \
        .withColumn("_ingested_at", current_timestamp()) \
        .withColumn("_source_system", lit(source_system)) \
        .withColumn("_batch_id", lit(batch_id)) \
        .withColumn("_source_file", input_file_name())

    # Append to bronze table
    enriched_df.write \
        .format("delta") \
        .mode("append") \
        .option("mergeSchema", "true") \
        .saveAsTable(bronze_table)

    return enriched_df.count()

# Usage
records_ingested = ingest_to_bronze(
    source_path="s3://raw-events/clickstream/2024/01/15/",
    bronze_table="bronze.clickstream_events",
    source_system="web_analytics"
)
Notice the PERMISSIVE mode and corrupt record handling. Bronze should never fail ingestion due to data quality issues—that’s Silver’s job. The mergeSchema option handles evolving source schemas gracefully.
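The permissive philosophy is easy to model outside Spark. This hypothetical `parse_permissive` helper mimics what PERMISSIVE mode does: malformed lines become rows carrying the raw text in `_corrupt_record` rather than failing the whole load:

```python
import json

def parse_permissive(lines):
    """Mimic Spark's PERMISSIVE read mode: a malformed line becomes a
    row with the raw text in _corrupt_record instead of aborting."""
    rows = []
    for line in lines:
        try:
            row = json.loads(line)
            row["_corrupt_record"] = None
        except json.JSONDecodeError:
            row = {"_corrupt_record": line}
        rows.append(row)
    return rows

raw = ['{"event_id": "e1"}', '{broken json', '{"event_id": "e2"}']
rows = parse_permissive(raw)
print(sum(r["_corrupt_record"] is not None for r in rows))  # 1
```

All three input lines produce a row; the broken one is quarantined in `_corrupt_record`, where Silver's quality filter can drop or inspect it later.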
The Silver Layer: Cleaned and Conformed Data
Silver is where data quality enforcement happens. You’re transforming raw data into a reliable, queryable format that downstream consumers can trust.
Silver layer responsibilities:
- Schema enforcement: Apply explicit schemas and data types
- Deduplication: Handle duplicate records from source systems
- Data quality checks: Validate required fields, ranges, and referential integrity
- Standardization: Normalize formats (dates, currencies, identifiers)
from pyspark.sql.functions import col, when, to_timestamp, row_number
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Explicit contract for the silver table; the select below enforces it
silver_schema = StructType([
    StructField("event_id", StringType(), False),
    StructField("user_id", StringType(), False),
    StructField("event_type", StringType(), False),
    StructField("event_timestamp", TimestampType(), False),
    StructField("page_url", StringType(), True),
    StructField("session_id", StringType(), True),
    StructField("device_type", StringType(), True),
])

def transform_to_silver(bronze_table: str, silver_table: str, watermark: str):
    """
    Transform bronze data to silver with quality enforcement.
    Processes incrementally based on ingestion timestamp.
    """
    # Read new bronze records since last run
    bronze_df = spark.table(bronze_table) \
        .filter(col("_ingested_at") > watermark)

    # Apply transformations and quality rules
    # (the raw source exposes the event time in a "timestamp" field)
    cleaned_df = bronze_df \
        .filter(col("_corrupt_record").isNull()) \
        .withColumn("event_timestamp",
                    to_timestamp(col("timestamp"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")) \
        .withColumn("device_type",
                    when(col("device_type").isin("mobile", "desktop", "tablet"), col("device_type"))
                    .otherwise("unknown")) \
        .filter(col("event_id").isNotNull()) \
        .filter(col("user_id").isNotNull()) \
        .filter(col("event_timestamp").isNotNull())

    # Deduplicate by event_id, keeping the latest ingestion
    dedup_window = Window.partitionBy("event_id").orderBy(col("_ingested_at").desc())
    deduped_df = cleaned_df \
        .withColumn("_row_num", row_number().over(dedup_window)) \
        .filter(col("_row_num") == 1) \
        .drop("_row_num", "_corrupt_record")

    # Project onto the silver contract; _ingested_at is kept so the
    # orchestrator can compute the next incremental watermark
    silver_df = deduped_df.select(
        col("event_id"),
        col("user_id"),
        col("event_type"),
        col("event_timestamp"),
        col("page_url"),
        col("session_id"),
        col("device_type"),
        col("_ingested_at")
    )

    # Merge into silver table (upsert pattern)
    silver_df.createOrReplaceTempView("updates")
    spark.sql(f"""
        MERGE INTO {silver_table} AS target
        USING updates AS source
        ON target.event_id = source.event_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

transform_to_silver(
    bronze_table="bronze.clickstream_events",
    silver_table="silver.clickstream_events",
    watermark="2024-01-15T00:00:00"
)
The MERGE operation is crucial here. Unlike Bronze’s append-only pattern, Silver uses upserts to maintain a clean, deduplicated state while handling late-arriving data correctly.
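What MERGE buys you can be sketched in plain Python with a dict keyed by `event_id` (a simplified model of the semantics, not Delta's actual implementation):

```python
def merge_upsert(target, updates, key="event_id"):
    """Model MERGE semantics: matched keys are updated in place,
    unmatched keys are inserted. A late-arriving correction overwrites
    the stale row instead of creating a duplicate."""
    for row in updates:
        target[row[key]] = row  # update if present, insert if not
    return target

silver = {"e1": {"event_id": "e1", "device_type": "mobile"}}
updates = [
    {"event_id": "e1", "device_type": "desktop"},  # late correction
    {"event_id": "e2", "device_type": "tablet"},   # genuinely new event
]
merge_upsert(silver, updates)
print(len(silver), silver["e1"]["device_type"])  # 2 desktop
```

With an append-only table, the same two updates would leave three rows and push the dedup problem onto every reader; the upsert keeps exactly one row per key.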
The Gold Layer: Business-Ready Aggregations
Gold contains curated datasets optimized for specific business use cases. These are the tables your analysts query directly, your dashboards connect to, and your ML models train on.
Gold layer characteristics:
- Pre-aggregated metrics: Reduce query complexity and improve performance
- Domain-specific modeling: Organize around business concepts, not source systems
- Denormalized for consumption: Optimize for read patterns, not storage efficiency
from pyspark.sql.functions import (
    col, when, count, countDistinct,
    min as min_, max as max_, sum as sum_  # aliased to avoid shadowing builtins
)

def build_gold_user_engagement(silver_table: str, gold_table: str):
    """
    Build gold-layer user engagement summary from silver events.
    Aggregates daily metrics per user for dashboard consumption.
    """
    silver_df = spark.table(silver_table)

    engagement_summary = silver_df \
        .withColumn("event_date", col("event_timestamp").cast("date")) \
        .groupBy("user_id", "event_date", "device_type") \
        .agg(
            count("event_id").alias("total_events"),
            countDistinct("session_id").alias("unique_sessions"),
            countDistinct("page_url").alias("pages_viewed"),
            min_("event_timestamp").alias("first_activity"),
            max_("event_timestamp").alias("last_activity"),
            sum_(when(col("event_type") == "purchase", 1).otherwise(0)).alias("purchases"),
            sum_(when(col("event_type") == "add_to_cart", 1).otherwise(0)).alias("cart_additions")
        ) \
        .withColumn("session_depth",
                    col("total_events") / col("unique_sessions")) \
        .withColumn("conversion_rate",
                    col("purchases") / when(col("cart_additions") > 0, col("cart_additions")).otherwise(1))

    # Overwrite only the affected date partitions for idempotent rebuilds
    min_date = engagement_summary.agg(min_("event_date")).collect()[0][0]
    engagement_summary.write \
        .format("delta") \
        .mode("overwrite") \
        .option("replaceWhere", f"event_date >= '{min_date}'") \
        .saveAsTable(gold_table)

build_gold_user_engagement(
    silver_table="silver.clickstream_events",
    gold_table="gold.user_engagement_daily"
)
Gold tables are often rebuilt entirely or by partition. Unlike Silver’s incremental upserts, Gold prioritizes simplicity and correctness over efficiency. If aggregation logic changes, you can recompute from Silver without touching Bronze.
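The rebuild-by-partition idea can be modeled in plain Python: drop one day's rows, recompute them from silver, and observe that running the rebuild twice gives the same result (an illustrative sketch; `rebuild_partition` is a made-up helper, not a Delta API):

```python
from collections import Counter

def rebuild_partition(gold, silver_rows, day):
    """Idempotently rebuild one date partition of a gold aggregate:
    discard the old rows for that day, then recompute from silver."""
    gold = {k: v for k, v in gold.items() if k[1] != day}
    counts = Counter(
        (row["user_id"], row["date"]) for row in silver_rows if row["date"] == day
    )
    gold.update(counts)
    return gold

silver_rows = [
    {"user_id": "u1", "date": "2024-01-15"},
    {"user_id": "u1", "date": "2024-01-15"},
    {"user_id": "u2", "date": "2024-01-16"},
]
gold = rebuild_partition({}, silver_rows, "2024-01-15")
gold = rebuild_partition(gold, silver_rows, "2024-01-15")  # rerun: same result
print(gold)  # {('u1', '2024-01-15'): 2}
```

Because the partition is replaced wholesale rather than incremented, a retried or duplicated run cannot double-count, which is exactly what `replaceWhere` provides at the table level.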
Data Flow and Orchestration
Tying the layers together requires robust orchestration. Here’s a Prefect flow that manages the complete pipeline:
from prefect import flow, task
from prefect.tasks import task_input_hash
from datetime import timedelta

@task(retries=2, cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))
def run_bronze_ingestion(source_path: str, bronze_table: str):
    return ingest_to_bronze(source_path, bronze_table, "web_analytics")

@task(retries=2)
def run_silver_transformation(bronze_table: str, silver_table: str, watermark: str):
    transform_to_silver(bronze_table, silver_table, watermark)

@task(retries=1)
def run_gold_aggregation(silver_table: str, gold_table: str):
    build_gold_user_engagement(silver_table, gold_table)

@task
def get_last_watermark(table: str) -> str:
    # Last bronze ingestion timestamp already processed into this table;
    # fall back to the epoch if the table is still empty
    result = spark.sql(f"SELECT max(_ingested_at) FROM {table}").collect()[0][0]
    return result or "1970-01-01T00:00:00"

@flow(name="medallion-pipeline")
def medallion_etl_flow(source_date: str):
    source_path = f"s3://raw-events/clickstream/{source_date}/"

    # Bronze ingestion
    bronze_count = run_bronze_ingestion(source_path, "bronze.clickstream_events")

    # Silver transformation (depends on bronze)
    watermark = get_last_watermark("silver.clickstream_events")
    run_silver_transformation(
        "bronze.clickstream_events",
        "silver.clickstream_events",
        watermark,
        wait_for=[bronze_count]
    )

    # Gold aggregation (depends on silver)
    run_gold_aggregation(
        "silver.clickstream_events",
        "gold.user_engagement_daily"
    )

# Run for a single day; attach a daily schedule via a Prefect deployment
medallion_etl_flow("2024/01/15")
When to Use (and When Not To)
Medallion architecture excels when:
- You need audit trails and data lineage for compliance
- Multiple teams consume data at different quality levels
- Source systems are unreliable or schemas evolve frequently
- You’re building ML pipelines that require reproducible feature engineering
Consider alternatives when:
- You’re a small team with simple analytics needs (just use dbt on a warehouse)
- Your use case is purely real-time streaming (consider Kappa architecture)
- Storage costs are a primary concern (three copies of data adds up)
Common anti-patterns to avoid:
- Skipping Silver and going Bronze → Gold directly (you’ll regret it during debugging)
- Over-engineering Bronze with complex transformations (keep it raw)
- Building too many Gold tables (consolidate around business domains)
Getting Started
Start small. Pick one data source, implement all three layers, and iterate. Use Delta Lake or Apache Iceberg as the table format; both provide the ACID guarantees and time travel capabilities that make medallion architecture practical.
Your implementation checklist:
- Set up separate schemas/databases for each layer
- Implement Bronze ingestion with metadata enrichment
- Define Silver quality rules and deduplication logic
- Build one Gold table serving a real business need
- Add orchestration with proper dependency management
- Document data contracts between layers
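The last checklist item can start life as executable checks rather than a prose document. Here is a minimal sketch assuming a silver contract like the one defined earlier; `validate_contract` and its rules are hypothetical:

```python
REQUIRED = ("event_id", "user_id", "event_timestamp")
ALLOWED_DEVICES = {"mobile", "desktop", "tablet", "unknown"}

def validate_contract(rows):
    """Return a list of violation messages; an empty list means the
    batch satisfies the silver contract."""
    violations = []
    for i, row in enumerate(rows):
        for field in REQUIRED:
            if row.get(field) is None:
                violations.append(f"row {i}: missing {field}")
        if row.get("device_type") not in ALLOWED_DEVICES:
            violations.append(f"row {i}: bad device_type {row.get('device_type')!r}")
    return violations

good = {"event_id": "e1", "user_id": "u1",
        "event_timestamp": "2024-01-15T00:00:00Z", "device_type": "mobile"}
bad = {"event_id": None, "user_id": "u2",
       "event_timestamp": "2024-01-15T00:00:01Z", "device_type": "tv"}
print(validate_contract([good, bad]))
# ['row 1: missing event_id', "row 1: bad device_type 'tv'"]
```

Running checks like these at the boundary between layers turns the contract from tribal knowledge into something a pipeline can fail loudly on.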
The medallion pattern isn’t revolutionary—it’s organized pragmatism. Each layer has a clear responsibility, failures are isolated, and debugging becomes systematic rather than chaotic.