Medallion Architecture (Bronze/Silver/Gold) Explained
Key Insights
- Medallion architecture provides a structured approach to data quality, with each layer (Bronze → Silver → Gold) progressively refining raw data into analytics-ready assets
- The pattern shines in enterprise environments requiring audit trails, data lineage, and regulatory compliance—but adds unnecessary complexity for simple analytics use cases
- Success depends on treating each layer as a contract: Bronze guarantees completeness, Silver guarantees correctness, and Gold guarantees business relevance
What is Medallion Architecture?
Medallion architecture is a data lakehouse design pattern that organizes data into three distinct layers based on quality and transformation state. Popularized by Databricks, it’s become the de facto standard for teams building modern data platforms on technologies like Delta Lake, Apache Iceberg, and Apache Hudi.
The core idea is simple: data flows through progressive refinement stages. Raw data lands in Bronze, gets cleaned and validated in Silver, then gets aggregated into business-ready datasets in Gold. Each layer serves a specific purpose and maintains different guarantees about data quality.
This isn’t just organizational tidiness. The pattern solves real problems: debugging data issues becomes traceable, reprocessing historical data becomes feasible, and different teams can work at different layers without stepping on each other.
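Before diving into Spark, the three-stage flow can be shown as a toy, framework-free sketch, with plain Python lists and dicts standing in for tables (all names here are illustrative, not part of any real pipeline):

```python
from collections import defaultdict

# Toy records as they might land in Bronze: raw, untrusted, append-only
bronze = [
    {"event_id": "e1", "user_id": "u1", "event_type": "view"},
    {"event_id": "e1", "user_id": "u1", "event_type": "view"},   # duplicate
    {"event_id": "e2", "user_id": None, "event_type": "click"},  # bad record
    {"event_id": "e3", "user_id": "u1", "event_type": "purchase"},
]

# Silver: validate required fields and deduplicate by event_id
seen = set()
silver = []
for rec in bronze:
    if rec["user_id"] is None or rec["event_id"] in seen:
        continue
    seen.add(rec["event_id"])
    silver.append(rec)

# Gold: aggregate per user for direct consumption
gold = defaultdict(lambda: {"events": 0, "purchases": 0})
for rec in silver:
    summary = gold[rec["user_id"]]
    summary["events"] += 1
    summary["purchases"] += rec["event_type"] == "purchase"

print(dict(gold))  # {'u1': {'events': 2, 'purchases': 1}}
```

Each stage only reads the one before it, which is the whole trick: the duplicate and the bad record never reach Gold, but they are still sitting in Bronze if you need to investigate them.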
The Bronze Layer: Raw Data Ingestion
Bronze is your landing zone. Data arrives here in its original form—warts and all. You’re capturing exactly what the source system sent, with minimal transformation.
The key principles for Bronze:
- Append-only writes: Never update or delete. Every record that arrives gets persisted.
- Schema-on-read: Don’t enforce strict schemas. Store data as-is and handle schema evolution downstream.
- Metadata enrichment: Add ingestion timestamps, source identifiers, and batch IDs for lineage tracking.
Here’s a practical PySpark example for ingesting raw event data:
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit, input_file_name
from datetime import datetime

spark = SparkSession.builder \
    .appName("BronzeIngestion") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .getOrCreate()

def ingest_to_bronze(source_path: str, bronze_table: str, source_system: str):
    """
    Ingest raw data to bronze layer with metadata enrichment.
    Preserves original schema, adds audit columns.
    """
    batch_id = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Read raw data without schema enforcement
    raw_df = spark.read \
        .option("mode", "PERMISSIVE") \
        .option("columnNameOfCorruptRecord", "_corrupt_record") \
        .json(source_path)

    # Enrich with ingestion metadata
    enriched_df = raw_df \
        .withColumn("_ingested_at", current_timestamp()) \
        .withColumn("_source_system", lit(source_system)) \
        .withColumn("_batch_id", lit(batch_id)) \
        .withColumn("_source_file", input_file_name())

    # Append to bronze table
    enriched_df.write \
        .format("delta") \
        .mode("append") \
        .option("mergeSchema", "true") \
        .saveAsTable(bronze_table)

    return enriched_df.count()

# Usage
records_ingested = ingest_to_bronze(
    source_path="s3://raw-events/clickstream/2024/01/15/",
    bronze_table="bronze.clickstream_events",
    source_system="web_analytics"
)
Notice the PERMISSIVE mode and corrupt record handling. Bronze should never fail ingestion due to data quality issues—that’s Silver’s job. The mergeSchema option handles evolving source schemas gracefully.
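The permissive philosophy is easy to model outside Spark. This hypothetical `parse_permissive` helper mimics what PERMISSIVE mode does: malformed lines become rows carrying the raw text in `_corrupt_record` rather than failing the whole load:

```python
import json

def parse_permissive(lines):
    """Mimic Spark's PERMISSIVE read mode: a malformed line becomes a
    row with the raw text in _corrupt_record instead of aborting."""
    rows = []
    for line in lines:
        try:
            row = json.loads(line)
            row["_corrupt_record"] = None
        except json.JSONDecodeError:
            row = {"_corrupt_record": line}
        rows.append(row)
    return rows

raw = ['{"event_id": "e1"}', '{broken json', '{"event_id": "e2"}']
rows = parse_permissive(raw)
print(sum(r["_corrupt_record"] is not None for r in rows))  # 1
```

All three input lines produce a row; the broken one is quarantined in `_corrupt_record`, where Silver's quality filter can drop or inspect it later.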
The Silver Layer: Cleaned and Conformed Data
Silver is where data quality enforcement happens. You’re transforming raw data into a reliable, queryable format that downstream consumers can trust.
Silver layer responsibilities:
- Schema enforcement: Apply explicit schemas and data types
- Deduplication: Handle duplicate records from source systems
- Data quality checks: Validate required fields, ranges, and referential integrity
- Standardization: Normalize formats (dates, currencies, identifiers)
from pyspark.sql.functions import col, when, to_timestamp, row_number
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Explicit contract for the silver table; the select below enforces it
silver_schema = StructType([
    StructField("event_id", StringType(), False),
    StructField("user_id", StringType(), False),
    StructField("event_type", StringType(), False),
    StructField("event_timestamp", TimestampType(), False),
    StructField("page_url", StringType(), True),
    StructField("session_id", StringType(), True),
    StructField("device_type", StringType(), True),
])

def transform_to_silver(bronze_table: str, silver_table: str, watermark: str):
    """
    Transform bronze data to silver with quality enforcement.
    Processes incrementally based on ingestion timestamp.
    """
    # Read new bronze records since last run
    bronze_df = spark.table(bronze_table) \
        .filter(col("_ingested_at") > watermark)

    # Apply transformations and quality rules
    # (the raw source exposes the event time in a "timestamp" field)
    cleaned_df = bronze_df \
        .filter(col("_corrupt_record").isNull()) \
        .withColumn("event_timestamp",
                    to_timestamp(col("timestamp"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")) \
        .withColumn("device_type",
                    when(col("device_type").isin("mobile", "desktop", "tablet"), col("device_type"))
                    .otherwise("unknown")) \
        .filter(col("event_id").isNotNull()) \
        .filter(col("user_id").isNotNull()) \
        .filter(col("event_timestamp").isNotNull())

    # Deduplicate by event_id, keeping the latest ingestion
    dedup_window = Window.partitionBy("event_id").orderBy(col("_ingested_at").desc())
    deduped_df = cleaned_df \
        .withColumn("_row_num", row_number().over(dedup_window)) \
        .filter(col("_row_num") == 1) \
        .drop("_row_num", "_corrupt_record")

    # Project onto the silver contract; _ingested_at is kept so the
    # orchestrator can compute the next incremental watermark
    silver_df = deduped_df.select(
        col("event_id"),
        col("user_id"),
        col("event_type"),
        col("event_timestamp"),
        col("page_url"),
        col("session_id"),
        col("device_type"),
        col("_ingested_at")
    )

    # Merge into silver table (upsert pattern)
    silver_df.createOrReplaceTempView("updates")
    spark.sql(f"""
        MERGE INTO {silver_table} AS target
        USING updates AS source
        ON target.event_id = source.event_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

transform_to_silver(
    bronze_table="bronze.clickstream_events",
    silver_table="silver.clickstream_events",
    watermark="2024-01-15T00:00:00"
)
The MERGE operation is crucial here. Unlike Bronze’s append-only pattern, Silver uses upserts to maintain a clean, deduplicated state while handling late-arriving data correctly.
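What MERGE buys you can be sketched in plain Python with a dict keyed by `event_id` (a simplified model of the semantics, not Delta's actual implementation):

```python
def merge_upsert(target, updates, key="event_id"):
    """Model MERGE semantics: matched keys are updated in place,
    unmatched keys are inserted. A late-arriving correction overwrites
    the stale row instead of creating a duplicate."""
    for row in updates:
        target[row[key]] = row  # update if present, insert if not
    return target

silver = {"e1": {"event_id": "e1", "device_type": "mobile"}}
updates = [
    {"event_id": "e1", "device_type": "desktop"},  # late correction
    {"event_id": "e2", "device_type": "tablet"},   # genuinely new event
]
merge_upsert(silver, updates)
print(len(silver), silver["e1"]["device_type"])  # 2 desktop
```

With an append-only table, the same two updates would leave three rows and push the dedup problem onto every reader; the upsert keeps exactly one row per key.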
The Gold Layer: Business-Ready Aggregations
Gold contains curated datasets optimized for specific business use cases. These are the tables your analysts query directly, your dashboards connect to, and your ML models train on.
Gold layer characteristics:
- Pre-aggregated metrics: Reduce query complexity and improve performance
- Domain-specific modeling: Organize around business concepts, not source systems
- Denormalized for consumption: Optimize for read patterns, not storage efficiency
from pyspark.sql.functions import (
    col, when, count, countDistinct,
    min as min_, max as max_, sum as sum_  # aliased to avoid shadowing builtins
)

def build_gold_user_engagement(silver_table: str, gold_table: str):
    """
    Build gold-layer user engagement summary from silver events.
    Aggregates daily metrics per user for dashboard consumption.
    """
    silver_df = spark.table(silver_table)

    engagement_summary = silver_df \
        .withColumn("event_date", col("event_timestamp").cast("date")) \
        .groupBy("user_id", "event_date", "device_type") \
        .agg(
            count("event_id").alias("total_events"),
            countDistinct("session_id").alias("unique_sessions"),
            countDistinct("page_url").alias("pages_viewed"),
            min_("event_timestamp").alias("first_activity"),
            max_("event_timestamp").alias("last_activity"),
            sum_(when(col("event_type") == "purchase", 1).otherwise(0)).alias("purchases"),
            sum_(when(col("event_type") == "add_to_cart", 1).otherwise(0)).alias("cart_additions")
        ) \
        .withColumn("session_depth",
                    col("total_events") / col("unique_sessions")) \
        .withColumn("conversion_rate",
                    col("purchases") / when(col("cart_additions") > 0, col("cart_additions")).otherwise(1))

    # Overwrite only the affected date partitions for idempotent rebuilds
    min_date = engagement_summary.agg(min_("event_date")).collect()[0][0]
    engagement_summary.write \
        .format("delta") \
        .mode("overwrite") \
        .option("replaceWhere", f"event_date >= '{min_date}'") \
        .saveAsTable(gold_table)

build_gold_user_engagement(
    silver_table="silver.clickstream_events",
    gold_table="gold.user_engagement_daily"
)
Gold tables are often rebuilt entirely or by partition. Unlike Silver’s incremental upserts, Gold prioritizes simplicity and correctness over efficiency. If aggregation logic changes, you can recompute from Silver without touching Bronze.
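The rebuild-by-partition idea can be modeled in plain Python: drop one day's rows, recompute them from silver, and observe that running the rebuild twice gives the same result (an illustrative sketch; `rebuild_partition` is a made-up helper, not a Delta API):

```python
from collections import Counter

def rebuild_partition(gold, silver_rows, day):
    """Idempotently rebuild one date partition of a gold aggregate:
    discard the old rows for that day, then recompute from silver."""
    gold = {k: v for k, v in gold.items() if k[1] != day}
    counts = Counter(
        (row["user_id"], row["date"]) for row in silver_rows if row["date"] == day
    )
    gold.update(counts)
    return gold

silver_rows = [
    {"user_id": "u1", "date": "2024-01-15"},
    {"user_id": "u1", "date": "2024-01-15"},
    {"user_id": "u2", "date": "2024-01-16"},
]
gold = rebuild_partition({}, silver_rows, "2024-01-15")
gold = rebuild_partition(gold, silver_rows, "2024-01-15")  # rerun: same result
print(gold)  # {('u1', '2024-01-15'): 2}
```

Because the partition is replaced wholesale rather than incremented, a retried or duplicated run cannot double-count, which is exactly what `replaceWhere` provides at the table level.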
Data Flow and Orchestration
Tying the layers together requires robust orchestration. Here’s a Prefect flow that manages the complete pipeline:
from prefect import flow, task
from prefect.tasks import task_input_hash
from datetime import timedelta

@task(retries=2, cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))
def run_bronze_ingestion(source_path: str, bronze_table: str):
    return ingest_to_bronze(source_path, bronze_table, "web_analytics")

@task(retries=2)
def run_silver_transformation(bronze_table: str, silver_table: str, watermark: str):
    transform_to_silver(bronze_table, silver_table, watermark)

@task(retries=1)
def run_gold_aggregation(silver_table: str, gold_table: str):
    build_gold_user_engagement(silver_table, gold_table)

@task
def get_last_watermark(table: str) -> str:
    # Last bronze ingestion timestamp already processed into this table;
    # fall back to the epoch if the table is still empty
    result = spark.sql(f"SELECT max(_ingested_at) FROM {table}").collect()[0][0]
    return result or "1970-01-01T00:00:00"

@flow(name="medallion-pipeline")
def medallion_etl_flow(source_date: str):
    source_path = f"s3://raw-events/clickstream/{source_date}/"

    # Bronze ingestion
    bronze_count = run_bronze_ingestion(source_path, "bronze.clickstream_events")

    # Silver transformation (depends on bronze)
    watermark = get_last_watermark("silver.clickstream_events")
    run_silver_transformation(
        "bronze.clickstream_events",
        "silver.clickstream_events",
        watermark,
        wait_for=[bronze_count]
    )

    # Gold aggregation (depends on silver)
    run_gold_aggregation(
        "silver.clickstream_events",
        "gold.user_engagement_daily"
    )

# Run for a single day; attach a daily schedule via a Prefect deployment
medallion_etl_flow("2024/01/15")
When to Use (and When Not To)
Medallion architecture excels when:
- You need audit trails and data lineage for compliance
- Multiple teams consume data at different quality levels
- Source systems are unreliable or schemas evolve frequently
- You’re building ML pipelines that require reproducible feature engineering
Consider alternatives when:
- You’re a small team with simple analytics needs (just use dbt on a warehouse)
- Your use case is purely real-time streaming (consider Kappa architecture)
- Storage costs are a primary concern (three copies of data adds up)
Common anti-patterns to avoid:
- Skipping Silver and going Bronze → Gold directly (you’ll regret it during debugging)
- Over-engineering Bronze with complex transformations (keep it raw)
- Building too many Gold tables (consolidate around business domains)
Getting Started
Start small. Pick one data source, implement all three layers, and iterate. Use Delta Lake or Apache Iceberg as the table format; both provide the ACID guarantees and time travel capabilities that make medallion architecture practical.
Your implementation checklist:
- Set up separate schemas/databases for each layer
- Implement Bronze ingestion with metadata enrichment
- Define Silver quality rules and deduplication logic
- Build one Gold table serving a real business need
- Add orchestration with proper dependency management
- Document data contracts between layers
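The last checklist item can start life as executable checks rather than a prose document. Here is a minimal sketch assuming a silver contract like the one defined earlier; `validate_contract` and its rules are hypothetical:

```python
REQUIRED = ("event_id", "user_id", "event_timestamp")
ALLOWED_DEVICES = {"mobile", "desktop", "tablet", "unknown"}

def validate_contract(rows):
    """Return a list of violation messages; an empty list means the
    batch satisfies the silver contract."""
    violations = []
    for i, row in enumerate(rows):
        for field in REQUIRED:
            if row.get(field) is None:
                violations.append(f"row {i}: missing {field}")
        if row.get("device_type") not in ALLOWED_DEVICES:
            violations.append(f"row {i}: bad device_type {row.get('device_type')!r}")
    return violations

good = {"event_id": "e1", "user_id": "u1",
        "event_timestamp": "2024-01-15T00:00:00Z", "device_type": "mobile"}
bad = {"event_id": None, "user_id": "u2",
       "event_timestamp": "2024-01-15T00:00:01Z", "device_type": "tv"}
print(validate_contract([good, bad]))
# ['row 1: missing event_id', "row 1: bad device_type 'tv'"]
```

Running checks like these at the boundary between layers turns the contract from tribal knowledge into something a pipeline can fail loudly on.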
The medallion pattern isn’t revolutionary—it’s organized pragmatism. Each layer has a clear responsibility, failures are isolated, and debugging becomes systematic rather than chaotic.