Delta Lake vs Apache Iceberg vs Apache Hudi

Key Insights

  • Delta Lake excels in Spark-native environments and offers the smoothest experience for Databricks users, but its tight coupling to Spark historically limited multi-engine adoption—though UniForm is changing this rapidly.
  • Apache Iceberg has emerged as the industry favorite for multi-engine analytics, with superior partition evolution and the broadest vendor support across cloud data platforms.
  • Apache Hudi’s record-level indexing and timeline architecture make it the strongest choice for CDC pipelines and streaming upsert workloads, though it carries more operational complexity.

The Rise of Open Table Formats

Data lakes promised cheap, scalable storage. They delivered chaos instead. Without transactional guarantees, teams faced corrupt reads during writes, no way to roll back bad data, and partition schemes that couldn’t evolve without rewriting petabytes.

Open table formats solve this by adding a metadata layer on top of Parquet (or ORC) files. They provide ACID transactions, schema evolution, time travel, and partition pruning—all while keeping your data in open formats on object storage.
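Time travel is the easiest of these features to see in action. As an illustrative sketch (Spark SQL; table paths and names are placeholders), querying a past state looks like:

```sql
-- Delta Lake: query a previous version of the table
SELECT * FROM delta.`/path/to/table` VERSION AS OF 12;

-- Apache Iceberg: query the table as of a timestamp
SELECT * FROM iceberg_catalog.db.my_table TIMESTAMP AS OF '2024-01-15 12:00:00';
```

Hudi exposes the same capability through the `as.of.instant` read option on the DataFrame reader.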

Three formats dominate: Delta Lake (Databricks), Apache Iceberg (Netflix, now broadly adopted), and Apache Hudi (Uber). Each emerged from different pain points, and those origins shape their strengths today. Choosing wrong means either migrating later or living with suboptimal performance for your workload.

Architecture & Core Concepts Comparison

Understanding how each format tracks changes reveals why they behave differently under various workloads.

Delta Lake uses a transaction log (_delta_log/) containing JSON files that record every operation. Each commit creates a new JSON file, and periodic checkpoints consolidate these into Parquet for faster reads. The log is append-only and uses optimistic concurrency control.
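Each commit file holds newline-delimited JSON actions. A simplified append commit might look like this (values are illustrative and several fields are omitted):

```json
{"commitInfo": {"timestamp": 1705312800000, "operation": "WRITE", "operationParameters": {"mode": "Append"}}}
{"add": {"path": "part-00000-abc123.parquet", "partitionValues": {}, "size": 1024, "modificationTime": 1705312800000, "dataChange": true}}
```

Reconstructing the table state means replaying these actions from the last checkpoint forward.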

Apache Iceberg takes a snapshot-based approach with three metadata layers: metadata files point to manifest lists, which point to manifest files, which track individual data files. This hierarchy enables efficient partition pruning without listing directories.

Apache Hudi organizes data around a timeline (.hoodie/) that records all operations as instants. It distinguishes between commits, deltacommits, compactions, and cleans. Hudi also maintains indexes for record-level operations.

Here’s how the file structure differs for identical data:

# Delta Lake structure
my_table/
├── _delta_log/
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   └── 00000000000000000010.checkpoint.parquet
├── part-00000-abc123.parquet
└── part-00001-def456.parquet

# Apache Iceberg structure
my_table/
├── metadata/
│   ├── v1.metadata.json
│   ├── snap-123456789.avro
│   └── abc123-m0.avro
└── data/
    ├── part-00000-abc123.parquet
    └── part-00001-def456.parquet

# Apache Hudi structure
my_table/
├── .hoodie/
│   ├── hoodie.properties
│   ├── 20240115120000.commit
│   └── 20240115120000.commit.requested
├── partition=2024-01/
│   ├── file1_0_20240115120000.parquet
│   └── .file1_0_20240115120000.log
└── partition=2024-02/
    └── file2_0_20240115120000.parquet

Iceberg’s separation of metadata from data enables catalog-level operations without touching data files. Hudi’s log files alongside base files support its merge-on-read mode. Delta’s simpler structure works well but requires the transaction log for any operation.

Key Feature Comparison

Schema evolution reveals fundamental design differences. All three support adding columns, but handling renames and type changes varies significantly.

# Delta Lake schema evolution
# Assumes an active SparkSession (`spark`) with Delta Lake configured

# Enable column mapping for renames (Delta 2.0+)
spark.sql("""
    ALTER TABLE delta.`/path/to/table` 
    SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')
""")

# Rename column
spark.sql("ALTER TABLE delta.`/path/to/table` RENAME COLUMN old_name TO new_name")

# Add column with merge schema
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("/path/to/table")

# Apache Iceberg schema evolution
# Assumes the same SparkSession with an Iceberg catalog (`iceberg_catalog`) configured

spark.sql("ALTER TABLE iceberg_catalog.db.my_table RENAME COLUMN old_name TO new_name")
spark.sql("ALTER TABLE iceberg_catalog.db.my_table ALTER COLUMN id TYPE bigint")
spark.sql("ALTER TABLE iceberg_catalog.db.my_table ADD COLUMN new_col string AFTER existing_col")

# Partition evolution - no rewrite needed
spark.sql("ALTER TABLE iceberg_catalog.db.my_table ADD PARTITION FIELD month(event_date)")

# Apache Hudi schema evolution
hudi_options = {
    'hoodie.table.name': 'my_table',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.schema.on.read.enable': 'true',  # Enable schema evolution
    'hoodie.datasource.write.reconcile.schema': 'true'
}

# Write with evolved schema - new columns added automatically
df_with_new_columns.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("/path/to/table")

Iceberg’s partition evolution stands out. You can change partition schemes without rewriting data—new files use the new scheme while old files retain their original partitioning. Delta and Hudi require rewrites for partition changes.

For concurrency, Delta uses optimistic concurrency with conflict detection at commit time. Iceberg supports both optimistic and pessimistic locking depending on the catalog. Hudi uses timeline-based concurrency with configurable lock providers.
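The optimistic pattern is worth seeing concretely. Below is a toy, pure-Python sketch of the idea, not Delta's actual implementation: each writer records the log version it read, does its work, and retries if another writer committed in the meantime.

```python
import threading

class ToyTransactionLog:
    """Minimal optimistic-concurrency log: a commit succeeds only if
    nobody else committed since the version the writer read."""
    def __init__(self):
        self._lock = threading.Lock()  # stands in for the store's atomic rename
        self.version = 0
        self.entries = []

    def try_commit(self, read_version, entry):
        with self._lock:
            if self.version != read_version:
                return False           # conflict: someone committed first
            self.entries.append(entry)
            self.version += 1
            return True

def commit_with_retry(log, make_entry, max_attempts=5):
    for _ in range(max_attempts):
        read_version = log.version                   # 1. read current state
        entry = make_entry()                         # 2. do the work
        if log.try_commit(read_version, entry):      # 3. attempt atomic commit
            return log.version
    raise RuntimeError("too many concurrent conflicts")

log = ToyTransactionLog()
commit_with_retry(log, lambda: {"op": "append", "files": ["part-0.parquet"]})
commit_with_retry(log, lambda: {"op": "append", "files": ["part-1.parquet"]})
print(log.version)  # 2
```

Real implementations add conflict *classification* on top of this: two appends to disjoint files can both succeed, while overlapping deletes force a retry.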

Write & Read Performance Characteristics

Each format optimizes for different access patterns.

Hudi excels at streaming upserts and CDC workloads. Its record-level indexing (bloom filters, HBase, or bucket indexes) enables efficient updates without scanning entire partitions:

# Hudi upsert with record-level indexing
hudi_upsert_options = {
    'hoodie.table.name': 'events',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.recordkey.field': 'event_id',
    'hoodie.datasource.write.precombine.field': 'event_time',
    'hoodie.index.type': 'BLOOM',
    'hoodie.bloom.index.update.partition.path': 'true'
}

streaming_df.writeStream \
    .format("hudi") \
    .options(**hudi_upsert_options) \
    .option("checkpointLocation", "/checkpoint/events") \
    .start("/data/events")

Iceberg optimizes for analytical reads with hidden partitioning and predicate pushdown that works across partition evolution:

# Iceberg merge with efficient predicate pushdown
spark.sql("""
    MERGE INTO iceberg_catalog.db.target t
    USING iceberg_catalog.db.source s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

Delta Lake provides the tightest Spark integration with features like Z-ordering for multi-dimensional clustering:

# Delta merge with Z-order optimization
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/data/events")

delta_table.alias("t").merge(
    updates_df.alias("s"),
    "t.event_id = s.event_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Optimize with Z-ordering for query patterns
spark.sql("OPTIMIZE delta.`/data/events` ZORDER BY (user_id, event_date)")

Small file handling differs too. Delta compacts files with the OPTIMIZE command. Iceberg compacts through the rewrite_data_files maintenance procedure. Hudi has built-in compaction for merge-on-read tables that runs inline or asynchronously.
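As a rough sketch (assuming a Spark session with each format's extensions configured; paths and names are placeholders), compaction looks like:

```python
# Delta Lake: bin-pack small files
spark.sql("OPTIMIZE delta.`/data/events`")

# Apache Iceberg: rewrite_data_files maintenance procedure
spark.sql("CALL iceberg_catalog.system.rewrite_data_files(table => 'db.events')")

# Apache Hudi: inline compaction for merge-on-read tables (write config)
hudi_compaction_options = {
    'hoodie.compact.inline': 'true',
    'hoodie.compact.inline.max.delta.commits': '5'
}
```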

Ecosystem & Engine Compatibility

This is where Iceberg has pulled ahead. Its engine compatibility matrix is the broadest:

-- Querying Iceberg from Spark
SELECT * FROM iceberg_catalog.db.events 
WHERE event_date >= '2024-01-01';

-- Same table from Trino (no configuration changes)
SELECT * FROM iceberg.db.events 
WHERE event_date >= '2024-01-01';

-- Same table from Flink SQL
SELECT * FROM iceberg_catalog.db.events 
WHERE event_date >= DATE '2024-01-01';

Native Iceberg support is now available across the major cloud platforms: AWS (Athena, Glue, EMR), Google Cloud (BigQuery, Dataproc), and Azure (Synapse, HDInsight), plus Snowflake. This multi-vendor adoption creates a network effect.

Delta Lake historically required Spark, but the delta-rs project (Rust-based) and UniForm (which writes Iceberg and Hudi metadata alongside Delta) have expanded compatibility. Databricks customers get the smoothest experience, but standalone Delta works well with Trino and Flink now.

Hudi supports Spark, Flink, and Presto/Trino, but its complexity means fewer engines implement full support. The trade-off is more powerful features for supported engines.

When to Choose What

Choose Apache Hudi when:

  • You’re building CDC pipelines from operational databases
  • Your workload is write-heavy with frequent updates to individual records
  • You need near-real-time data freshness with streaming ingestion
  • You can invest in operational expertise

Choose Apache Iceberg when:

  • You need multi-engine access (Spark, Trino, Flink, Snowflake)
  • Your partition strategy will evolve over time
  • You want the broadest vendor support and community momentum
  • You’re building a vendor-neutral data platform

Choose Delta Lake when:

  • You’re running on Databricks (it’s the obvious choice)
  • Your team is Spark-centric with deep Spark expertise
  • You want the simplest operational model
  • You can use UniForm for cross-format compatibility

For migrations, all three formats support reading Parquet directly, so you can migrate incrementally by pointing new writes to the table format while reading historical data from raw Parquet.
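In-place conversion paths exist as well. Illustrative commands (catalog and path names are placeholders):

```sql
-- Delta Lake: convert a Parquet directory in place
CONVERT TO DELTA parquet.`/data/events`;

-- Apache Iceberg: register existing Parquet files into a table
CALL iceberg_catalog.system.add_files(
    table => 'db.events',
    source_table => '`parquet`.`/data/events`'
);
```

Hudi offers a comparable bootstrap mechanism driven by write configs such as `hoodie.bootstrap.base.path`.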

Future Outlook

The formats are converging. Delta’s UniForm writes Iceberg-compatible metadata. Iceberg’s REST catalog enables universal access. Hudi 1.0 simplified its architecture significantly.

My recommendation: design for portability. Use catalog abstractions, avoid format-specific SQL extensions where possible, and keep your data in standard Parquet. The format wars will continue, but your data shouldn’t be held hostage.

If starting fresh today with no existing infrastructure, Iceberg offers the safest bet for long-term flexibility. If you’re on Databricks, use Delta with UniForm enabled. If you’re building streaming CDC pipelines, Hudi’s record-level capabilities justify its complexity.
