Apache Spark - Partition Pruning

Key Insights

  • Partition pruning eliminates unnecessary data reads at the physical level, often reducing query execution time by 10-100x on large datasets by skipping entire file directories.
  • Dynamic Partition Pruning (DPP), introduced in Spark 3.0, extends pruning benefits to join operations by injecting filter predicates at runtime based on dimension table results.
  • Effective partition pruning requires careful column selection—choose low-cardinality columns with high query selectivity, and always verify pruning occurs using EXPLAIN and the Spark UI.

Introduction to Partition Pruning

Partition pruning is Spark’s mechanism for skipping irrelevant data partitions during query execution. Think of it like a library’s card catalog system: instead of walking through every aisle to find books published in 2023, you go directly to the 2023 section. Without partition pruning, Spark reads every file in your dataset, even when your query only needs a small subset.

For large-scale data processing, this optimization is non-negotiable. A table partitioned by day over 5 years contains roughly 1,825 partitions. If your query filters for last week’s data, partition pruning reads 7 partitions instead of 1,825—a 99.6% reduction in I/O.
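The arithmetic behind that reduction is worth making concrete. A quick sketch using the numbers above:

```python
# 5 years of daily partitions vs. a 7-day query window
total_partitions = 5 * 365          # 1,825 daily partitions
partitions_read = 7                 # last week's data only

reduction = 1 - partitions_read / total_partitions
print(f"{reduction:.1%} of partition reads skipped")  # prints "99.6% of partition reads skipped"
```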

The performance impact compounds with data size. I’ve seen production queries drop from 45 minutes to under 2 minutes simply by ensuring partition pruning engaged correctly.

How Spark Partitioning Works

Spark supports multiple partitioning strategies, but for partition pruning, we care about disk partitioning—how data files are physically organized in storage. This differs from in-memory RDD partitioning used for shuffle operations.

When you write data with partitionBy(), Spark creates a directory hierarchy based on partition column values:

/data/sales/
  year=2023/
    month=01/
      part-00000.parquet
      part-00001.parquet
    month=02/
      part-00000.parquet
  year=2024/
    month=01/
      part-00000.parquet

Here’s how to create and write a partitioned dataset:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month

spark = SparkSession.builder \
    .appName("PartitionPruningDemo") \
    .getOrCreate()

# Create sample sales data
sales_data = [
    (1, "2023-01-15", 100.00, "electronics"),
    (2, "2023-02-20", 250.00, "clothing"),
    (3, "2024-01-10", 175.00, "electronics"),
    (4, "2024-02-28", 300.00, "furniture"),
]

df = spark.createDataFrame(sales_data, ["id", "sale_date", "amount", "category"])

# Add partition columns
df_partitioned = df \
    .withColumn("sale_year", year(col("sale_date"))) \
    .withColumn("sale_month", month(col("sale_date")))

# Write with partitioning
df_partitioned.write \
    .partitionBy("sale_year", "sale_month") \
    .mode("overwrite") \
    .parquet("/data/sales_partitioned")

The partition column order matters. Place the most frequently filtered, lowest-cardinality column first. Year before month, region before city.
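Conceptually, pruning against this layout is just matching on path segments. A plain-Python sketch of the idea (not Spark's actual implementation—Spark prunes via its file index, and writes unpadded values like sale_month=1 for integer columns):

```python
# Hive-style partition paths, as produced by partitionBy() above
paths = [
    "/data/sales/sale_year=2023/sale_month=1/part-00000.parquet",
    "/data/sales/sale_year=2023/sale_month=2/part-00000.parquet",
    "/data/sales/sale_year=2024/sale_month=1/part-00000.parquet",
]

def prune(paths, column, value):
    """Keep only files whose path contains a column=value segment."""
    token = f"{column}={value}"
    return [p for p in paths if token in p.split("/")]

matching = prune(paths, "sale_year", 2024)  # only the 2024 directory survives
```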

Static vs. Dynamic Partition Pruning

Spark performs partition pruning in two ways: static (compile-time) and dynamic (runtime).

Static partition pruning occurs during query planning when filter predicates contain literal values or expressions Spark can evaluate before execution:

# Static pruning - filter value known at compile time
df = spark.read.parquet("/data/sales_partitioned")
result = df.filter(col("sale_year") == 2024)

# Check the execution plan
result.explain(True)

The EXPLAIN output shows static pruning in action:

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [id#0,sale_date#1,amount#2,category#3,sale_year#4,sale_month#5]
   Batched: true
   Location: InMemoryFileIndex[/data/sales_partitioned]
   PartitionFilters: [isnotnull(sale_year#4), (sale_year#4 = 2024)]
   PushedFilters: []
   ReadSchema: struct<id:int,sale_date:string,amount:double,category:string>

The PartitionFilters line confirms pruning. Spark only reads directories matching year=2024.
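You can automate this check by scanning the plan text. A small helper—an illustration, not a Spark API; in PySpark, explain() prints to stdout, so you can capture plan_text with contextlib.redirect_stdout:

```python
import re

def partition_pruning_applied(plan_text: str, column: str) -> bool:
    """Return True if `column` appears inside the PartitionFilters clause."""
    match = re.search(r"PartitionFilters: \[(.*?)\]", plan_text)
    return bool(match) and column in match.group(1)

# Sample line from the plan above
plan = "PartitionFilters: [isnotnull(sale_year#4), (sale_year#4 = 2024)]"
```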

Dynamic partition pruning handles cases where filter values aren’t known until runtime—specifically during joins. Before Spark 3.0, joining a large fact table with a filtered dimension table read the entire fact table. DPP changed that.

Dynamic Partition Pruning Deep Dive

DPP shines in star schema queries where you filter a dimension table and join to a fact table. Consider this scenario: you have a 10TB sales fact table partitioned by store_id and a small stores dimension table. You want sales from California stores only.

Without DPP, Spark reads all 10TB, joins with the filtered stores, then discards non-California data. With DPP, Spark first identifies California store IDs, then uses those values to prune the fact table read.
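The runtime sequence can be sketched in plain Python—a conceptual model of the two phases, not Spark internals:

```python
stores = [(1, "Store A", "CA"), (2, "Store B", "NY"),
          (3, "Store C", "CA"), (4, "Store D", "TX")]

# Fact data grouped by its partition key (store_id)
fact_partitions = {
    1: [(100, 500.00)],
    2: [(101, 300.00)],
    3: [(102, 450.00)],
    4: [(103, 600.00)],
}

# Phase 1: evaluate the dimension filter to get the pruning keys
ca_store_ids = {sid for sid, _, state in stores if state == "CA"}

# Phase 2: read only the fact partitions whose key survived the filter
rows = [r for sid in sorted(ca_store_ids) for r in fact_partitions[sid]]
```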

# Enable DPP (enabled by default in Spark 3.0+)
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly", "true")

# Create dimension table
stores = spark.createDataFrame([
    (1, "Store A", "CA"),
    (2, "Store B", "NY"),
    (3, "Store C", "CA"),
    (4, "Store D", "TX"),
], ["store_id", "store_name", "state"])

# Create fact table (imagine this is 10TB in production)
sales_fact = spark.createDataFrame([
    (100, 1, 500.00),
    (101, 2, 300.00),
    (102, 3, 450.00),
    (103, 4, 600.00),
], ["transaction_id", "store_id", "amount"])

# Write fact table partitioned by store_id
sales_fact.write \
    .partitionBy("store_id") \
    .mode("overwrite") \
    .parquet("/data/sales_fact")

# Query with DPP
sales_partitioned = spark.read.parquet("/data/sales_fact")

california_sales = sales_partitioned.join(
    stores.filter(col("state") == "CA"),
    "store_id"
).select("transaction_id", "store_name", "amount")

california_sales.explain(True)

The physical plan reveals DPP with a DynamicPruningExpression:

== Physical Plan ==
*(2) Project [transaction_id#10, store_name#21, amount#12]
+- *(2) BroadcastHashJoin [store_id#11], [store_id#20], Inner, BuildRight
   :- *(2) FileScan parquet [transaction_id#10,store_id#11,amount#12]
   :     PartitionFilters: [isnotnull(store_id#11), 
   :       dynamicpruningexpression(store_id#11 IN dynamicpruning#45)]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, int, false]))
      +- *(1) Filter (isnotnull(state#22) AND (state#22 = CA))
         +- *(1) Scan ExistingRDD[store_id#20,store_name#21,state#22]

DPP prerequisites:

  • The dimension table must be small enough to broadcast (default: under 10MB, configurable via spark.sql.autoBroadcastJoinThreshold)
  • The fact table must be partitioned on the join column
  • The join must be an equi-join
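The first prerequisite is easy to reason about numerically: compare the dimension table's estimated size against the broadcast threshold. A back-of-the-envelope helper—the threshold mirrors Spark's 10MB default, and the size estimate is yours to supply:

```python
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark's default: 10MB

def broadcast_eligible(estimated_bytes: int,
                       threshold: int = DEFAULT_BROADCAST_THRESHOLD) -> bool:
    """True if the dimension table is small enough to broadcast,
    a precondition for broadcast-reuse DPP."""
    return 0 < estimated_bytes <= threshold
```

In practice, if the dimension table is slightly over the limit, you'd raise the threshold with spark.conf.set("spark.sql.autoBroadcastJoinThreshold", ...) rather than shrink the table.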

Monitoring and Verifying Pruning

Never assume pruning works—verify it. Three methods provide increasing detail:

Method 1: EXPLAIN output

df.filter(col("sale_year") == 2024).explain(True)

Look for PartitionFilters in the FileScan node. If your filter column appears there, pruning occurs. If it only appears in PushedFilters, the predicate is pushed down to the Parquet reader for row-group skipping—still useful, but Spark lists and scans every partition directory, so you get no partition-level benefit.

Method 2: Spark UI

Navigate to the SQL tab, find your query, and examine the scan node details. The “number of files read” metric tells the truth. If you expected to read 10 partitions but see 1,000 files, pruning failed.

Method 3: Input metrics

# Programmatic verification
df_filtered = spark.read.parquet("/data/sales_partitioned") \
    .filter(col("sale_year") == 2024)

# Force execution and check metrics
df_filtered.write.format("noop").mode("overwrite").save()

# Detailed bytesRead/recordsRead metrics live in the Spark UI (Stages tab)
# or the REST API at http://<driver>:4040/api/v1/applications

Key metrics to monitor:

  • bytesRead: Should be proportional to filtered partitions
  • recordsRead: Same principle
  • filesRead: Direct count of partition files accessed
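These metrics lend themselves to an automated sanity check. A sketch of the proportionality test—the metric values here are hypothetical; in practice they come from the Spark UI or REST API:

```python
def pruning_effective(bytes_read: int, total_bytes: int,
                      partitions_expected: int, partitions_total: int,
                      slack: float = 2.0) -> bool:
    """Flag a scan whose bytesRead far exceeds the expected share of
    data for the partitions we meant to read (slack allows for skew)."""
    expected_fraction = partitions_expected / partitions_total
    return bytes_read <= total_bytes * expected_fraction * slack

# Hypothetical: 7 of 1,825 partitions read from a ~10TB table
ok = pruning_effective(40 * 10**9, 10 * 10**12, 7, 1825)
```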

Common Pitfalls and Optimization Tips

Partition pruning fails silently. Your query runs, returns correct results, but reads 100x more data than necessary. Here are the common culprits:

Pitfall 1: Type mismatches

# BAD: String literal compared to integer partition column
df.filter(col("sale_year") == "2024")  # May prevent pruning

# GOOD: Matching types
df.filter(col("sale_year") == 2024)

Pitfall 2: UDFs in filter predicates

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# BAD: UDF prevents pruning optimization
is_recent = udf(lambda x: x >= 2024, BooleanType())
df.filter(is_recent(col("sale_year")))  # Full scan

# GOOD: Native Spark expression
df.filter(col("sale_year") >= 2024)  # Pruning works

Pitfall 3: Filtering on non-partition columns

# This filters AFTER reading all data
df.filter(col("category") == "electronics")  # category isn't a partition column

# If category queries are common, rewrite the data partitioned by category:
df.write.partitionBy("category", "sale_year").parquet("/data/sales_by_category")

Pitfall 4: Expressions that hide partition columns

# BAD: Expression wrapping partition column
df.filter(col("sale_year") + 0 == 2024)  # Pruning may fail

# BAD: CAST operations
df.filter(col("sale_year").cast("string") == "2024")

# GOOD: Direct comparison
df.filter(col("sale_year") == 2024)

Best practices for partition column selection:

  1. Choose columns that appear in WHERE clauses of 80%+ of queries
  2. Prefer low cardinality (date, region, status) over high cardinality (user_id, transaction_id)
  3. Avoid over-partitioning—thousands of tiny files hurt performance
  4. Target partition sizes of 128MB-1GB for optimal parallelism
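Point 4 is easy to sanity-check before you write anything. A rough sizing helper—all inputs are your own estimates:

```python
def avg_partition_size_bytes(total_bytes: int, partition_count: int) -> float:
    return total_bytes / partition_count

def within_target(total_bytes: int, partition_count: int,
                  lo: int = 128 * 1024**2, hi: int = 1024**3) -> bool:
    """True if the average partition size lands in the 128MB-1GB sweet spot."""
    return lo <= avg_partition_size_bytes(total_bytes, partition_count) <= hi

# e.g. 1TB across 5 years of daily partitions (~1,825) is ~0.6GB each: fine.
# The same 1TB across a million user_id partitions would be ~1MB each: too small.
```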

Conclusion

Partition pruning transforms Spark from a brute-force data processor into a surgical query engine. The implementation requires minimal code changes but demands attention to detail: correct data types, native expressions over UDFs, and partition columns aligned with query patterns.

Start with your slowest queries. Check their execution plans. If PartitionFilters is empty while filtering on what should be partition columns, you’ve found your optimization target. For join-heavy workloads on Spark 3.0+, verify DPP engagement—it’s the difference between reading terabytes and gigabytes.

The payoff is substantial: reduced cluster costs, faster query response times, and happier stakeholders. Partition pruning isn’t glamorous, but it’s one of the highest-ROI optimizations in the Spark toolkit.
