Apache Spark - Garbage Collection Tuning
Key Insights
- Garbage collection tuning in Spark often yields 20-40% performance improvements, yet most teams skip it entirely, defaulting to configurations optimized for general Java applications rather than distributed data processing.
- The G1 garbage collector with properly sized regions handles most Spark workloads well, but memory-intensive jobs benefit from tuning young generation size and adjusting spark.memory.fraction based on your shuffle-to-cache ratio.
- Code-level changes—using broadcast variables, avoiding unnecessary UDFs, and enabling Kryo serialization—often matter more than JVM flag tweaking; fix the object allocation patterns first, then tune the collector.
Introduction to GC in Spark
Garbage collection in Apache Spark isn’t just a JVM concern—it’s a distributed systems problem. When an executor pauses for GC, it’s not just that node slowing down. Task stragglers delay entire stages, speculative execution wastes cluster resources, and in extreme cases, executors get marked as lost when they can’t heartbeat to the driver.
Spark’s programming model encourages operations that create massive object churn. Every map(), filter(), and groupBy() potentially allocates millions of intermediate objects. The framework does clever things with memory pools and off-heap storage, but your code still runs on the JVM, and the JVM still needs to clean up after you.
I’ve seen production jobs go from 45-minute runtimes to 28 minutes purely through GC tuning. That’s not optimization theater—that’s real compute cost savings and faster data pipelines.
Understanding Spark’s Memory Model
Spark divides executor memory into distinct regions, and understanding this split is essential for effective GC tuning.
The unified memory model (introduced in Spark 1.6) dynamically shares memory between execution and storage. Execution memory handles shuffles, joins, sorts, and aggregations. Storage memory caches RDDs and broadcast variables. When one side needs more memory and the other has spare capacity, borrowing happens automatically.
Here’s how to configure the basic memory layout:
val spark = SparkSession.builder()
  .appName("GC-Optimized-Job")
  .config("spark.executor.memory", "8g")
  .config("spark.memory.fraction", "0.6")          // Fraction of heap managed by Spark (default: 0.6)
  .config("spark.memory.storageFraction", "0.5")   // Initial storage share of that pool (default: 0.5)
  .config("spark.executor.memoryOverhead", "1g")   // Off-heap overhead
  .getOrCreate()
The remaining 40% (1 - spark.memory.fraction) goes to user data structures, internal metadata, and safeguarding against OOM errors from memory estimation inaccuracies.
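To make the split concrete, here is a back-of-the-envelope calculation for the 8g executor configured above. This is a sketch: the 300 MiB figure is Spark's fixed internal reserve, subtracted from the heap before the fractions apply.

```scala
// Approximate on-heap layout for spark.executor.memory=8g with the defaults above.
val heapMiB     = 8 * 1024.0        // executor heap
val reservedMiB = 300.0             // Spark's fixed reserved memory
val usableMiB   = heapMiB - reservedMiB
val unifiedMiB  = usableMiB * 0.6   // spark.memory.fraction: execution + storage
val storageMiB  = unifiedMiB * 0.5  // spark.memory.storageFraction: initial storage share
val userMiB     = usableMiB - unifiedMiB
println(f"unified: $unifiedMiB%.0f MiB, initial storage: $storageMiB%.0f MiB, user: $userMiB%.0f MiB")
// unified: 4735 MiB, initial storage: 2368 MiB, user: 3157 MiB
```

Roughly 3 GiB of an "8 GB" executor is outside Spark's managed pool, which is why heap sizing intuition based on the raw executor-memory number is often off.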
Object creation patterns matter enormously. RDD transformations on case classes create new objects for every record. DataFrames using the Tungsten engine store data in off-heap binary format, dramatically reducing GC pressure. This is why DataFrame operations typically outperform equivalent RDD code—it’s not just query optimization, it’s memory management.
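A minimal sketch of that difference, assuming an existing `events: Dataset[Event]` (the case class and column names are illustrative):

```scala
import org.apache.spark.sql.functions.sum

case class Event(userId: Long, amount: Double)

// RDD path: every record surfaces as a JVM case-class instance,
// and the map allocates a tuple per record -- all of it GC-managed.
val rddTotals = events.rdd
  .map(e => (e.userId, e.amount))
  .reduceByKey(_ + _)

// DataFrame path: Tungsten keeps rows in a compact binary format,
// so the aggregation runs with far fewer short-lived heap objects.
val dfTotals = events.groupBy("userId").agg(sum("amount"))
```

Both compute the same per-user totals; only the second keeps most of the working data out of the garbage collector's reach.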
Identifying GC Problems
Before tuning anything, you need visibility into what’s actually happening. Enable GC logging with these executor options:
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -Xloggc:/tmp/gc-executor.log"
For Java 11+, use the unified logging syntax:
--conf "spark.executor.extraJavaOptions=-Xlog:gc*:file=/tmp/gc-executor.log:time,uptime,level,tags"
The Spark UI’s Executors tab shows GC time per executor. If any executor shows GC time exceeding 10% of task time, you have a problem worth addressing. Look for these warning signs:
- Long GC pauses: Full GC events taking 10+ seconds indicate heap sizing issues or memory leaks.
- Frequent young GC: More than 1-2 minor collections per second suggests excessive object allocation.
- Executor failures: “ExecutorLostFailure” with “GC overhead limit exceeded” means the JVM spent too much time collecting garbage.
- Task stragglers: One task taking 10x longer than peers often correlates with GC pressure on that executor.
Choosing the Right Garbage Collector
For Spark workloads, you’re realistically choosing between three collectors:
Parallel GC (default in Java 8): Optimizes for throughput. Uses multiple threads for both young and old generation collection but stops the world for all GC events. Good baseline for batch jobs where occasional long pauses are acceptable.
G1GC: Balances throughput and latency. Divides the heap into regions and collects garbage incrementally. Better for larger heaps (8GB+) and when you need more predictable pause times.
ZGC (available from Java 11, production-ready since Java 15): Sub-millisecond pause times regardless of heap size. Higher CPU overhead but eliminates GC as a source of task stragglers. Consider it for latency-sensitive streaming jobs.
For most Spark batch workloads, G1GC with proper tuning hits the sweet spot:
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC \
-XX:G1HeapRegionSize=16m \
-XX:InitiatingHeapOccupancyPercent=35 \
-XX:ConcGCThreads=4 \
-XX:+ParallelRefProcEnabled"
G1 picks its region size automatically to target roughly 2048 regions; valid sizes are powers of two from 1MB to 32MB. Setting G1HeapRegionSize=16m on an 8GB heap yields 512 larger regions, which suits Spark's big buffer allocations: any object larger than half a region is treated as "humongous" and collected on a slower path, so larger regions mean fewer humongous allocations. For a 32GB heap, 16MB lands exactly on the 2048-region target; for 64GB heaps, use 32MB.
InitiatingHeapOccupancyPercent=35 starts concurrent marking earlier than the default 45%, giving G1 more time to complete collection before running out of space.
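The region arithmetic above is easy to check; a pure-Scala sketch of the sizing rule:

```scala
// Number of G1 regions = heap size / region size
def g1Regions(heapGiB: Long, regionMiB: Long): Long =
  heapGiB * 1024 / regionMiB

println(g1Regions(8, 16))   // 512
println(g1Regions(32, 16))  // 2048
println(g1Regions(64, 32))  // 2048
```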
Key Tuning Parameters
Here’s a complete spark-submit command incorporating GC tuning for a typical 8GB executor:
spark-submit \
--class com.example.DataPipeline \
--master yarn \
--deploy-mode cluster \
--executor-memory 8g \
--executor-cores 4 \
--num-executors 20 \
--conf "spark.memory.fraction=0.6" \
--conf "spark.memory.storageFraction=0.3" \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC \
-XX:G1HeapRegionSize=16m \
-XX:InitiatingHeapOccupancyPercent=35 \
-XX:MaxGCPauseMillis=200 \
-XX:+ParallelRefProcEnabled \
-Xlog:gc*:file=/var/log/spark/gc-%p.log:time" \
--conf "spark.executor.memoryOverhead=2g" \
application.jar
Key decisions in this configuration:
- spark.memory.storageFraction=0.3: Reduced from the default 0.5 because this job is shuffle-heavy with minimal caching.
- spark.executor.memoryOverhead=2g: Increased from the default 10% to handle off-heap allocations from Netty and native code.
- MaxGCPauseMillis=200: Tells G1 to target 200ms pauses. It’s a goal, not a guarantee.
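The overhead bump is easy to justify with the default formula, which is the maximum of 384 MiB and 10% of executor memory (the 0.10 factor is the default of spark.executor.memoryOverheadFactor):

```scala
// Default memoryOverhead for an 8g executor: max(384 MiB, 10% of heap)
val executorMiB = 8 * 1024
val defaultOverheadMiB = math.max(384, (executorMiB * 0.10).toInt)
println(defaultOverheadMiB) // 819 MiB -- well under the 2 GiB this job needs
```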
For streaming workloads where latency matters more than throughput, consider ZGC (drop -XX:+ZGenerational on Java versions before 21, where generational ZGC isn’t available):
--conf "spark.executor.extraJavaOptions=-XX:+UseZGC -XX:+ZGenerational -Xlog:gc*:file=/var/log/spark/gc-%p.log"
Code-Level Optimizations
JVM tuning has limits. If your code creates billions of unnecessary objects, no garbage collector configuration will save you. Here’s a real-world example showing the difference code patterns make:
// Bad: Creates new objects for every record
val result = data.rdd.map { row =>
  val config = loadConfig() // Object created per record!
  val processor = new DataProcessor(config)
  processor.transform(row)
}

// Better: Use broadcast variables
val configBroadcast = spark.sparkContext.broadcast(loadConfig())
val result = data.rdd.map { row =>
  val processor = new DataProcessor(configBroadcast.value) // still one processor per record
  processor.transform(row)
}

// Best: Reuse processor across each partition
val result = data.rdd.mapPartitions { iterator =>
  val processor = new DataProcessor(configBroadcast.value)
  iterator.map(row => processor.transform(row))
}
The mapPartitions version creates one processor per partition instead of one per record. For a dataset with 100 million records across 1000 partitions, that’s 100 million objects reduced to 1000.
Enable Kryo serialization for additional memory efficiency:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

val spark = SparkSession.builder()
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrationRequired", "false")
  .config("spark.kryo.registrator", "com.example.MyKryoRegistrator")
  .getOrCreate()

// Custom registrator for your domain classes
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[UserEvent])
    kryo.register(classOf[AggregatedMetric])
  }
}
Kryo typically produces serialized data 2-10x smaller than Java serialization, directly reducing memory pressure and GC frequency for cached and shuffled data.
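When deciding which classes are worth registering, it helps to measure rather than guess. Spark ships org.apache.spark.util.SizeEstimator (a developer API) for approximating on-heap footprint; the case class and sample values below are illustrative:

```scala
import org.apache.spark.util.SizeEstimator

case class UserEvent(userId: Long, eventType: String, ts: Long)

// Estimate the on-heap size of a thousand typical records
val sample = Seq.fill(1000)(UserEvent(42L, "click", 1700000000L))
println(s"~${SizeEstimator.estimate(sample)} bytes on heap")
```

Comparing this estimate against the serialized size of the same data gives a concrete picture of what Kryo buys you for your domain objects.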
Monitoring and Iterating
One-time tuning isn’t enough. Workload characteristics change as data volumes grow and processing logic evolves. Set up continuous monitoring with a custom Spark listener:
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}
import org.slf4j.LoggerFactory

class GCMetricsListener extends SparkListener {
  private val log = LoggerFactory.getLogger(getClass)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    val gcTime = metrics.jvmGCTime
    val runTime = metrics.executorRunTime
    if (runTime > 0) {
      val gcRatio = gcTime.toDouble / runTime
      if (gcRatio > 0.1) {
        log.warn(s"Task ${taskEnd.taskInfo.taskId} spent ${(gcRatio * 100).toInt}% time in GC")
      }
      // Push to your metrics system
      MetricsRegistry.histogram("spark.gc.ratio").update(gcRatio)
      MetricsRegistry.histogram("spark.gc.time.ms").update(gcTime)
    }
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val stageInfo = stageCompleted.stageInfo
    // taskMetrics can be null for stages that were skipped
    val totalGcTime = Option(stageInfo.taskMetrics).map(_.jvmGCTime).getOrElse(0L)
    MetricsRegistry.counter(s"spark.stage.${stageInfo.stageId}.gc.total").inc(totalGcTime)
  }
}

// Register the listener
spark.sparkContext.addSparkListener(new GCMetricsListener())
Establish baselines before making changes. Run your job three times with identical data, record GC times, and calculate variance. Then make one change at a time and measure again. GC tuning is empirical—theories about what should work often don’t match reality.
Track these metrics over time: median GC pause duration, 99th percentile pause duration, GC time as percentage of task time, and full GC frequency. Alert when any of these regress significantly from your baseline.
The goal isn’t zero GC—that’s impossible on the JVM. The goal is predictable, bounded GC that doesn’t create stragglers or threaten executor stability. With proper tuning, GC should be invisible in your job’s performance profile.