Apache Spark - Production Deployment Checklist
Key Insights
- Proper executor sizing follows the “5 cores per executor” rule—more cores create excessive GC overhead, fewer waste parallelism potential
- Production Spark deployments fail most often due to missing observability, not incorrect business logic—instrument everything from day one
- Security and automation aren’t optional hardening steps; they’re foundational requirements that become exponentially harder to retrofit
Cluster Sizing and Resource Allocation
Getting resource allocation wrong is the fastest path to production incidents. Too little memory causes OOM kills. Too many cores per executor creates GC nightmares. The sweet spot requires understanding your workload characteristics.
Start with the “5 cores per executor” heuristic. HDFS clients perform poorly beyond 5 concurrent threads, and JVM garbage collection becomes problematic with larger heap sizes. For memory, reserve 10% for overhead and allocate based on your data’s memory footprint after deserialization.
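The arithmetic behind that heuristic can be sketched directly. The node shape below (16 cores, 64 GB) and the reserved amounts are illustrative assumptions, not recommendations; plug in your own hardware:

```python
# Sketch of the executor-sizing arithmetic described above, assuming
# hypothetical 16-core, 64 GB worker nodes.
def size_executors(cores_per_node: int, mem_gb_per_node: int,
                   cores_per_executor: int = 5) -> dict:
    # Reserve 1 core and 1 GB per node for the OS and cluster daemons.
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_gb_per_node - 1
    executors_per_node = usable_cores // cores_per_executor
    mem_per_executor = usable_mem_gb // executors_per_node
    # Roughly 10% of executor memory goes to off-heap overhead.
    overhead_gb = max(1, round(mem_per_executor * 0.10))
    return {
        "executors_per_node": executors_per_node,
        "executor_memory_gb": mem_per_executor - overhead_gb,
        "memory_overhead_gb": overhead_gb,
    }

print(size_executors(16, 64))
# 3 executors of 5 cores each, with ~10% carved out as memoryOverhead
```

Run against larger nodes, the same arithmetic produces figures close to the 18g/5-core executors in the submit script below.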
Here’s a production-ready spark-submit configuration:
```bash
#!/bin/bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --driver-cores 2 \
  --executor-memory 18g \
  --executor-cores 5 \
  --num-executors 20 \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.driver.memoryOverhead=1g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=5 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.dynamicAllocation.schedulerBacklogTimeout=1s \
  --conf spark.shuffle.service.enabled=true \
  --class com.company.DataPipeline \
  /opt/spark-apps/pipeline-1.0.0.jar
```
Dynamic allocation is non-negotiable for production. Fixed executor counts waste resources during light processing phases and starve jobs during heavy ones. Set minExecutors to handle baseline load and maxExecutors based on cluster capacity limits.
Configuration Hardening
Default Spark configurations optimize for development convenience, not production reliability. Your spark-defaults.conf needs deliberate tuning.
```properties
# Serialization - Kryo is significantly faster than Java serialization,
# often by as much as 10x
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired=false
spark.kryo.unsafe=true

# Memory management
spark.memory.fraction=0.6
spark.memory.storageFraction=0.5
spark.sql.shuffle.partitions=200

# Shuffle optimization
spark.shuffle.compress=true
spark.shuffle.spill.compress=true
spark.shuffle.file.buffer=1m
spark.reducer.maxSizeInFlight=96m
spark.shuffle.io.maxRetries=10
spark.shuffle.io.retryWait=30s

# Network timeouts for production stability
spark.network.timeout=600s
spark.executor.heartbeatInterval=30s
spark.rpc.askTimeout=300s

# Speculation for straggler mitigation
spark.speculation=true
spark.speculation.multiplier=1.5
spark.speculation.quantile=0.9

# SQL optimizations
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.broadcastTimeout=600
```
The adaptive query execution settings deserve attention. AQE dynamically coalesces shuffle partitions and splits skewed join partitions, problems that previously required manual intervention. It is enabled by default since Spark 3.2, but setting the flags explicitly keeps behavior unambiguous across versions.
Monitoring and Observability
Spark's built-in UI provides post-mortem analysis, but production requires real-time metrics. Push metrics to Prometheus or Graphite for alerting and trend analysis. Note that Spark itself ships only a Prometheus servlet endpoint; the push-gateway PrometheusSink shown below comes from a third-party jar that must be on the classpath.
Configure the metrics sink in metrics.properties:
```properties
*.sink.prometheus.class=org.apache.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address=prometheus-pushgateway:9091
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds
*.sink.prometheus.pushgateway-enable-timestamp=true

master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```
For custom job-level metrics, implement a SparkListener:
```scala
import org.apache.spark.Success
import org.apache.spark.scheduler._
import io.prometheus.client.{Counter, Histogram}

class ProductionSparkListener extends SparkListener {

  private val jobDuration = Histogram.build()
    .name("spark_job_duration_seconds")
    .help("Job execution duration")
    .labelNames("job_name", "status")
    .register()

  private val taskFailures = Counter.build()
    .name("spark_task_failures_total")
    .help("Total task failures")
    .labelNames("stage_id", "reason")
    .register()

  private val jobStartTimes = scala.collection.mutable.Map[Int, Long]()

  override def onJobStart(event: SparkListenerJobStart): Unit = {
    jobStartTimes(event.jobId) = System.currentTimeMillis()
  }

  override def onJobEnd(event: SparkListenerJobEnd): Unit = {
    val duration = (System.currentTimeMillis() -
      jobStartTimes.getOrElse(event.jobId, 0L)) / 1000.0
    val status = event.jobResult match {
      case JobSucceeded => "success"
      case _            => "failure"
    }
    jobDuration.labels(s"job_${event.jobId}", status).observe(duration)
    jobStartTimes.remove(event.jobId)
  }

  override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
    // Success is a case object, so match on the value, not a type pattern
    Option(event.reason).foreach {
      case Success => // successful tasks are not counted
      case reason =>
        taskFailures.labels(
          event.stageId.toString,
          reason.getClass.getSimpleName
        ).inc()
    }
  }
}
```
Register the listener via spark.extraListeners=com.company.ProductionSparkListener.
Data Pipeline Reliability
Streaming pipelines require checkpointing for fault tolerance. Batch pipelines benefit from it for recovery. Configure checkpoints to durable storage—never local disk.
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("ReliableStreamingPipeline")
  .config("spark.sql.streaming.checkpointLocation", "s3a://bucket/checkpoints/")
  .getOrCreate()

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")       // Production: handle missing offsets gracefully
  .option("maxOffsetsPerTrigger", 100000L) // Backpressure control
  .load()

val processedStream = kafkaStream
  .selectExpr("CAST(value AS STRING) as json")
  .writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "s3a://bucket/checkpoints/events-pipeline/")
  .option("mergeSchema", "true")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start("s3a://bucket/processed-events/")

processedStream.awaitTermination()
```
For exactly-once semantics, use idempotent sinks or transactional writes. Delta Lake and Apache Iceberg provide ACID guarantees. If you are on the legacy DStreams Kafka integration, set spark.streaming.kafka.consumer.cache.enabled=false to avoid cached-consumer conflicts during recovery; Structured Streaming manages its own consumer pool through the spark.kafka.consumer.cache.* settings.
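The bookkeeping behind an idempotent sink is simple enough to sketch in plain Python. The in-memory set below is an illustrative stand-in for a transactional store (e.g. a table recording the last committed batch id per query):

```python
# Sketch of batch-id idempotence: skip any batch that was already
# committed, so replays after a restart write nothing twice.
processed_batches = set()  # production: a durable, transactional store

def write_batch(batch_id, rows, sink):
    """Append rows to sink unless this batch id was already committed."""
    if batch_id in processed_batches:  # batch replayed after a failure
        return False
    sink.extend(rows)                  # the actual write; must be atomic
    processed_batches.add(batch_id)    # commit the id together with the data
    return True
```

In Structured Streaming the same logic runs inside `foreachBatch`, which hands each micro-batch a monotonically increasing batch id for exactly this purpose.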
Security Configuration
Production clusters require authentication, encryption, and access control. Kerberos remains the standard for enterprise deployments.
```bash
#!/bin/bash
# Secure spark-submit with Kerberos authentication
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/opt/spark

kinit -kt /etc/security/keytabs/spark.service.keytab spark/hostname@REALM.COM

# Notes:
# - With spark.authenticate=true on YARN, the auth secret is generated
#   automatically; do not set spark.authenticate.secret by hand.
# - Keystore passwords are injected from the environment, never hardcoded.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal spark/hostname@REALM.COM \
  --keytab /etc/security/keytabs/spark.service.keytab \
  --conf spark.kerberos.access.hadoopFileSystems=s3a://secure-bucket \
  --conf spark.authenticate=true \
  --conf spark.network.crypto.enabled=true \
  --conf spark.network.crypto.keyLength=256 \
  --conf spark.io.encryption.enabled=true \
  --conf spark.ssl.enabled=true \
  --conf spark.ssl.keyStore=/etc/spark/ssl/keystore.jks \
  --conf spark.ssl.keyStorePassword="${SPARK_SSL_KEYSTORE_PASSWORD}" \
  --conf spark.ssl.trustStore=/etc/spark/ssl/truststore.jks \
  --conf spark.ssl.trustStorePassword="${SPARK_SSL_TRUSTSTORE_PASSWORD}" \
  --class com.company.SecurePipeline \
  /opt/spark-apps/secure-pipeline-1.0.0.jar
```
Never embed secrets in configuration files. Use a secrets manager and inject credentials at runtime through environment variables or mounted files.
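As a minimal sketch of that pattern, credentials can be resolved from the environment just before the submit command is assembled. The variable and config names here mirror the SSL settings above but are otherwise illustrative:

```python
# Hypothetical helper: resolve SSL passwords from the environment at
# launch time so nothing sensitive lives in spark-defaults.conf.
import os

def secret_confs(env=os.environ) -> dict:
    confs = {}
    for conf_key, env_var in {
        "spark.ssl.keyStorePassword": "SPARK_SSL_KEYSTORE_PASSWORD",
        "spark.ssl.trustStorePassword": "SPARK_SSL_TRUSTSTORE_PASSWORD",
    }.items():
        value = env.get(env_var)
        if value is None:
            # Fail fast at submit time rather than mid-job on the cluster
            raise RuntimeError(f"missing required secret: {env_var}")
        confs[conf_key] = value
    return confs
```

Each returned entry becomes a `--conf key=value` argument; a secrets manager sidecar or CI variable populates the environment before launch.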
Deployment Automation
Containerized Spark deployments on Kubernetes provide reproducibility and simplified operations. Here’s a production-ready setup:
```dockerfile
FROM apache/spark:3.5.0-scala2.12-java17-python3-ubuntu

USER root

# Install dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
      curl \
    && rm -rf /var/lib/apt/lists/*

# Copy application JAR and logging config
COPY --chown=spark:spark target/pipeline-1.0.0.jar /opt/spark/jars/
COPY --chown=spark:spark conf/log4j2.properties /opt/spark/conf/

USER spark
WORKDIR /opt/spark
ENTRYPOINT ["/opt/entrypoint.sh"]
```
```yaml
# kubernetes/spark-application.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: data-pipeline
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: registry.company.com/spark-pipeline:1.0.0
  imagePullPolicy: Always
  mainClass: com.company.DataPipeline
  mainApplicationFile: local:///opt/spark/jars/pipeline-1.0.0.jar
  sparkVersion: "3.5.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 30
    onSubmissionFailureRetries: 3
  driver:
    cores: 2
    memory: "4g"
    serviceAccount: spark-driver
  executor:
    cores: 5
    instances: 10
    memory: "18g"
  dynamicAllocation:
    enabled: true
    initialExecutors: 5
    minExecutors: 5
    maxExecutors: 50
```
Pre-Launch Validation Checklist
Before deploying, validate every component systematically:
| Category | Check | Command/Method |
|---|---|---|
| Dependencies | JAR conflicts resolved | spark-submit --verbose |
| Connectivity | Source systems reachable | Network connectivity test |
| Permissions | Service account authorized | IAM/Kerberos verification |
| Resources | Cluster capacity sufficient | Resource quota check |
| Configuration | All secrets mounted | Environment variable audit |
| Checkpoints | Storage writable | Write test to checkpoint path |
Automate validation with a health check script:
```python
#!/usr/bin/env python3
import sys

from pyspark.sql import SparkSession


def validate_cluster():
    checks = []

    # Verify Spark session creation; bail out early if this fails,
    # since every other check depends on it
    try:
        spark = SparkSession.builder \
            .appName("ClusterHealthCheck") \
            .getOrCreate()
        checks.append(("Spark Session", True))
    except Exception as e:
        checks.append(("Spark Session", False, str(e)))
        return checks

    # Verify executor allocation (subtract 1 for the driver)
    sc = spark.sparkContext
    executor_count = sc._jsc.sc().getExecutorMemoryStatus().size() - 1
    checks.append(("Executors Available", executor_count > 0,
                   f"Count: {executor_count}"))

    # Verify checkpoint path writable
    try:
        spark.range(1).write.mode("overwrite").parquet("s3a://bucket/health-check/")
        checks.append(("Checkpoint Storage", True))
    except Exception as e:
        checks.append(("Checkpoint Storage", False, str(e)))

    # Verify source connectivity; batch Kafka reads need concrete starting
    # offsets, so read the earliest-to-latest range rather than "latest"
    try:
        spark.read.format("kafka") \
            .option("kafka.bootstrap.servers", "kafka:9092") \
            .option("subscribe", "health-check") \
            .option("startingOffsets", "earliest") \
            .option("endingOffsets", "latest") \
            .load().limit(1).collect()
        checks.append(("Kafka Connectivity", True))
    except Exception as e:
        checks.append(("Kafka Connectivity", False, str(e)))

    spark.stop()
    return checks


if __name__ == "__main__":
    results = validate_cluster()
    failed = [c for c in results if not c[1]]
    for check in results:
        status = "✓" if check[1] else "✗"
        print(f"{status} {check[0]}: {check[2] if len(check) > 2 else 'OK'}")
    sys.exit(1 if failed else 0)
```
Run this script as a pre-deployment gate in your CI/CD pipeline. A passing health check doesn’t guarantee success, but a failing one guarantees problems.
Production Spark deployments reward preparation. Invest time in proper sizing, hardened configuration, comprehensive monitoring, and automated validation. The alternative—debugging production failures at 3 AM—is considerably more expensive.