Apache Spark - Production Deployment Checklist
Key Insights
- Proper executor sizing follows the “5 cores per executor” rule—more cores create excessive GC overhead, fewer waste parallelism potential
- Production Spark deployments fail most often due to missing observability, not incorrect business logic—instrument everything from day one
- Security and automation aren’t optional hardening steps; they’re foundational requirements that become exponentially harder to retrofit
Cluster Sizing and Resource Allocation
Getting resource allocation wrong is the fastest path to production incidents. Too little memory causes OOM kills. Too many cores per executor creates GC nightmares. The sweet spot requires understanding your workload characteristics.
Start with the “5 cores per executor” heuristic. HDFS clients perform poorly beyond 5 concurrent threads, and JVM garbage collection becomes problematic with larger heap sizes. For memory, reserve 10% for overhead and allocate based on your data’s memory footprint after deserialization.
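The arithmetic behind that heuristic can be sketched directly. The node shape below (16 cores, 64 GB) and the reserved amounts are illustrative assumptions, not recommendations; plug in your own hardware:

```python
# Sketch of the executor-sizing arithmetic described above, assuming
# hypothetical 16-core, 64 GB worker nodes.
def size_executors(cores_per_node: int, mem_gb_per_node: int,
                   cores_per_executor: int = 5) -> dict:
    # Reserve 1 core and 1 GB per node for the OS and cluster daemons.
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_gb_per_node - 1
    executors_per_node = usable_cores // cores_per_executor
    mem_per_executor = usable_mem_gb // executors_per_node
    # Roughly 10% of executor memory goes to off-heap overhead.
    overhead_gb = max(1, round(mem_per_executor * 0.10))
    return {
        "executors_per_node": executors_per_node,
        "executor_memory_gb": mem_per_executor - overhead_gb,
        "memory_overhead_gb": overhead_gb,
    }

print(size_executors(16, 64))
# 3 executors of 5 cores each, with ~10% carved out as memoryOverhead
```

Run against larger nodes, the same arithmetic produces figures close to the 18g/5-core executors in the submit script below.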
Here’s a production-ready spark-submit configuration:
```bash
#!/bin/bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --driver-cores 2 \
  --executor-memory 18g \
  --executor-cores 5 \
  --num-executors 20 \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.driver.memoryOverhead=1g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=5 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.dynamicAllocation.schedulerBacklogTimeout=1s \
  --conf spark.shuffle.service.enabled=true \
  --class com.company.DataPipeline \
  /opt/spark-apps/pipeline-1.0.0.jar
```
Dynamic allocation is non-negotiable for production. Fixed executor counts waste resources during light processing phases and starve jobs during heavy ones. Set minExecutors to handle baseline load and maxExecutors based on cluster capacity limits.
Configuration Hardening
Default Spark configurations optimize for development convenience, not production reliability. Your spark-defaults.conf needs deliberate tuning.
```properties
# Serialization - Kryo is significantly faster than Java serialization,
# often by as much as 10x
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired=false
spark.kryo.unsafe=true

# Memory management
spark.memory.fraction=0.6
spark.memory.storageFraction=0.5
spark.sql.shuffle.partitions=200

# Shuffle optimization
spark.shuffle.compress=true
spark.shuffle.spill.compress=true
spark.shuffle.file.buffer=1m
spark.reducer.maxSizeInFlight=96m
spark.shuffle.io.maxRetries=10
spark.shuffle.io.retryWait=30s

# Network timeouts for production stability
spark.network.timeout=600s
spark.executor.heartbeatInterval=30s
spark.rpc.askTimeout=300s

# Speculation for straggler mitigation
spark.speculation=true
spark.speculation.multiplier=1.5
spark.speculation.quantile=0.9

# SQL optimizations
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.broadcastTimeout=600
```
The adaptive query execution settings deserve attention. AQE dynamically coalesces shuffle partitions and splits skewed join partitions, problems that previously required manual intervention. It is enabled by default since Spark 3.2, but setting the flags explicitly keeps behavior unambiguous across versions.
Monitoring and Observability
Spark's built-in UI provides post-mortem analysis, but production requires real-time metrics. Push metrics to Prometheus or Graphite for alerting and trend analysis. Note that Spark itself ships only a Prometheus servlet endpoint; the push-gateway PrometheusSink shown below comes from a third-party jar that must be on the classpath.
Configure the metrics sink in metrics.properties:
```properties
*.sink.prometheus.class=org.apache.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address=prometheus-pushgateway:9091
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds
*.sink.prometheus.pushgateway-enable-timestamp=true

master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```
For custom job-level metrics, implement a SparkListener:
```scala
import org.apache.spark.Success
import org.apache.spark.scheduler._
import io.prometheus.client.{Counter, Histogram}

class ProductionSparkListener extends SparkListener {

  private val jobDuration = Histogram.build()
    .name("spark_job_duration_seconds")
    .help("Job execution duration")
    .labelNames("job_name", "status")
    .register()

  private val taskFailures = Counter.build()
    .name("spark_task_failures_total")
    .help("Total task failures")
    .labelNames("stage_id", "reason")
    .register()

  private val jobStartTimes = scala.collection.mutable.Map[Int, Long]()

  override def onJobStart(event: SparkListenerJobStart): Unit = {
    jobStartTimes(event.jobId) = System.currentTimeMillis()
  }

  override def onJobEnd(event: SparkListenerJobEnd): Unit = {
    val duration = (System.currentTimeMillis() -
      jobStartTimes.getOrElse(event.jobId, 0L)) / 1000.0
    val status = event.jobResult match {
      case JobSucceeded => "success"
      case _            => "failure"
    }
    jobDuration.labels(s"job_${event.jobId}", status).observe(duration)
    jobStartTimes.remove(event.jobId)
  }

  override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
    // Success is a case object, so match on the value, not a type pattern
    Option(event.reason).foreach {
      case Success => // successful tasks are not counted
      case reason =>
        taskFailures.labels(
          event.stageId.toString,
          reason.getClass.getSimpleName
        ).inc()
    }
  }
}
```
Register the listener via spark.extraListeners=com.company.ProductionSparkListener.
Data Pipeline Reliability
Streaming pipelines require checkpointing for fault tolerance. Batch pipelines benefit from it for recovery. Configure checkpoints to durable storage—never local disk.
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("ReliableStreamingPipeline")
  .config("spark.sql.streaming.checkpointLocation", "s3a://bucket/checkpoints/")
  .getOrCreate()

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")       // Production: handle missing offsets gracefully
  .option("maxOffsetsPerTrigger", 100000L) // Backpressure control
  .load()

val processedStream = kafkaStream
  .selectExpr("CAST(value AS STRING) as json")
  .writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "s3a://bucket/checkpoints/events-pipeline/")
  .option("mergeSchema", "true")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start("s3a://bucket/processed-events/")

processedStream.awaitTermination()
```
For exactly-once semantics, use idempotent sinks or transactional writes. Delta Lake and Apache Iceberg provide ACID guarantees. If you are on the legacy DStreams Kafka integration, set spark.streaming.kafka.consumer.cache.enabled=false to avoid cached-consumer conflicts during recovery; Structured Streaming manages its own consumer pool through the spark.kafka.consumer.cache.* settings.
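The bookkeeping behind an idempotent sink is simple enough to sketch in plain Python. The in-memory set below is an illustrative stand-in for a transactional store (e.g. a table recording the last committed batch id per query):

```python
# Sketch of batch-id idempotence: skip any batch that was already
# committed, so replays after a restart write nothing twice.
processed_batches = set()  # production: a durable, transactional store

def write_batch(batch_id, rows, sink):
    """Append rows to sink unless this batch id was already committed."""
    if batch_id in processed_batches:  # batch replayed after a failure
        return False
    sink.extend(rows)                  # the actual write; must be atomic
    processed_batches.add(batch_id)    # commit the id together with the data
    return True
```

In Structured Streaming the same logic runs inside `foreachBatch`, which hands each micro-batch a monotonically increasing batch id for exactly this purpose.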
Security Configuration
Production clusters require authentication, encryption, and access control. Kerberos remains the standard for enterprise deployments.
```bash
#!/bin/bash
# Secure spark-submit with Kerberos authentication
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/opt/spark

kinit -kt /etc/security/keytabs/spark.service.keytab spark/hostname@REALM.COM

# Notes:
# - With spark.authenticate=true on YARN, the auth secret is generated
#   automatically; do not set spark.authenticate.secret by hand.
# - Keystore passwords are injected from the environment, never hardcoded.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal spark/hostname@REALM.COM \
  --keytab /etc/security/keytabs/spark.service.keytab \
  --conf spark.kerberos.access.hadoopFileSystems=s3a://secure-bucket \
  --conf spark.authenticate=true \
  --conf spark.network.crypto.enabled=true \
  --conf spark.network.crypto.keyLength=256 \
  --conf spark.io.encryption.enabled=true \
  --conf spark.ssl.enabled=true \
  --conf spark.ssl.keyStore=/etc/spark/ssl/keystore.jks \
  --conf spark.ssl.keyStorePassword="${SPARK_SSL_KEYSTORE_PASSWORD}" \
  --conf spark.ssl.trustStore=/etc/spark/ssl/truststore.jks \
  --conf spark.ssl.trustStorePassword="${SPARK_SSL_TRUSTSTORE_PASSWORD}" \
  --class com.company.SecurePipeline \
  /opt/spark-apps/secure-pipeline-1.0.0.jar
```
Never embed secrets in configuration files. Use a secrets manager and inject credentials at runtime through environment variables or mounted files.
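As a minimal sketch of that pattern, credentials can be resolved from the environment just before the submit command is assembled. The variable and config names here mirror the SSL settings above but are otherwise illustrative:

```python
# Hypothetical helper: resolve SSL passwords from the environment at
# launch time so nothing sensitive lives in spark-defaults.conf.
import os

def secret_confs(env=os.environ) -> dict:
    confs = {}
    for conf_key, env_var in {
        "spark.ssl.keyStorePassword": "SPARK_SSL_KEYSTORE_PASSWORD",
        "spark.ssl.trustStorePassword": "SPARK_SSL_TRUSTSTORE_PASSWORD",
    }.items():
        value = env.get(env_var)
        if value is None:
            # Fail fast at submit time rather than mid-job on the cluster
            raise RuntimeError(f"missing required secret: {env_var}")
        confs[conf_key] = value
    return confs
```

Each returned entry becomes a `--conf key=value` argument; a secrets manager sidecar or CI variable populates the environment before launch.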
Deployment Automation
Containerized Spark deployments on Kubernetes provide reproducibility and simplified operations. Here’s a production-ready setup:
```dockerfile
FROM apache/spark:3.5.0-scala2.12-java17-python3-ubuntu

USER root

# Install dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
      curl \
    && rm -rf /var/lib/apt/lists/*

# Copy application JAR and logging config
COPY --chown=spark:spark target/pipeline-1.0.0.jar /opt/spark/jars/
COPY --chown=spark:spark conf/log4j2.properties /opt/spark/conf/

USER spark
WORKDIR /opt/spark
ENTRYPOINT ["/opt/entrypoint.sh"]
```
```yaml
# kubernetes/spark-application.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: data-pipeline
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: registry.company.com/spark-pipeline:1.0.0
  imagePullPolicy: Always
  mainClass: com.company.DataPipeline
  mainApplicationFile: local:///opt/spark/jars/pipeline-1.0.0.jar
  sparkVersion: "3.5.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 30
    onSubmissionFailureRetries: 3
  driver:
    cores: 2
    memory: "4g"
    serviceAccount: spark-driver
  executor:
    cores: 5
    instances: 10
    memory: "18g"
  dynamicAllocation:
    enabled: true
    initialExecutors: 5
    minExecutors: 5
    maxExecutors: 50
```
Pre-Launch Validation Checklist
Before deploying, validate every component systematically:
| Category | Check | Command/Method |
|---|---|---|
| Dependencies | JAR conflicts resolved | spark-submit --verbose |
| Connectivity | Source systems reachable | Network connectivity test |
| Permissions | Service account authorized | IAM/Kerberos verification |
| Resources | Cluster capacity sufficient | Resource quota check |
| Configuration | All secrets mounted | Environment variable audit |
| Checkpoints | Storage writable | Write test to checkpoint path |
Automate validation with a health check script:
```python
#!/usr/bin/env python3
import sys

from pyspark.sql import SparkSession


def validate_cluster():
    checks = []

    # Verify Spark session creation; bail out early if this fails,
    # since every other check depends on it
    try:
        spark = SparkSession.builder \
            .appName("ClusterHealthCheck") \
            .getOrCreate()
        checks.append(("Spark Session", True))
    except Exception as e:
        checks.append(("Spark Session", False, str(e)))
        return checks

    # Verify executor allocation (subtract 1 for the driver)
    sc = spark.sparkContext
    executor_count = sc._jsc.sc().getExecutorMemoryStatus().size() - 1
    checks.append(("Executors Available", executor_count > 0,
                   f"Count: {executor_count}"))

    # Verify checkpoint path writable
    try:
        spark.range(1).write.mode("overwrite").parquet("s3a://bucket/health-check/")
        checks.append(("Checkpoint Storage", True))
    except Exception as e:
        checks.append(("Checkpoint Storage", False, str(e)))

    # Verify source connectivity; batch Kafka reads need concrete starting
    # offsets, so read the earliest-to-latest range rather than "latest"
    try:
        spark.read.format("kafka") \
            .option("kafka.bootstrap.servers", "kafka:9092") \
            .option("subscribe", "health-check") \
            .option("startingOffsets", "earliest") \
            .option("endingOffsets", "latest") \
            .load().limit(1).collect()
        checks.append(("Kafka Connectivity", True))
    except Exception as e:
        checks.append(("Kafka Connectivity", False, str(e)))

    spark.stop()
    return checks


if __name__ == "__main__":
    results = validate_cluster()
    failed = [c for c in results if not c[1]]
    for check in results:
        status = "✓" if check[1] else "✗"
        print(f"{status} {check[0]}: {check[2] if len(check) > 2 else 'OK'}")
    sys.exit(1 if failed else 0)
```
Run this script as a pre-deployment gate in your CI/CD pipeline. A passing health check doesn’t guarantee success, but a failing one guarantees problems.
Production Spark deployments reward preparation. Invest time in proper sizing, hardened configuration, comprehensive monitoring, and automated validation. The alternative—debugging production failures at 3 AM—is considerably more expensive.