Apache Spark - spark-submit Command Guide
Key Insights
- spark-submit is the universal entry point for deploying Spark applications across all cluster managers—mastering its options is essential for production deployments
- Resource configuration (memory, executors, cores) directly impacts job performance and cost; there’s no one-size-fits-all setting, but understanding the trade-offs helps you tune effectively
- Most spark-submit failures stem from dependency conflicts, memory misconfiguration, or serialization issues—systematic debugging with proper logging saves hours of frustration
Introduction to spark-submit
spark-submit is the command-line tool that ships with Apache Spark for deploying applications to a cluster. Whether you’re running a batch ETL job, a streaming pipeline, or a machine learning workflow, spark-submit is how you get your code from development to execution.
You use spark-submit when you have a packaged application (JAR for Scala/Java, Python file, or R script) ready for production or testing on a cluster. It differs from interactive modes like spark-shell or PySpark REPL, which are great for exploration but impractical for scheduled jobs or CI/CD pipelines.
The tool handles the complexity of distributing your application code, managing dependencies, configuring resources, and communicating with your cluster manager. Understanding its options thoroughly separates engineers who “make Spark work” from those who make it work efficiently.
Basic Command Syntax and Structure
The fundamental structure of spark-submit follows a predictable pattern:
spark-submit [options] <app jar | python file> [app arguments]
For Scala or Java applications, you typically specify a JAR file and the main class:
spark-submit \
--class com.mycompany.analytics.DailyAggregation \
--master yarn \
/path/to/my-spark-app-1.0.jar \
--input /data/raw/2024-01-15 \
--output /data/processed/2024-01-15
For Python applications, you point directly to your main script:
spark-submit \
--master yarn \
/path/to/etl_pipeline.py \
--date 2024-01-15
Everything after the JAR or Python file becomes application arguments, passed to your main method or accessible via sys.argv in Python. Keep this separation clear—Spark options come before the application path, application arguments come after.
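On the application side, those trailing arguments arrive untouched in sys.argv. A minimal sketch of the argument-handling half of a PySpark entry point, using argparse (the flag names --input and --output mirror the example above and are application arguments, not spark-submit options):

```python
import argparse

def parse_app_args(argv):
    # argv is sys.argv[1:] inside the submitted script; spark-submit has
    # already consumed everything before the script path.
    parser = argparse.ArgumentParser(description="Example Spark application")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    return parser.parse_args(argv)
```

In the real script you would call parse_app_args(sys.argv[1:]) from your main block before creating the SparkSession.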
Essential Configuration Options
These flags control how Spark allocates resources and where your application runs. Getting them right is critical for performance.
--master: Specifies the cluster manager URL. Options include local[*] for local testing, spark://host:port for standalone clusters, yarn for Hadoop YARN, and k8s://https://host:port for Kubernetes.
--deploy-mode: Either client (driver runs on the submitting machine) or cluster (driver runs on a worker node). Use cluster mode for production jobs submitted from edge nodes or CI systems.
--executor-memory: Memory per executor JVM. This is the heap size, not total container memory.
--driver-memory: Memory for the driver JVM. Increase this for jobs with large broadcast variables or collect operations.
--num-executors: Number of executor instances. Only relevant when dynamic allocation is disabled.
--executor-cores: CPU cores per executor. Affects parallelism within each executor.
Here’s a typical configuration for a medium-sized batch job:
spark-submit \
--class com.mycompany.analytics.UserSessionAnalysis \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 8g \
--executor-cores 4 \
--num-executors 20 \
/path/to/analytics-app-2.0.jar
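It is worth knowing what this configuration actually asks the cluster for. A back-of-the-envelope sizing helper, assuming YARN's default overhead rule of 10% of heap with a 384 MB floor (the helper name is illustrative):

```python
def total_footprint_mb(num_executors, executor_mem_mb, driver_mem_mb):
    # Each JVM's container request is its heap plus the default
    # memory overhead: max(384 MB, 10% of heap).
    def with_overhead(mb):
        return mb + max(384, int(mb * 0.10))
    return num_executors * with_overhead(executor_mem_mb) + with_overhead(driver_mem_mb)
```

For the job above (20 executors at 8 GB plus a 4 GB driver) this comes to roughly 180 GB of cluster memory, noticeably more than the 164 GB of heap the flags appear to request.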
For fine-grained control, use --conf to set any Spark configuration property:
spark-submit \
--class com.mycompany.analytics.UserSessionAnalysis \
--master yarn \
--deploy-mode cluster \
--conf spark.sql.shuffle.partitions=400 \
--conf spark.sql.adaptive.enabled=true \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryoserializer.buffer.max=512m \
--conf spark.network.timeout=600s \
/path/to/analytics-app-2.0.jar
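Orchestrators often assemble these invocations programmatically. A hedged sketch of a builder that encodes the ordering rule from earlier (Spark options before the application path, application arguments after); the function name and config values are illustrative:

```python
def build_spark_submit(spark_options, confs, app_path, app_args):
    cmd = ["spark-submit"]
    for flag, value in spark_options:      # e.g. ("--master", "yarn")
        cmd += [flag, value]
    for key, value in confs.items():       # rendered as --conf key=value
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)                   # the JAR or .py file
    cmd += app_args                        # handed to the application itself
    return cmd
```

The resulting list can be passed straight to subprocess.run without shell quoting concerns.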
Cluster Manager Modes
spark-submit adapts to different cluster managers, but the submission syntax varies.
Local mode runs everything in a single JVM—useful for development and testing:
spark-submit \
--master local[4] \
--driver-memory 2g \
my_analysis.py
YARN is the most common production deployment. Cluster mode is preferred for scheduled jobs:
spark-submit \
--master yarn \
--deploy-mode cluster \
--queue production \
--driver-memory 4g \
--driver-cores 2 \
--executor-memory 8g \
--executor-cores 4 \
--num-executors 50 \
--conf spark.yarn.maxAppAttempts=2 \
--conf spark.yarn.submit.waitAppCompletion=true \
/path/to/my-app.jar
Kubernetes requires container image configuration and namespace settings:
spark-submit \
--master k8s://https://kubernetes-api-server:6443 \
--deploy-mode cluster \
--name spark-etl-job \
--conf spark.kubernetes.container.image=myregistry/spark:3.5.0 \
--conf spark.kubernetes.namespace=spark-jobs \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.executor.request.cores=2 \
--conf spark.kubernetes.executor.limit.cores=4 \
--conf spark.kubernetes.driver.request.cores=1 \
--conf spark.kubernetes.driver.limit.cores=2 \
--driver-memory 2g \
--executor-memory 4g \
--num-executors 10 \
local:///opt/spark/jars/my-app.jar
Note the local:// prefix for the JAR path—this tells Spark the file is already present in the container image, avoiding upload overhead.
Dependency Management
Real applications have dependencies. spark-submit provides several mechanisms to distribute them.
--jars: Distributes additional JARs to all executors. Use comma-separated paths:
spark-submit \
--master yarn \
--jars /path/to/postgresql-42.6.0.jar,/path/to/commons-csv-1.10.0.jar \
--class com.mycompany.DataExport \
my-app.jar
--packages: Downloads dependencies from Maven repositories automatically. This is cleaner than managing JAR files manually:
spark-submit \
--master yarn \
--packages org.postgresql:postgresql:42.6.0,org.apache.spark:spark-avro_2.12:3.5.0 \
--repositories https://maven.mycompany.com/releases \
--class com.mycompany.DataExport \
my-app.jar
--py-files: Distributes Python modules, ZIP files, or egg files for PySpark applications:
spark-submit \
--master yarn \
--py-files utils.zip,transformations.py,/path/to/my_package-1.0-py3-none-any.whl \
main_etl.py
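A ZIP like utils.zip above needs its package directory at the archive root so the name stays importable on executors. A minimal sketch of building one (paths and the helper name are illustrative):

```python
import pathlib
import zipfile

def zip_py_package(package_dir, zip_path):
    pkg = pathlib.Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(pkg.rglob("*.py")):
            # Archive names are relative to the package's parent, so the
            # ZIP contains utils/__init__.py rather than an absolute path.
            zf.write(path, path.relative_to(pkg.parent))
    return zip_path
```

After distribution, executors can simply `import utils` because Spark adds the ZIP to the Python path.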
--files: Distributes arbitrary files accessible via SparkFiles.get() in your application:
spark-submit \
--master yarn \
--files config.json,lookup_table.csv \
--class com.mycompany.ConfigurableJob \
my-app.jar
Common Patterns and Best Practices
Production spark-submit commands tend to be verbose. Wrap them in shell scripts for maintainability:
#!/bin/bash
set -euo pipefail
APP_JAR="/opt/spark-apps/etl-pipeline-3.2.1.jar"
MAIN_CLASS="com.mycompany.etl.DailyPipeline"
DATE=${1:-$(date -d "yesterday" +%Y-%m-%d)}
spark-submit \
--master yarn \
--deploy-mode cluster \
--name "daily-etl-${DATE}" \
--queue etl-production \
--driver-memory 4g \
--driver-cores 2 \
--executor-memory 8g \
--executor-cores 4 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=5 \
--conf spark.dynamicAllocation.maxExecutors=100 \
--conf spark.dynamicAllocation.executorIdleTimeout=60s \
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.adaptive.coalescePartitions.enabled=true \
--conf spark.sql.shuffle.partitions=400 \
--conf spark.speculation=true \
--conf spark.speculation.multiplier=1.5 \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=hdfs:///spark-logs \
--conf spark.yarn.maxAppAttempts=2 \
--class "${MAIN_CLASS}" \
"${APP_JAR}" \
--date "${DATE}" \
--env production
Key practices demonstrated here:
- Dynamic allocation scales executors based on workload, reducing costs for variable jobs
- Adaptive Query Execution (AQE) optimizes shuffle partitions at runtime
- Speculation reruns slow tasks on other executors, preventing stragglers from blocking completion
- Event logging enables post-mortem analysis via Spark History Server
Troubleshooting Common Errors
ClassNotFoundException or NoClassDefFoundError: Your dependency isn’t on the classpath. Verify your --jars or --packages flags. For shaded JARs, check that dependencies aren’t being excluded during packaging.
OutOfMemoryError on driver: Increase --driver-memory. This commonly happens with large collect() operations or broadcasting big datasets.
OutOfMemoryError on executors: Increase --executor-memory or reduce --executor-cores (fewer cores means less concurrent memory pressure). Also check for data skew causing some tasks to process disproportionate data.
Container killed by YARN for exceeding memory limits: Executor memory overhead (off-heap memory) isn’t included in --executor-memory. Increase spark.executor.memoryOverhead (default is 10% of executor memory or 384MB, whichever is larger).
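To see why the container limit is breached, compute what YARN is actually asked for. A sketch of the default overhead rule stated above (the helper name is illustrative):

```python
def container_request_mb(executor_memory_mb, overhead_mb=None):
    # Default overhead: max(384 MB, 10% of the executor heap).
    # Pass overhead_mb explicitly to model spark.executor.memoryOverhead.
    if overhead_mb is None:
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead_mb
```

An 8 GB executor therefore requests roughly 8.8 GB from YARN; if the queue's container maximum sits below that, the executor is killed even though the heap itself never fills.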
Enable verbose logging to diagnose issues:
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:log4j-debug.properties" \
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:log4j-debug.properties" \
--files log4j-debug.properties \
--verbose \
--class com.mycompany.ProblematicJob \
my-app.jar
The --verbose flag prints the resolved configuration, helping you verify that your settings are being applied correctly.
For serialization errors (NotSerializableException), ensure any objects used in closures are serializable. Consider using broadcast variables for large read-only data structures, and avoid capturing outer class references in anonymous functions.
spark-submit is deceptively simple on the surface but rewards deep understanding. Master its options, build repeatable submission scripts, and invest time in proper resource tuning—your production jobs will thank you.