# Apache Spark - Deploy Mode (Client vs Cluster)
## Key Insights
- Client mode keeps the driver on your submitting machine, making it ideal for interactive development and debugging, but creates a network dependency that can kill long-running jobs
- Cluster mode runs the driver inside the cluster, providing fault tolerance and eliminating network bottlenecks—essential for production workloads and automated pipelines
- Your choice between modes should be deliberate: client mode for development and ad-hoc analysis, cluster mode for anything that runs unattended or longer than a few minutes
## Introduction to Spark Deploy Modes
When you submit a Spark application, you’re making a fundamental architectural decision that affects reliability, debugging capability, and resource utilization. The deploy mode determines where your driver program runs—and this single choice cascades into how your application behaves under failure, how you access logs, and whether your laptop dying kills a four-hour ETL job.
Most developers learn Spark in client mode by default, then wonder why their production jobs fail mysteriously when their VPN disconnects. Understanding both modes isn’t optional knowledge—it’s the difference between applications that work on your machine and applications that work in production.
## Understanding the Driver Program
The driver is the brain of your Spark application. It runs your main() function, creates the SparkContext (or SparkSession), builds the execution plan, and coordinates work across executors. Every task schedule, every shuffle coordination, every result collection flows through the driver.
```python
from pyspark.sql import SparkSession

# This code runs on the driver
spark = SparkSession.builder \
    .appName("DataProcessingJob") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "2") \
    .getOrCreate()

# The driver builds the execution plan
df = spark.read.parquet("s3://bucket/input/")
result = df.filter(df.status == "active") \
    .groupBy("category") \
    .count()

# collect() brings data back to the driver
final_results = result.collect()

# Driver processes the collected results
for row in final_results:
    print(f"Category: {row.category}, Count: {row['count']}")

spark.stop()
```
The driver maintains continuous communication with executors throughout job execution. It assigns tasks, receives status updates, handles retries, and collects results. If the driver dies, your entire application dies. Where you run the driver—on your local machine or inside the cluster—fundamentally changes your application’s resilience profile.
## Client Mode Deep Dive
In client mode, the driver runs on the machine that submits the application. Your laptop, your jump box, your CI runner—wherever you execute spark-submit, that’s where the driver lives.
```bash
# Client mode on YARN
spark-submit \
  --master yarn \
  --deploy-mode client \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 10 \
  my_spark_job.py

# Client mode on Kubernetes
spark-submit \
  --master k8s://https://kubernetes-api:6443 \
  --deploy-mode client \
  --conf spark.kubernetes.container.image=my-spark:3.4 \
  --executor-memory 4g \
  my_spark_job.py

# Client mode on Standalone cluster
spark-submit \
  --master spark://master-node:7077 \
  --deploy-mode client \
  --executor-memory 4g \
  my_spark_job.py
```
Client mode shines for interactive work. When you run `pyspark` or `spark-shell`, you're using client mode. The driver runs locally, so you see output immediately, can set breakpoints, and iterate quickly.
The critical implication: your submitting machine must stay connected and running for the entire job duration. The driver needs continuous network access to executors. If your laptop sleeps, your VPN drops, or your SSH session times out, the job fails. There’s no recovery—executors lose contact with the driver and the application terminates.
For a 10-minute development job, this is fine. For a 6-hour production ETL, it’s a liability.
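If a client-mode run must outlive your terminal session, one mitigation is to launch `spark-submit` on a stable bastion host under `nohup` (or inside `tmux`/`screen`) so the driver process survives an SSH disconnect. A sketch, with hypothetical file names:

```shell
# Run on a bastion inside the cluster's network, not your laptop over VPN.
# nohup detaches the driver process from the SSH session;
# all output goes to driver.log for later inspection.
nohup spark-submit \
  --master yarn \
  --deploy-mode client \
  --executor-memory 4g \
  my_spark_job.py > driver.log 2>&1 &

# Record the PID so you can check on or stop the job later
echo $! > driver.pid
```

This only protects against your terminal dying; the bastion itself must still stay up and connected, so cluster mode remains the better answer for anything long-running.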
## Cluster Mode Deep Dive
In cluster mode, Spark submits your application to the cluster, and the cluster manager launches the driver on a worker node. Your submitting machine can disconnect immediately after submission—the driver runs independently inside the cluster.
```bash
# Cluster mode on YARN
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 10 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  my_spark_job.py

# Cluster mode on Kubernetes
spark-submit \
  --master k8s://https://kubernetes-api:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=my-spark:3.4 \
  --conf spark.kubernetes.driver.pod.name=my-job-driver \
  --executor-memory 4g \
  my_spark_job.py

# Cluster mode on Standalone
spark-submit \
  --master spark://master-node:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 4g \
  my_spark_job.py
```
The `--supervise` flag on Standalone clusters enables automatic driver restart on failure. YARN and Kubernetes have their own mechanisms for driver recovery.
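On YARN, for example, driver recovery is controlled through ApplicationMaster retry settings, since in cluster mode the driver runs inside the AM. A sketch using Spark's documented YARN configs:

```shell
# YARN cluster mode with driver (ApplicationMaster) retries.
# spark.yarn.maxAppAttempts: how many times YARN may relaunch the AM/driver
# attemptFailuresValidityInterval: failures older than this window
# no longer count against the attempt limit
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=3 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  my_spark_job.py
```

Note that a relaunched driver starts the application over from the beginning; retries help with transient node failures, not with making a job resumable.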
Log access differs significantly in cluster mode. Since the driver runs on a cluster node, you can’t just look at your terminal:
```bash
# YARN: retrieve aggregated logs after completion
yarn logs -applicationId application_1234567890_0001

# YARN: while the job is still running, use the ResourceManager or Spark UI;
# yarn logs only returns aggregated logs once containers finish
# (unless the cluster enables rolling log aggregation)

# Kubernetes: stream driver logs
kubectl logs -f my-job-driver

# Standalone: check the Spark UI or worker node logs
# Access via http://master-node:8080
```
## Side-by-Side Comparison
| Aspect | Client Mode | Cluster Mode |
|---|---|---|
| Driver location | Submitting machine | Cluster worker node |
| Network dependency | Continuous connection required | Only needed for submission |
| Stdout/stderr | Visible in terminal | Must retrieve from cluster |
| Interactive use | Fully supported | Not supported |
| Job survival | Dies if client disconnects | Survives client disconnect |
| Resource usage | Driver resources from client | Driver resources from cluster |
| Debugging | Easy—local breakpoints work | Harder—remote debugging needed |
| Production suitability | Poor | Excellent |
| Log access | Immediate | Requires retrieval |
| Typical latency | Lower for small jobs | Slightly higher startup |
## Choosing the Right Mode
Here’s a decision framework that works:
Use client mode when:
- Running interactive shells (`pyspark`, `spark-shell`)
- Developing and debugging applications
- Running ad-hoc queries where you need immediate feedback
- Job duration is under 30 minutes and you’re actively monitoring
- You need to set breakpoints or step through code
Use cluster mode when:
- Running production jobs
- Job duration exceeds 30 minutes
- Jobs run unattended (scheduled, triggered by events)
- Network reliability between client and cluster is uncertain
- Running from CI/CD pipelines
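The checklist above can be encoded as a small helper. This is a hypothetical sketch (the function and its parameters are illustrative, not part of any Spark API):

```python
def choose_deploy_mode(interactive: bool,
                       attended: bool,
                       est_minutes: int,
                       reliable_network: bool = True) -> str:
    """Pick a deploy mode per the checklist: client mode only for
    short, attended, interactive-style work on a reliable network."""
    if interactive:
        return "client"  # shells and debuggers need a local driver
    if not attended or est_minutes > 30 or not reliable_network:
        return "cluster"  # unattended, long, or flaky-network jobs
    return "client"

print(choose_deploy_mode(interactive=True, attended=True, est_minutes=5))     # client
print(choose_deploy_mode(interactive=False, attended=False, est_minutes=120)) # cluster
```

The point of writing it down is that the decision becomes a reviewable default in your tooling instead of an ad-hoc choice at each submission.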
Here’s how to implement environment-based mode selection in a CI/CD pipeline:
```yaml
# GitLab CI example
variables:
  SPARK_MASTER: "yarn"

.spark_job_template: &spark_job
  script:
    - |
      if [ "$CI_ENVIRONMENT_NAME" == "production" ]; then
        DEPLOY_MODE="cluster"
        WAIT_COMPLETION="false"
      else
        DEPLOY_MODE="client"
        WAIT_COMPLETION="true"
      fi
      spark-submit \
        --master ${SPARK_MASTER} \
        --deploy-mode ${DEPLOY_MODE} \
        --conf spark.yarn.submit.waitAppCompletion=${WAIT_COMPLETION} \
        --executor-memory ${EXECUTOR_MEMORY:-4g} \
        --num-executors ${NUM_EXECUTORS:-10} \
        ${SPARK_JOB_PATH}

staging_etl:
  <<: *spark_job
  environment: staging
  variables:
    SPARK_JOB_PATH: "jobs/daily_etl.py"

production_etl:
  <<: *spark_job
  environment: production
  variables:
    SPARK_JOB_PATH: "jobs/daily_etl.py"
    EXECUTOR_MEMORY: "8g"
    NUM_EXECUTORS: "50"
```
## Common Pitfalls and Troubleshooting
### Client Mode Network Timeouts
The most common client mode failure: network interruption kills the job. Symptoms include executors reporting “Lost connection to driver” and jobs failing after hours of successful processing.
Prevention: Don’t use client mode for long-running jobs. If you must, use a stable bastion host inside the same network as the cluster, not your laptop over VPN.
### Cluster Mode Dependency Shipping
In cluster mode, the driver runs on a cluster node that doesn’t have your local files. You must explicitly ship dependencies:
```bash
# Ship Python files
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files utils.py,helpers.py,lib.zip \
  main_job.py

# Ship JARs for UDFs or connectors
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars /path/to/custom-udf.jar,/path/to/connector.jar \
  --packages org.apache.spark:spark-avro_2.12:3.4.0 \
  main_job.py

# Ship configuration files
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files config.json,credentials.properties \
  --conf spark.executorEnv.CONFIG_PATH=./config.json \
  main_job.py
```
A common mistake: referencing absolute local paths in your code. In cluster mode, shipped files land in the working directory with their basename only:
```python
# Wrong - absolute path won't exist on cluster
config = load_config("/home/user/project/config.json")

# Right - relative path to shipped file
config = load_config("./config.json")

# Better - use SparkFiles for explicit handling
from pyspark import SparkFiles

config_path = SparkFiles.get("config.json")
config = load_config(config_path)
```
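For PySpark jobs with heavy third-party dependencies, shipping individual files doesn't scale. A common pattern is to pack the entire Python environment (e.g. with conda-pack or venv-pack) and ship it via `--archives`; the `#environment` suffix sets the directory name the archive unpacks into. A sketch, assuming a hypothetical archive named `pyspark_env.tar.gz`:

```shell
# Ship a packed conda/virtualenv; YARN unpacks it into ./environment
# on the driver and executors, and Spark is pointed at its interpreter.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives pyspark_env.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  main_job.py
```

This keeps the cluster nodes free of job-specific Python installs: every submission carries its own interpreter and libraries.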
### Memory Configuration Differences
In client mode, driver memory comes from your local machine. In cluster mode, it comes from cluster resources. Misconfiguration causes different failures:
```bash
# Client mode - driver memory comes from your machine
# (your laptop needs 8 GB available)
spark-submit \
  --deploy-mode client \
  --driver-memory 8g \
  job.py

# Cluster mode - driver memory comes from the cluster
# (a cluster node must have 8 GB+ available)
spark-submit \
  --deploy-mode cluster \
  --driver-memory 8g \
  --conf spark.driver.memoryOverhead=1g \
  job.py
```

Note the comments sit on their own lines: an inline `# comment` after a trailing `\` breaks the shell line continuation. Also, `spark.driver.memoryOverhead` supersedes the deprecated `spark.yarn.driver.memoryOverhead` in current Spark versions.
In cluster mode on YARN, ensure your driver memory plus overhead doesn’t exceed the maximum container size, or your job will hang waiting for resources.
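That constraint can be sanity-checked before submission. Per Spark's documentation, when no explicit overhead is set, the driver overhead defaults to max(10% of driver memory, 384 MB). A hypothetical helper (the function name and the example container limit are illustrative):

```python
def yarn_driver_container_mb(driver_memory_mb, overhead_mb=None):
    """Total container size YARN must allocate for the driver.
    With no explicit overhead, Spark uses max(10% of driver memory, 384 MB)."""
    if overhead_mb is None:
        overhead_mb = max(int(driver_memory_mb * 0.10), 384)
    return driver_memory_mb + overhead_mb

# 8 GB driver with default overhead: 8192 + 819 = 9011 MB
needed = yarn_driver_container_mb(8192)
max_container_mb = 10240  # yarn.scheduler.maximum-allocation-mb (example value)
assert needed <= max_container_mb, "driver container won't fit; job would hang"
print(needed)
```

If the assertion fails for your cluster's `yarn.scheduler.maximum-allocation-mb`, reduce `--driver-memory` or raise the YARN limit before submitting.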
### Log Hunting in Cluster Mode
When cluster mode jobs fail, finding logs requires knowing your cluster manager:
```bash
# YARN - get all logs
yarn logs -applicationId application_1234567890_0001 > job_logs.txt

# YARN - get only driver logs (container 000001 hosts the driver in cluster mode)
yarn logs -applicationId application_1234567890_0001 \
  -containerId container_1234567890_0001_01_000001

# Kubernetes - driver pod logs
kubectl logs spark-job-driver -n spark-namespace

# Kubernetes - logs from a restarted driver container's previous run
kubectl logs spark-job-driver -n spark-namespace --previous
```
Set up log aggregation (Elasticsearch, CloudWatch, Stackdriver) for production cluster mode jobs. Hunting through distributed logs manually doesn’t scale.
The deploy mode choice seems small but compounds across every job you run. Make it deliberately, and your production Spark applications will be dramatically more reliable.