# Apache Spark - Deploy Mode (Client vs Cluster)
## Key Insights
- Client mode keeps the driver on your submitting machine, making it ideal for interactive development and debugging, but creates a network dependency that can kill long-running jobs
- Cluster mode runs the driver inside the cluster, providing fault tolerance and eliminating network bottlenecks—essential for production workloads and automated pipelines
- Your choice between modes should be deliberate: client mode for development and ad-hoc analysis, cluster mode for anything that runs unattended or longer than a few minutes
## Introduction to Spark Deploy Modes
When you submit a Spark application, you’re making a fundamental architectural decision that affects reliability, debugging capability, and resource utilization. The deploy mode determines where your driver program runs—and this single choice cascades into how your application behaves under failure, how you access logs, and whether your laptop dying kills a four-hour ETL job.
Most developers learn Spark in client mode by default, then wonder why their production jobs fail mysteriously when their VPN disconnects. Understanding both modes isn’t optional knowledge—it’s the difference between applications that work on your machine and applications that work in production.
## Understanding the Driver Program
The driver is the brain of your Spark application. It runs your main() function, creates the SparkContext (or SparkSession), builds the execution plan, and coordinates work across executors. Every task schedule, every shuffle coordination, every result collection flows through the driver.
```python
from pyspark.sql import SparkSession

# This code runs on the driver
spark = SparkSession.builder \
    .appName("DataProcessingJob") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "2") \
    .getOrCreate()

# The driver builds the execution plan
df = spark.read.parquet("s3://bucket/input/")
result = df.filter(df.status == "active") \
    .groupBy("category") \
    .count()

# collect() brings data back to the driver
final_results = result.collect()

# Driver processes the collected results
for row in final_results:
    print(f"Category: {row.category}, Count: {row['count']}")

spark.stop()
```
The driver maintains continuous communication with executors throughout job execution. It assigns tasks, receives status updates, handles retries, and collects results. If the driver dies, your entire application dies. Where you run the driver—on your local machine or inside the cluster—fundamentally changes your application’s resilience profile.
## Client Mode Deep Dive
In client mode, the driver runs on the machine that submits the application. Your laptop, your jump box, your CI runner—wherever you execute spark-submit, that’s where the driver lives.
```bash
# Client mode on YARN
spark-submit \
  --master yarn \
  --deploy-mode client \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 10 \
  my_spark_job.py

# Client mode on Kubernetes
spark-submit \
  --master k8s://https://kubernetes-api:6443 \
  --deploy-mode client \
  --conf spark.kubernetes.container.image=my-spark:3.4 \
  --executor-memory 4g \
  my_spark_job.py

# Client mode on Standalone cluster
spark-submit \
  --master spark://master-node:7077 \
  --deploy-mode client \
  --executor-memory 4g \
  my_spark_job.py
```
Client mode shines for interactive work. When you run `pyspark` or `spark-shell`, you're using client mode. The driver runs locally, so you see output immediately, can set breakpoints, and iterate quickly.
The critical implication: your submitting machine must stay connected and running for the entire job duration. The driver needs continuous network access to executors. If your laptop sleeps, your VPN drops, or your SSH session times out, the job fails. There’s no recovery—executors lose contact with the driver and the application terminates.
For a 10-minute development job, this is fine. For a 6-hour production ETL, it’s a liability.
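If a client-mode run must outlive your terminal session, one mitigation is to launch `spark-submit` on a stable bastion host under `nohup` (or inside `tmux`/`screen`) so the driver process survives an SSH disconnect. A sketch, with hypothetical file names:

```shell
# Run on a bastion inside the cluster's network, not your laptop over VPN.
# nohup detaches the driver process from the SSH session;
# all output goes to driver.log for later inspection.
nohup spark-submit \
  --master yarn \
  --deploy-mode client \
  --executor-memory 4g \
  my_spark_job.py > driver.log 2>&1 &

# Record the PID so you can check on or stop the job later
echo $! > driver.pid
```

This only protects against your terminal dying; the bastion itself must still stay up and connected, so cluster mode remains the better answer for anything long-running.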
## Cluster Mode Deep Dive
In cluster mode, Spark submits your application to the cluster, and the cluster manager launches the driver on a worker node. Your submitting machine can disconnect immediately after submission—the driver runs independently inside the cluster.
```bash
# Cluster mode on YARN
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 10 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  my_spark_job.py

# Cluster mode on Kubernetes
spark-submit \
  --master k8s://https://kubernetes-api:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=my-spark:3.4 \
  --conf spark.kubernetes.driver.pod.name=my-job-driver \
  --executor-memory 4g \
  my_spark_job.py

# Cluster mode on Standalone
spark-submit \
  --master spark://master-node:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 4g \
  my_spark_job.py
```
The `--supervise` flag on Standalone clusters enables automatic driver restart on failure. YARN and Kubernetes have their own mechanisms for driver recovery.
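On YARN, for example, driver recovery is controlled through ApplicationMaster retry settings, since in cluster mode the driver runs inside the AM. A sketch using Spark's documented YARN configs:

```shell
# YARN cluster mode with driver (ApplicationMaster) retries.
# spark.yarn.maxAppAttempts: how many times YARN may relaunch the AM/driver
# attemptFailuresValidityInterval: failures older than this window
# no longer count against the attempt limit
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=3 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  my_spark_job.py
```

Note that a relaunched driver starts the application over from the beginning; retries help with transient node failures, not with making a job resumable.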
Log access differs significantly in cluster mode. Since the driver runs on a cluster node, you can’t just look at your terminal:
```bash
# YARN: retrieve aggregated logs after completion
yarn logs -applicationId application_1234567890_0001

# YARN: while the job is still running, use the ResourceManager or Spark UI;
# yarn logs only returns aggregated logs once containers finish
# (unless the cluster enables rolling log aggregation)

# Kubernetes: stream driver logs
kubectl logs -f my-job-driver

# Standalone: check the Spark UI or worker node logs
# Access via http://master-node:8080
```
## Side-by-Side Comparison
| Aspect | Client Mode | Cluster Mode |
|---|---|---|
| Driver location | Submitting machine | Cluster worker node |
| Network dependency | Continuous connection required | Only needed for submission |
| Stdout/stderr | Visible in terminal | Must retrieve from cluster |
| Interactive use | Fully supported | Not supported |
| Job survival | Dies if client disconnects | Survives client disconnect |
| Resource usage | Driver resources from client | Driver resources from cluster |
| Debugging | Easy—local breakpoints work | Harder—remote debugging needed |
| Production suitability | Poor | Excellent |
| Log access | Immediate | Requires retrieval |
| Typical latency | Lower for small jobs | Slightly higher startup |
## Choosing the Right Mode
Here’s a decision framework that works:
Use client mode when:
- Running interactive shells (`pyspark`, `spark-shell`)
- Developing and debugging applications
- Running ad-hoc queries where you need immediate feedback
- Job duration is under 30 minutes and you’re actively monitoring
- You need to set breakpoints or step through code
Use cluster mode when:
- Running production jobs
- Job duration exceeds 30 minutes
- Jobs run unattended (scheduled, triggered by events)
- Network reliability between client and cluster is uncertain
- Running from CI/CD pipelines
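The checklist above can be encoded as a small helper. This is a hypothetical sketch (the function and its parameters are illustrative, not part of any Spark API):

```python
def choose_deploy_mode(interactive: bool,
                       attended: bool,
                       est_minutes: int,
                       reliable_network: bool = True) -> str:
    """Pick a deploy mode per the checklist: client mode only for
    short, attended, interactive-style work on a reliable network."""
    if interactive:
        return "client"  # shells and debuggers need a local driver
    if not attended or est_minutes > 30 or not reliable_network:
        return "cluster"  # unattended, long, or flaky-network jobs
    return "client"

print(choose_deploy_mode(interactive=True, attended=True, est_minutes=5))     # client
print(choose_deploy_mode(interactive=False, attended=False, est_minutes=120)) # cluster
```

The point of writing it down is that the decision becomes a reviewable default in your tooling instead of an ad-hoc choice at each submission.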
Here’s how to implement environment-based mode selection in a CI/CD pipeline:
```yaml
# GitLab CI example
variables:
  SPARK_MASTER: "yarn"

.spark_job_template: &spark_job
  script:
    - |
      if [ "$CI_ENVIRONMENT_NAME" == "production" ]; then
        DEPLOY_MODE="cluster"
        WAIT_COMPLETION="false"
      else
        DEPLOY_MODE="client"
        WAIT_COMPLETION="true"
      fi
      spark-submit \
        --master ${SPARK_MASTER} \
        --deploy-mode ${DEPLOY_MODE} \
        --conf spark.yarn.submit.waitAppCompletion=${WAIT_COMPLETION} \
        --executor-memory ${EXECUTOR_MEMORY:-4g} \
        --num-executors ${NUM_EXECUTORS:-10} \
        ${SPARK_JOB_PATH}

staging_etl:
  <<: *spark_job
  environment: staging
  variables:
    SPARK_JOB_PATH: "jobs/daily_etl.py"

production_etl:
  <<: *spark_job
  environment: production
  variables:
    SPARK_JOB_PATH: "jobs/daily_etl.py"
    EXECUTOR_MEMORY: "8g"
    NUM_EXECUTORS: "50"
```
## Common Pitfalls and Troubleshooting
### Client Mode Network Timeouts
The most common client mode failure: network interruption kills the job. Symptoms include executors reporting “Lost connection to driver” and jobs failing after hours of successful processing.
Prevention: Don’t use client mode for long-running jobs. If you must, use a stable bastion host inside the same network as the cluster, not your laptop over VPN.
### Cluster Mode Dependency Shipping
In cluster mode, the driver runs on a cluster node that doesn’t have your local files. You must explicitly ship dependencies:
```bash
# Ship Python files
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files utils.py,helpers.py,lib.zip \
  main_job.py

# Ship JARs for UDFs or connectors
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars /path/to/custom-udf.jar,/path/to/connector.jar \
  --packages org.apache.spark:spark-avro_2.12:3.4.0 \
  main_job.py

# Ship configuration files
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files config.json,credentials.properties \
  --conf spark.executorEnv.CONFIG_PATH=./config.json \
  main_job.py
```
A common mistake: referencing absolute local paths in your code. In cluster mode, shipped files land in the working directory with their basename only:
```python
# Wrong - absolute path won't exist on cluster
config = load_config("/home/user/project/config.json")

# Right - relative path to shipped file
config = load_config("./config.json")

# Better - use SparkFiles for explicit handling
from pyspark import SparkFiles

config_path = SparkFiles.get("config.json")
config = load_config(config_path)
```
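For PySpark jobs with heavy third-party dependencies, shipping individual files doesn't scale. A common pattern is to pack the entire Python environment (e.g. with conda-pack or venv-pack) and ship it via `--archives`; the `#environment` suffix sets the directory name the archive unpacks into. A sketch, assuming a hypothetical archive named `pyspark_env.tar.gz`:

```shell
# Ship a packed conda/virtualenv; YARN unpacks it into ./environment
# on the driver and executors, and Spark is pointed at its interpreter.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives pyspark_env.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  main_job.py
```

This keeps the cluster nodes free of job-specific Python installs: every submission carries its own interpreter and libraries.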
### Memory Configuration Differences
In client mode, driver memory comes from your local machine. In cluster mode, it comes from cluster resources. Misconfiguration causes different failures:
```bash
# Client mode - driver memory comes from your machine
# (your laptop needs 8 GB available)
spark-submit \
  --deploy-mode client \
  --driver-memory 8g \
  job.py

# Cluster mode - driver memory comes from the cluster
# (a cluster node must have 8 GB+ available)
spark-submit \
  --deploy-mode cluster \
  --driver-memory 8g \
  --conf spark.driver.memoryOverhead=1g \
  job.py
```

Note the comments sit on their own lines: an inline `# comment` after a trailing `\` breaks the shell line continuation. Also, `spark.driver.memoryOverhead` supersedes the deprecated `spark.yarn.driver.memoryOverhead` in current Spark versions.
In cluster mode on YARN, ensure your driver memory plus overhead doesn’t exceed the maximum container size, or your job will hang waiting for resources.
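That constraint can be sanity-checked before submission. Per Spark's documentation, when no explicit overhead is set, the driver overhead defaults to max(10% of driver memory, 384 MB). A hypothetical helper (the function name and the example container limit are illustrative):

```python
def yarn_driver_container_mb(driver_memory_mb, overhead_mb=None):
    """Total container size YARN must allocate for the driver.
    With no explicit overhead, Spark uses max(10% of driver memory, 384 MB)."""
    if overhead_mb is None:
        overhead_mb = max(int(driver_memory_mb * 0.10), 384)
    return driver_memory_mb + overhead_mb

# 8 GB driver with default overhead: 8192 + 819 = 9011 MB
needed = yarn_driver_container_mb(8192)
max_container_mb = 10240  # yarn.scheduler.maximum-allocation-mb (example value)
assert needed <= max_container_mb, "driver container won't fit; job would hang"
print(needed)
```

If the assertion fails for your cluster's `yarn.scheduler.maximum-allocation-mb`, reduce `--driver-memory` or raise the YARN limit before submitting.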
### Log Hunting in Cluster Mode
When cluster mode jobs fail, finding logs requires knowing your cluster manager:
```bash
# YARN - get all logs
yarn logs -applicationId application_1234567890_0001 > job_logs.txt

# YARN - get only driver logs (container 000001 hosts the driver in cluster mode)
yarn logs -applicationId application_1234567890_0001 \
  -containerId container_1234567890_0001_01_000001

# Kubernetes - driver pod logs
kubectl logs spark-job-driver -n spark-namespace

# Kubernetes - logs from a restarted driver container's previous run
kubectl logs spark-job-driver -n spark-namespace --previous
```
Set up log aggregation (Elasticsearch, CloudWatch, Stackdriver) for production cluster mode jobs. Hunting through distributed logs manually doesn’t scale.
The deploy mode choice seems small but compounds across every job you run. Make it deliberately, and your production Spark applications will be dramatically more reliable.