Apache Spark - Spark on Kubernetes Tutorial

Key Insights

  • Spark’s native Kubernetes integration eliminates the need for a separate cluster manager like YARN or Mesos, letting you run Spark workloads on the same infrastructure as your other containerized applications.
  • Building custom Spark images and properly configuring RBAC is essential—most production issues stem from permission problems or missing dependencies in container images.
  • Dynamic allocation on Kubernetes requires careful configuration of shuffle tracking and pod templates, but gets you elastic scaling that actually works.

Introduction & Prerequisites

Kubernetes has become the dominant deployment platform for Spark workloads, and for good reason. Running Spark on Kubernetes gives you resource efficiency through bin-packing, simplified infrastructure management, and the ability to share compute resources with other applications. No more maintaining separate YARN or Mesos clusters.

Since Spark 2.3, Kubernetes has been a first-class cluster manager. Spark 3.x made it production-ready with features like dynamic allocation, pod templates, and improved shuffle handling. If you’re still running Spark on YARN in 2024, you’re likely paying for infrastructure complexity you don’t need.

Before starting, ensure you have:

  • A Kubernetes cluster (1.22+) with at least 8 CPU cores and 16GB RAM available
  • kubectl configured with cluster access
  • Apache Spark 3.4+ distribution downloaded locally
  • Docker installed for building images
  • A container registry you can push to (Docker Hub, ECR, GCR, etc.)

Architecture Overview

Spark on Kubernetes follows a straightforward model. When you submit an application, Spark creates a driver pod that runs your main program. The driver then requests executor pods from the Kubernetes API server, which schedules them across available nodes.

In cluster mode, the driver runs inside the cluster as a pod. This is what you want for production jobs—the driver survives network disconnections and can be managed by Kubernetes like any other workload.

In client mode, the driver runs on the machine where you execute spark-submit. Use this for interactive development and debugging, but never for production.
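Client mode adds one wrinkle: executors must be able to connect back to the driver. A sketch of the extra flags this typically requires (the host and port values are placeholders for your environment):

```shell
# Client mode only: make the driver reachable from executor pods.
# Host and port below are assumptions -- substitute your own values.
--deploy-mode client \
--conf spark.driver.host=<routable-host-or-ip> \
--conf spark.driver.port=7078 \
--conf spark.driver.bindAddress=0.0.0.0
```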

The lifecycle works like this:

  1. spark-submit creates the driver pod via Kubernetes API
  2. Driver pod starts and initializes SparkContext
  3. Driver requests executor pods based on your configuration
  4. Kubernetes schedules executor pods across nodes
  5. Executors register with the driver and begin processing
  6. When the job completes, executor pods terminate
  7. Driver pod enters “Completed” state (or gets cleaned up)

Executor pods are ephemeral. They’re created when needed and destroyed when the job finishes or when dynamic allocation scales down.

Building Spark Docker Images

Spark ships with a script to build container images, but you’ll almost certainly need customization. The base images lack common dependencies, and you’ll want to include your application JARs.

Start with the built-in tool:

cd $SPARK_HOME
./bin/docker-image-tool.sh \
  -r your-registry.com/spark \
  -t 3.4.1 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
  build

./bin/docker-image-tool.sh \
  -r your-registry.com/spark \
  -t 3.4.1 \
  push

For production, create a custom Dockerfile that extends the base image:

FROM your-registry.com/spark/spark-py:3.4.1

USER root

# Install additional Python packages for PySpark
RUN pip3 install --no-cache-dir \
    pandas==2.0.3 \
    pyarrow==12.0.1 \
    numpy==1.24.3

# Add custom JARs (Delta Lake, connectors, etc.)
COPY --chown=spark:spark jars/delta-core_2.12-2.4.0.jar /opt/spark/jars/
COPY --chown=spark:spark jars/aws-java-sdk-bundle-1.12.262.jar /opt/spark/jars/
COPY --chown=spark:spark jars/hadoop-aws-3.3.4.jar /opt/spark/jars/

# Add your application code
COPY --chown=spark:spark app/ /opt/spark/app/

USER spark

Build and push your custom image:

docker build -t your-registry.com/spark/spark-custom:3.4.1 .
docker push your-registry.com/spark/spark-custom:3.4.1

Configuring Kubernetes for Spark

Spark needs permissions to create and manage pods. The most common production issue I see is permission errors because someone skipped RBAC setup.

Create a dedicated namespace and service account:

# spark-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: spark-jobs
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-jobs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
  namespace: spark-jobs
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "persistentvolumeclaims"]
    verbs: ["create", "get", "list", "watch", "delete", "patch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: spark-jobs
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark-jobs
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io

Apply the configuration:

kubectl apply -f spark-namespace.yaml

For production clusters, add resource quotas to prevent runaway jobs from consuming all resources:

# spark-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota
  namespace: spark-jobs
spec:
  hard:
    requests.cpu: "32"
    requests.memory: 64Gi
    limits.cpu: "64"
    limits.memory: 128Gi
    pods: "50"
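Given this quota, you can sanity-check how many executors a job can even request. A quick arithmetic sketch using the 2-core driver and executor sizes from the submit examples in this tutorial:

```shell
# Executors that fit under the CPU request quota after the driver's share
# (integer arithmetic only; ignores the memory and pod-count limits)
max_executors() {
  local quota_cpu=$1 driver_cores=$2 executor_cores=$3
  echo $(( (quota_cpu - driver_cores) / executor_cores ))
}
max_executors 32 2 2   # 32-core quota -> room for 15 two-core executors
```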

Submitting Spark Applications

With infrastructure ready, submit your first application. Here’s a complete example:

$SPARK_HOME/bin/spark-submit \
  --master k8s://https://your-k8s-api-server:6443 \
  --deploy-mode cluster \
  --name my-spark-job \
  --class com.example.MySparkApp \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=your-registry.com/spark/spark-custom:3.4.1 \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.driver.cores=2 \
  --conf spark.driver.memory=4g \
  --conf spark.executor.instances=3 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=4g \
  --conf spark.kubernetes.driver.request.cores=2 \
  --conf spark.kubernetes.executor.request.cores=2 \
  local:///opt/spark/app/my-app.jar

Key points:

  • The master URL uses k8s:// prefix followed by your API server address
  • Use local:// for JARs baked into the container image
  • Set both Spark memory (spark.executor.memory) and Kubernetes resource requests
  • The request.cores settings control Kubernetes scheduling

For PySpark applications:

$SPARK_HOME/bin/spark-submit \
  --master k8s://https://your-k8s-api-server:6443 \
  --deploy-mode cluster \
  --name pyspark-job \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=your-registry.com/spark/spark-custom:3.4.1 \
  --conf spark.driver.memory=4g \
  --conf spark.executor.memory=4g \
  --conf spark.executor.instances=3 \
  local:///opt/spark/app/my_job.py
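If a PySpark job depends on modules beyond what's baked into the image, ship them with the submit. A sketch (the archive name and path are assumptions):

```shell
# Distribute extra Python code to the driver and executors
--py-files local:///opt/spark/app/deps.zip \
```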

Advanced Configuration & Tuning

Dynamic allocation lets Spark scale executors based on workload. On Kubernetes, this requires shuffle tracking since there’s no external shuffle service by default:

--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.shuffleTracking.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=10 \
--conf spark.dynamicAllocation.executorIdleTimeout=60s \
--conf spark.dynamicAllocation.schedulerBacklogTimeout=10s

For jobs with large shuffles or checkpointing, mount persistent volumes:

# spark-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-shuffle-pvc
  namespace: spark-jobs
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd

Reference the PVC in your submit command:

--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle-vol.mount.path=/data/shuffle \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle-vol.options.claimName=spark-shuffle-pvc \
--conf spark.local.dir=/data/shuffle
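One caveat: a ReadWriteOnce claim can attach to only one node, so a single shared PVC breaks as soon as executors land on different nodes. Since Spark 3.1 you can instead have Spark create a claim per executor with the special OnDemand claim name; a sketch using the same storage class as above:

```shell
# One PVC per executor, created with the pod and deleted with it
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle-vol.options.claimName=OnDemand \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle-vol.options.storageClass=fast-ssd \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle-vol.options.sizeLimit=100Gi \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle-vol.mount.path=/data/shuffle \
--conf spark.local.dir=/data/shuffle
```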

Use node selectors to target specific node pools:

--conf spark.kubernetes.node.selector.workload-type=spark \
--conf spark.kubernetes.driver.node.selector.workload-type=spark
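Selectors can't express everything; if the node pool is tainted, executors also need a toleration, which is where pod templates come in. A minimal executor template (the taint key and value mirror the selector above and are assumptions):

```yaml
# executor-pod-template.yaml -- a minimal sketch
apiVersion: v1
kind: Pod
spec:
  tolerations:
    - key: workload-type
      operator: Equal
      value: spark
      effect: NoSchedule
```

Reference it with --conf spark.kubernetes.executor.podTemplateFile=executor-pod-template.yaml (and spark.kubernetes.driver.podTemplateFile for the driver).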

Monitoring & Troubleshooting

Access the Spark UI via port-forwarding:

# Find the driver pod
kubectl get pods -n spark-jobs -l spark-role=driver

# Forward the UI port
kubectl port-forward -n spark-jobs spark-job-driver 4040:4040

Then open http://localhost:4040 in your browser.

Check driver logs when jobs fail:

# Stream logs from running driver
kubectl logs -f -n spark-jobs spark-job-driver

# Get logs from the previous container after a driver restart
# (plain kubectl logs also works on Completed pods)
kubectl logs -n spark-jobs spark-job-driver --previous

Inspect executor issues:

# List all pods for a specific job
kubectl get pods -n spark-jobs -l spark-app-selector=spark-job

# Check why an executor failed
kubectl describe pod -n spark-jobs spark-job-exec-1

# Get executor logs
kubectl logs -n spark-jobs spark-job-exec-1

Common failures and fixes:

ImagePullBackOff: Your container registry requires authentication. Create an image pull secret and reference it with spark.kubernetes.container.image.pullSecrets.
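A typical fix, sketched with placeholder credentials:

```shell
# Create the pull secret once per namespace (values are placeholders)
kubectl create secret docker-registry spark-regcred \
  -n spark-jobs \
  --docker-server=your-registry.com \
  --docker-username=<user> \
  --docker-password=<password>
```

Then submit with --conf spark.kubernetes.container.image.pullSecrets=spark-regcred.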

OOMKilled: Executor memory settings don’t account for off-heap usage, and the default overhead factor is only 0.1. Set spark.kubernetes.memoryOverheadFactor=0.4 for memory-intensive jobs.
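As a back-of-the-envelope check, the pod's memory request is roughly executor memory plus max(factor × memory, 384 MiB). A shell sketch of that arithmetic (a simplified model of Spark's formula, using integer math with the factor expressed as a percentage):

```shell
# Approximate executor pod memory request in MiB
pod_memory_mb() {
  local mem_mb=$1 factor_pct=$2
  local overhead=$(( mem_mb * factor_pct / 100 ))
  if [ "$overhead" -lt 384 ]; then overhead=384; fi
  echo $(( mem_mb + overhead ))
}
pod_memory_mb 4096 40   # 4g executor, factor 0.4 -> requests about 5734 MiB
```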

Pending pods: Insufficient cluster resources, or node selector constraints that are too restrictive. Check kubectl describe pod for scheduling failures.

Permission denied: RBAC misconfiguration. Verify the service account has the role binding and the namespace is correct.

Running Spark on Kubernetes simplifies operations significantly once you get past initial setup. The investment in proper RBAC, custom images, and monitoring pays off quickly when you’re managing dozens of jobs across a shared cluster.
