Apache Spark - Spark on YARN Tutorial
Key Insights
- YARN integration allows Spark to share cluster resources with other Hadoop ecosystem tools, eliminating the need for dedicated Spark clusters and improving overall resource utilization
- Choose client mode for interactive development and debugging, cluster mode for production jobs—the driver location fundamentally changes how you monitor and troubleshoot applications
- Memory configuration is the most common source of YARN container failures; always account for off-heap overhead by setting spark.executor.memoryOverhead (formerly spark.yarn.executor.memoryOverhead) appropriately
Introduction to Spark on YARN
Running Apache Spark on YARN (Yet Another Resource Negotiator) remains the most common deployment pattern in enterprise environments. If your organization already runs Hadoop, you have YARN. Rather than spinning up a separate Spark standalone cluster, you can leverage YARN to manage Spark workloads alongside MapReduce, Hive, and other Hadoop ecosystem tools.
YARN acts as the cluster’s operating system. It handles resource allocation, scheduling, and fault tolerance. Spark becomes just another application requesting containers from YARN, which means you get unified resource management, multi-tenancy support, and integration with existing security frameworks like Kerberos.
This tutorial walks you through deploying Spark on YARN from scratch, covering architecture, configuration, deployment modes, and the troubleshooting knowledge you’ll need when things go wrong.
Architecture Overview
Understanding the component relationships prevents confusion when debugging issues later.
YARN consists of two main daemons:
- ResourceManager (RM): The master daemon that arbitrates resources across all applications
- NodeManager (NM): Per-node agents that manage containers and report resource usage
When you submit a Spark application to YARN, here’s what happens:
- The Spark client contacts the ResourceManager requesting an ApplicationMaster container
- YARN allocates a container and launches the Spark ApplicationMaster
- The ApplicationMaster requests additional containers for Spark executors
- NodeManagers launch executor containers on allocated nodes
- Executors register with the Spark driver and begin processing tasks
┌─────────────────────────────────────────────────────────────┐
│ YARN Cluster │
│ ┌──────────────────┐ │
│ │ ResourceManager │◄─────── Resource Requests │
│ └────────┬─────────┘ │
│ │ │
│ ┌────────▼─────────┐ ┌──────────────────┐ │
│ │ NodeManager │ │ NodeManager │ │
│ │ ┌────────────┐ │ │ ┌────────────┐ │ │
│ │ │ App Master │ │ │ │ Executor │ │ │
│ │ │ (Driver) │ │ │ └────────────┘ │ │
│ │ └────────────┘ │ │ ┌────────────┐ │ │
│ │ ┌────────────┐ │ │ │ Executor │ │ │
│ │ │ Executor │ │ │ └────────────┘ │ │
│ │ └────────────┘ │ │ │ │
│ └──────────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Prerequisites and Environment Setup
Before running Spark on YARN, ensure you have:
- Java 8 or 11 (Java 17 supported in Spark 3.3+)
- Hadoop 2.7+ with YARN configured and running
- Apache Spark 3.x compatible with your Hadoop version
- Network connectivity to YARN ResourceManager
Set up the required environment variables:
#!/bin/bash
# spark-yarn-env.sh
# Java home - adjust for your installation
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
# Hadoop configuration directory
export HADOOP_CONF_DIR=/etc/hadoop/conf
# Alternative: YARN_CONF_DIR (Spark checks both)
export YARN_CONF_DIR=/etc/hadoop/conf
# Spark installation
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
# Optional: Hadoop native libraries for better performance
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
Verify your setup:
# Check Hadoop configuration is accessible
hadoop classpath
# Verify YARN is running
yarn node -list
# Test Spark can find YARN configuration
spark-submit --version
# Check YARN ResourceManager UI (default port)
curl -s http://resourcemanager-host:8088/ws/v1/cluster/info | jq .
The HADOOP_CONF_DIR must contain yarn-site.xml and core-site.xml. Spark reads these files to locate the ResourceManager and understand cluster configuration.
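If you want to script this check, here is a minimal sketch; check_yarn_conf is our own helper (not a Spark or Hadoop API), and the /etc/hadoop/conf fallback is the path assumed earlier in this tutorial:

```python
# Verify the client-side Hadoop config files Spark needs are readable.
import os

def check_yarn_conf(conf_dir=None):
    # Fall back to the directory used in the environment setup above.
    conf_dir = conf_dir or os.environ.get("HADOOP_CONF_DIR", "/etc/hadoop/conf")
    return {name: os.path.isfile(os.path.join(conf_dir, name))
            for name in ("yarn-site.xml", "core-site.xml")}

for name, ok in check_yarn_conf().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```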
Deployment Modes: Client vs Cluster
This distinction trips up newcomers more than any other concept. The deploy mode determines where the Spark driver runs.
Client Mode (--deploy-mode client):
- Driver runs on the machine where you execute spark-submit
- Stdout/stderr appear in your terminal
- Killing the terminal kills the application
- Best for: interactive development, spark-shell, notebooks
spark-submit \
--master yarn \
--deploy-mode client \
--num-executors 4 \
--executor-memory 4g \
--executor-cores 2 \
my_app.py
Cluster Mode (--deploy-mode cluster):
- Driver runs inside a YARN container on the cluster
- Your terminal can disconnect without killing the job
- Logs must be retrieved via YARN
- Best for: production jobs, scheduled workflows, long-running applications
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 4 \
--executor-memory 4g \
--executor-cores 2 \
my_app.py
Use client mode during development when you need immediate feedback. Switch to cluster mode for production deployments where jobs run unattended.
Configuring Spark for YARN
Proper configuration prevents most YARN-related failures. Create or modify $SPARK_HOME/conf/spark-defaults.conf:
# spark-defaults.conf
# Master setting (can override via command line)
spark.master yarn
# Memory settings
spark.driver.memory 2g
spark.executor.memory 4g
spark.executor.cores 2
# YARN-specific settings
# (the old spark.yarn.{executor,driver}.memoryOverhead names are deprecated since Spark 2.3)
spark.executor.memoryOverhead 512m
spark.driver.memoryOverhead 512m
spark.yarn.queue default
# Dynamic allocation - let YARN scale executors
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 20
spark.dynamicAllocation.initialExecutors 4
spark.shuffle.service.enabled true
# Application settings
spark.yarn.submit.waitAppCompletion true
spark.yarn.appMasterEnv.SPARK_HOME /opt/spark
# Archive Spark jars to HDFS for faster startup
spark.yarn.archive hdfs:///spark/spark-libs.jar
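When hand-editing these files it helps to remember the format: one property per line, key and value separated by whitespace, with # starting a comment. A throwaway parser sketch (parse_spark_defaults is our own helper, not part of Spark) makes the rule concrete:

```python
# Toy parser for the spark-defaults.conf format shown above:
# one "key value" pair per line, '#' begins a comment line.
def parse_spark_defaults(text):
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf

sample = """
# comment
spark.master              yarn
spark.executor.memory     4g
"""
print(parse_spark_defaults(sample))
```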
Upload Spark JARs to HDFS to avoid shipping them with every job:
# Create archive of Spark jars
cd $SPARK_HOME
jar cv0f spark-libs.jar -C jars/ .
# Upload to HDFS
hdfs dfs -mkdir -p /spark
hdfs dfs -put spark-libs.jar /spark/
Override configurations at submit time when needed:
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.executor.memory=8g \
--conf spark.executor.cores=4 \
--conf spark.yarn.queue=production \
--conf spark.dynamicAllocation.maxExecutors=50 \
my_production_job.py
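Submit-time --conf values take precedence over spark-defaults.conf entries with the same key (and properties set programmatically on a SparkConf rank higher still). Conceptually it is a simple dictionary merge:

```python
# Conceptual sketch of Spark's precedence: submit-time --conf values
# override spark-defaults.conf entries with the same key.
defaults = {"spark.executor.memory": "4g", "spark.yarn.queue": "default"}
submit_overrides = {"spark.executor.memory": "8g", "spark.yarn.queue": "production"}

effective = {**defaults, **submit_overrides}
print(effective)  # the submit-time values win
```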
Running Your First Application
Let’s submit a complete example. First, create a simple PySpark word count:
# wordcount.py
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder \
        .appName("YARN WordCount Example") \
        .getOrCreate()

    # Create sample data
    data = ["hello world", "hello spark", "spark on yarn tutorial"]
    rdd = spark.sparkContext.parallelize(data)

    # Word count
    counts = rdd.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .collect()

    for word, count in sorted(counts, key=lambda x: -x[1]):
        print(f"{word}: {count}")

    spark.stop()

if __name__ == "__main__":
    main()
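To sanity-check the pipeline's logic without a cluster, the same counting can be done in plain Python: the nested generator plays the role of flatMap, and Counter replaces the map/reduceByKey pair:

```python
# Plain-Python equivalent of the RDD pipeline above, useful for
# checking expected output before submitting to YARN.
from collections import Counter

data = ["hello world", "hello spark", "spark on yarn tutorial"]
counts = Counter(word for line in data for word in line.split(" "))

for word, count in counts.most_common():
    print(f"{word}: {count}")
```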
Submit to YARN:
spark-submit \
--master yarn \
--deploy-mode cluster \
--name "WordCount-Production" \
--num-executors 2 \
--executor-memory 2g \
--executor-cores 1 \
--conf spark.yarn.maxAppAttempts=2 \
wordcount.py
Monitor the application:
# List running applications
yarn application -list
# Get application status
yarn application -status application_1234567890123_0001
# View logs after completion (cluster mode)
yarn logs -applicationId application_1234567890123_0001
# View logs for specific container
yarn logs -applicationId application_1234567890123_0001 \
-containerId container_1234567890123_0001_01_000001
Access the YARN ResourceManager UI at http://resourcemanager:8088 to see running applications, resource usage, and access the Spark UI through the ApplicationMaster link.
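The same information is available programmatically through the ResourceManager's REST API (the /ws/v1/cluster endpoints behind that UI). A small polling sketch, assuming the resourcemanager:8088 address used above; fetch_apps requires a reachable RM:

```python
# Query the YARN ResourceManager REST API for applications.
import json
import urllib.request

def cluster_apps_url(rm_host="resourcemanager", port=8088, state="RUNNING"):
    # RM web services live under /ws/v1/cluster; "states" filters apps.
    return f"http://{rm_host}:{port}/ws/v1/cluster/apps?states={state}"

def fetch_apps(url):
    with urllib.request.urlopen(url) as resp:  # needs a reachable RM
        body = json.load(resp)
    # "apps" is null when no applications match the filter.
    return (body.get("apps") or {}).get("app", [])

print(cluster_apps_url())
```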
Troubleshooting and Best Practices
Container killed due to memory exceeded: The most common failure. YARN kills containers exceeding their memory allocation. The fix involves understanding that Spark’s executor.memory only covers heap space.
# Total container memory = executor.memory + memoryOverhead
# Default overhead = max(384MB, 0.1 * executor.memory)
# For a 4g executor, explicitly set overhead:
spark.executor.memory 4g
spark.executor.memoryOverhead 1g
# Total YARN container: 5g
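The arithmetic is worth scripting when sizing a cluster. A sketch of the default rule from the comment above (the larger of 384 MiB or 10% of executor memory); yarn_container_mb is our own helper, not a Spark API:

```python
# Estimate the total YARN container size for one executor, in MiB.
def yarn_container_mb(executor_memory_mb, overhead_mb=None, factor=0.10):
    if overhead_mb is None:
        # Default rule: max(384 MiB, 10% of executor heap), floored.
        overhead_mb = max(384, int(executor_memory_mb * factor))
    return executor_memory_mb + overhead_mb

print(yarn_container_mb(4096))        # default overhead: 4096 + 409
print(yarn_container_mb(4096, 1024))  # explicit 1g overhead: 5120
```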
Application not finding dependencies: Ship dependencies with your application or ensure they exist on all nodes:
spark-submit \
--master yarn \
--deploy-mode cluster \
--py-files dependencies.zip \
--jars external-lib.jar \
--files config.json \
my_app.py
Classpath issues with Hive or other Hadoop components:
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.HADOOP_USER_NAME=$USER \
--conf spark.executorEnv.HADOOP_USER_NAME=$USER \
--conf spark.driver.extraClassPath=/etc/hive/conf \
--conf spark.executor.extraClassPath=/etc/hive/conf \
my_app.py
Enable log aggregation for post-mortem debugging:
<!-- yarn-site.xml -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value> <!-- 7 days -->
</property>
Production best practices:
- Always use cluster mode for scheduled jobs
- Set spark.yarn.maxAppAttempts to allow retries on transient failures
- Configure spark.yarn.am.attemptFailuresValidityInterval to reset failure counts for long-running applications
- Use YARN queues to isolate production workloads from development
- Enable dynamic allocation to handle varying workloads efficiently
- Pre-upload Spark archives to HDFS to reduce job startup time
# Production submit template
spark-submit \
--master yarn \
--deploy-mode cluster \
--queue production \
--conf spark.yarn.maxAppAttempts=4 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.speculation=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
production_job.py
Running Spark on YARN requires understanding both systems. Master the configuration, know the difference between deployment modes, and always account for memory overhead. These fundamentals will save you hours of debugging when scaling to production workloads.