Apache Spark - Spark on YARN Tutorial
Key Insights
- YARN integration allows Spark to share cluster resources with other Hadoop ecosystem tools, eliminating the need for dedicated Spark clusters and improving overall resource utilization
- Choose client mode for interactive development and debugging, cluster mode for production jobs—the driver location fundamentally changes how you monitor and troubleshoot applications
- Memory configuration is the most common source of YARN container failures; always account for off-heap overhead by setting spark.executor.memoryOverhead (formerly spark.yarn.executor.memoryOverhead) appropriately
Introduction to Spark on YARN
Running Apache Spark on YARN (Yet Another Resource Negotiator) remains the most common deployment pattern in enterprise environments. If your organization already runs Hadoop, you have YARN. Rather than spinning up a separate Spark standalone cluster, you can leverage YARN to manage Spark workloads alongside MapReduce, Hive, and other Hadoop ecosystem tools.
YARN acts as the cluster’s operating system. It handles resource allocation, scheduling, and fault tolerance. Spark becomes just another application requesting containers from YARN, which means you get unified resource management, multi-tenancy support, and integration with existing security frameworks like Kerberos.
This tutorial walks you through deploying Spark on YARN from scratch, covering architecture, configuration, deployment modes, and the troubleshooting knowledge you’ll need when things go wrong.
Architecture Overview
Understanding the component relationships prevents confusion when debugging issues later.
YARN consists of two main daemons:
- ResourceManager (RM): The master daemon that arbitrates resources across all applications
- NodeManager (NM): Per-node agents that manage containers and report resource usage
When you submit a Spark application to YARN, here’s what happens:
- The Spark client contacts the ResourceManager requesting an ApplicationMaster container
- YARN allocates a container and launches the Spark ApplicationMaster
- The ApplicationMaster requests additional containers for Spark executors
- NodeManagers launch executor containers on allocated nodes
- Executors register with the Spark driver and begin processing tasks
┌─────────────────────────────────────────────────────────────┐
│ YARN Cluster │
│ ┌──────────────────┐ │
│ │ ResourceManager │◄─────── Resource Requests │
│ └────────┬─────────┘ │
│ │ │
│ ┌────────▼─────────┐ ┌──────────────────┐ │
│ │ NodeManager │ │ NodeManager │ │
│ │ ┌────────────┐ │ │ ┌────────────┐ │ │
│ │ │ App Master │ │ │ │ Executor │ │ │
│ │ │ (Driver) │ │ │ └────────────┘ │ │
│ │ └────────────┘ │ │ ┌────────────┐ │ │
│ │ ┌────────────┐ │ │ │ Executor │ │ │
│ │ │ Executor │ │ │ └────────────┘ │ │
│ │ └────────────┘ │ │ │ │
│ └──────────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Prerequisites and Environment Setup
Before running Spark on YARN, ensure you have:
- Java 8 or 11 (Java 17 supported in Spark 3.3+)
- Hadoop 2.7+ with YARN configured and running
- Apache Spark 3.x compatible with your Hadoop version
- Network connectivity to YARN ResourceManager
Set up the required environment variables:
#!/bin/bash
# spark-yarn-env.sh
# Java home - adjust for your installation
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
# Hadoop configuration directory
export HADOOP_CONF_DIR=/etc/hadoop/conf
# Alternative: YARN_CONF_DIR (Spark checks both)
export YARN_CONF_DIR=/etc/hadoop/conf
# Spark installation
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
# Optional: Hadoop native libraries for better performance
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
Verify your setup:
# Check Hadoop configuration is accessible
hadoop classpath
# Verify YARN is running
yarn node -list
# Test Spark can find YARN configuration
spark-submit --version
# Check YARN ResourceManager UI (default port)
curl -s http://resourcemanager-host:8088/ws/v1/cluster/info | jq .
The HADOOP_CONF_DIR must contain yarn-site.xml and core-site.xml. Spark reads these files to locate the ResourceManager and understand cluster configuration.
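If you want to script this check, here is a minimal sketch; check_yarn_conf is our own helper (not a Spark or Hadoop API), and the /etc/hadoop/conf fallback is the path assumed earlier in this tutorial:

```python
# Verify the client-side Hadoop config files Spark needs are readable.
import os

def check_yarn_conf(conf_dir=None):
    # Fall back to the directory used in the environment setup above.
    conf_dir = conf_dir or os.environ.get("HADOOP_CONF_DIR", "/etc/hadoop/conf")
    return {name: os.path.isfile(os.path.join(conf_dir, name))
            for name in ("yarn-site.xml", "core-site.xml")}

for name, ok in check_yarn_conf().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```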
Deployment Modes: Client vs Cluster
This distinction trips up newcomers more than any other concept. The deploy mode determines where the Spark driver runs.
Client Mode (--deploy-mode client):
- Driver runs on the machine where you execute spark-submit
- Stdout/stderr appear in your terminal
- Killing the terminal kills the application
- Best for: interactive development, spark-shell, notebooks
spark-submit \
--master yarn \
--deploy-mode client \
--num-executors 4 \
--executor-memory 4g \
--executor-cores 2 \
my_app.py
Cluster Mode (--deploy-mode cluster):
- Driver runs inside a YARN container on the cluster
- Your terminal can disconnect without killing the job
- Logs must be retrieved via YARN
- Best for: production jobs, scheduled workflows, long-running applications
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 4 \
--executor-memory 4g \
--executor-cores 2 \
my_app.py
Use client mode during development when you need immediate feedback. Switch to cluster mode for production deployments where jobs run unattended.
Configuring Spark for YARN
Proper configuration prevents most YARN-related failures. Create or modify $SPARK_HOME/conf/spark-defaults.conf:
# spark-defaults.conf
# Master setting (can override via command line)
spark.master yarn
# Memory settings
spark.driver.memory 2g
spark.executor.memory 4g
spark.executor.cores 2
# YARN-specific settings
# (the old spark.yarn.{executor,driver}.memoryOverhead names are deprecated since Spark 2.3)
spark.executor.memoryOverhead 512m
spark.driver.memoryOverhead 512m
spark.yarn.queue default
# Dynamic allocation - let YARN scale executors
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 20
spark.dynamicAllocation.initialExecutors 4
spark.shuffle.service.enabled true
# Application settings
spark.yarn.submit.waitAppCompletion true
spark.yarn.appMasterEnv.SPARK_HOME /opt/spark
# Archive Spark jars to HDFS for faster startup
spark.yarn.archive hdfs:///spark/spark-libs.jar
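When hand-editing these files it helps to remember the format: one property per line, key and value separated by whitespace, with # starting a comment. A throwaway parser sketch (parse_spark_defaults is our own helper, not part of Spark) makes the rule concrete:

```python
# Toy parser for the spark-defaults.conf format shown above:
# one "key value" pair per line, '#' begins a comment line.
def parse_spark_defaults(text):
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf

sample = """
# comment
spark.master              yarn
spark.executor.memory     4g
"""
print(parse_spark_defaults(sample))
```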
Upload Spark JARs to HDFS to avoid shipping them with every job:
# Create archive of Spark jars
cd $SPARK_HOME
jar cv0f spark-libs.jar -C jars/ .
# Upload to HDFS
hdfs dfs -mkdir -p /spark
hdfs dfs -put spark-libs.jar /spark/
Override configurations at submit time when needed:
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.executor.memory=8g \
--conf spark.executor.cores=4 \
--conf spark.yarn.queue=production \
--conf spark.dynamicAllocation.maxExecutors=50 \
my_production_job.py
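Submit-time --conf values take precedence over spark-defaults.conf entries with the same key (and properties set programmatically on a SparkConf rank higher still). Conceptually it is a simple dictionary merge:

```python
# Conceptual sketch of Spark's precedence: submit-time --conf values
# override spark-defaults.conf entries with the same key.
defaults = {"spark.executor.memory": "4g", "spark.yarn.queue": "default"}
submit_overrides = {"spark.executor.memory": "8g", "spark.yarn.queue": "production"}

effective = {**defaults, **submit_overrides}
print(effective)  # the submit-time values win
```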
Running Your First Application
Let’s submit a complete example. First, create a simple PySpark word count:
# wordcount.py
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder \
        .appName("YARN WordCount Example") \
        .getOrCreate()

    # Create sample data
    data = ["hello world", "hello spark", "spark on yarn tutorial"]
    rdd = spark.sparkContext.parallelize(data)

    # Word count
    counts = rdd.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .collect()

    for word, count in sorted(counts, key=lambda x: -x[1]):
        print(f"{word}: {count}")

    spark.stop()

if __name__ == "__main__":
    main()
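To sanity-check the pipeline's logic without a cluster, the same counting can be done in plain Python: the nested generator plays the role of flatMap, and Counter replaces the map/reduceByKey pair:

```python
# Plain-Python equivalent of the RDD pipeline above, useful for
# checking expected output before submitting to YARN.
from collections import Counter

data = ["hello world", "hello spark", "spark on yarn tutorial"]
counts = Counter(word for line in data for word in line.split(" "))

for word, count in counts.most_common():
    print(f"{word}: {count}")
```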
Submit to YARN:
spark-submit \
--master yarn \
--deploy-mode cluster \
--name "WordCount-Production" \
--num-executors 2 \
--executor-memory 2g \
--executor-cores 1 \
--conf spark.yarn.maxAppAttempts=2 \
wordcount.py
Monitor the application:
# List running applications
yarn application -list
# Get application status
yarn application -status application_1234567890123_0001
# View logs after completion (cluster mode)
yarn logs -applicationId application_1234567890123_0001
# View logs for specific container
yarn logs -applicationId application_1234567890123_0001 \
-containerId container_1234567890123_0001_01_000001
Access the YARN ResourceManager UI at http://resourcemanager:8088 to see running applications, resource usage, and access the Spark UI through the ApplicationMaster link.
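The same information is available programmatically through the ResourceManager's REST API (the /ws/v1/cluster endpoints behind that UI). A small polling sketch, assuming the resourcemanager:8088 address used above; fetch_apps requires a reachable RM:

```python
# Query the YARN ResourceManager REST API for applications.
import json
import urllib.request

def cluster_apps_url(rm_host="resourcemanager", port=8088, state="RUNNING"):
    # RM web services live under /ws/v1/cluster; "states" filters apps.
    return f"http://{rm_host}:{port}/ws/v1/cluster/apps?states={state}"

def fetch_apps(url):
    with urllib.request.urlopen(url) as resp:  # needs a reachable RM
        body = json.load(resp)
    # "apps" is null when no applications match the filter.
    return (body.get("apps") or {}).get("app", [])

print(cluster_apps_url())
```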
Troubleshooting and Best Practices
Container killed due to memory exceeded: The most common failure. YARN kills containers exceeding their memory allocation. The fix involves understanding that Spark’s executor.memory only covers heap space.
# Total container memory = executor.memory + memoryOverhead
# Default overhead = max(384MB, 0.1 * executor.memory)
# For a 4g executor, explicitly set overhead:
spark.executor.memory 4g
spark.executor.memoryOverhead 1g
# Total YARN container: 5g
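The arithmetic is worth scripting when sizing a cluster. A sketch of the default rule from the comment above (the larger of 384 MiB or 10% of executor memory); yarn_container_mb is our own helper, not a Spark API:

```python
# Estimate the total YARN container size for one executor, in MiB.
def yarn_container_mb(executor_memory_mb, overhead_mb=None, factor=0.10):
    if overhead_mb is None:
        # Default rule: max(384 MiB, 10% of executor heap), floored.
        overhead_mb = max(384, int(executor_memory_mb * factor))
    return executor_memory_mb + overhead_mb

print(yarn_container_mb(4096))        # default overhead: 4096 + 409
print(yarn_container_mb(4096, 1024))  # explicit 1g overhead: 5120
```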
Application not finding dependencies: Ship dependencies with your application or ensure they exist on all nodes:
spark-submit \
--master yarn \
--deploy-mode cluster \
--py-files dependencies.zip \
--jars external-lib.jar \
--files config.json \
my_app.py
Classpath issues with Hive or other Hadoop components:
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.HADOOP_USER_NAME=$USER \
--conf spark.executorEnv.HADOOP_USER_NAME=$USER \
--conf spark.driver.extraClassPath=/etc/hive/conf \
--conf spark.executor.extraClassPath=/etc/hive/conf \
my_app.py
Enable log aggregation for post-mortem debugging:
<!-- yarn-site.xml -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value> <!-- 7 days -->
</property>
Production best practices:
- Always use cluster mode for scheduled jobs
- Set spark.yarn.maxAppAttempts to allow retries on transient failures
- Configure spark.yarn.am.attemptFailuresValidityInterval to reset failure counts for long-running applications
- Use YARN queues to isolate production workloads from development
- Enable dynamic allocation to handle varying workloads efficiently
- Pre-upload Spark archives to HDFS to reduce job startup time
# Production submit template
spark-submit \
--master yarn \
--deploy-mode cluster \
--queue production \
--conf spark.yarn.maxAppAttempts=4 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.speculation=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
production_job.py
Running Spark on YARN requires understanding both systems. Master the configuration, know the difference between deployment modes, and always account for memory overhead. These fundamentals will save you hours of debugging when scaling to production workloads.