Apache Spark - Environment Variables Configuration

Key Insights

  • Spark environment variables provide the foundation for cluster behavior, and misconfiguration here cascades into memory errors, network failures, and performance degradation that are difficult to diagnose later.
  • The spark-env.sh file executes before every Spark process starts, making it the single most important configuration touchpoint for consistent cluster behavior.
  • Memory and core settings require careful calculation based on actual hardware resources—overcommitting leads to OOM kills, while undercommitting wastes expensive cluster capacity.

Introduction to Spark Environment Configuration

Apache Spark’s flexibility comes with configuration complexity. Before your Spark application processes a single record, dozens of environment variables influence how the JVM starts, how much memory gets allocated, which network interfaces bind, and where logs land. Getting these right means the difference between a stable production cluster and 3 AM pages about executor failures.

Environment variables in Spark operate at a lower level than SparkConf properties. They control the runtime environment itself—JVM settings, Python interpreters, file paths, and network bindings. You can’t set JAVA_HOME from within your Spark application because the JVM needs that value before your code even loads.

This article covers the environment variables that matter most in production Spark deployments. I’ll skip the obscure options and focus on what you’ll actually configure.

Core Configuration Files

Spark reads configuration from two primary files in the $SPARK_HOME/conf directory: spark-env.sh and spark-defaults.conf. Understanding when each gets loaded prevents confusion about why your settings aren’t taking effect.

spark-env.sh is a shell script that Spark sources before launching any process. It runs in the shell context, so you can include conditional logic, read from other files, or compute values dynamically. This script sets environment variables that affect the JVM and runtime environment.

spark-defaults.conf contains Spark configuration properties (the spark.* namespace) as key-value pairs. These get passed to SparkConf and affect Spark’s internal behavior rather than the runtime environment.

The loading order matters: spark-env.sh runs first, then command-line arguments override defaults, and finally programmatic SparkConf settings take precedence for properties (but can’t override environment variables).
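That precedence chain for a single property can be sketched as shell fallbacks. The values here are hypothetical; the highest-precedence non-empty layer wins:

```shell
#!/usr/bin/env bash
# Sketch of property resolution for spark.executor.memory, assuming a
# value is present at each layer (all three values are illustrative).
defaults_conf="4g"   # from spark-defaults.conf
cli_flag="8g"        # from: spark-submit --conf spark.executor.memory=8g
sparkconf="12g"      # from: SparkConf().set("spark.executor.memory", "12g")

# Programmatic SparkConf wins, then the CLI flag, then the defaults file.
effective="${sparkconf:-${cli_flag:-$defaults_conf}}"
echo "spark.executor.memory resolves to ${effective}"   # 12g
```

If the application never sets the property programmatically, `sparkconf` is empty and the spark-submit flag takes over, and so on down the chain.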

#!/usr/bin/env bash
# spark-env.sh - Template for production deployments

# Note: spark-env.sh is sourced by Spark's launch scripts, so avoid
# `set -e` here; a failing command would abort the calling script too

# Core paths
export SPARK_HOME="/opt/spark"
export SPARK_CONF_DIR="${SPARK_HOME}/conf"
export SPARK_LOG_DIR="/var/log/spark"
export SPARK_PID_DIR="/var/run/spark"

# Java configuration
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk"

# Worker configuration (set on worker nodes)
export SPARK_WORKER_DIR="/data/spark/work"

# Source local overrides if present
if [ -f "${SPARK_CONF_DIR}/spark-env-local.sh" ]; then
    source "${SPARK_CONF_DIR}/spark-env-local.sh"
fi

The template above demonstrates a pattern I recommend: keep spark-env.sh minimal and source node-specific overrides from a separate file. This lets you maintain a single base configuration across your cluster while accommodating hardware differences.

Essential Memory & Resource Variables

Memory configuration causes more Spark failures than any other setting. The variables here control how much memory Spark requests and how it distributes resources across workers.

SPARK_EXECUTOR_MEMORY sets the heap size for executor JVMs. This isn’t the total memory an executor uses—account for off-heap storage, Python processes, and JVM overhead. A safe rule: set executor memory to no more than 75% of the memory you want each executor to consume.

SPARK_DRIVER_MEMORY controls the driver’s heap. For applications that collect large results or broadcast substantial data, you’ll need more driver memory. Start with 2-4GB for typical workloads.

SPARK_WORKER_CORES limits how many cores a worker offers to the cluster. On shared machines, set this below the actual core count to leave headroom for the OS and other processes.

SPARK_WORKER_MEMORY caps the total memory a worker can allocate to executors. This prevents a single worker from consuming all system memory when running multiple executors.
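The 75% rule above can be expressed as a quick shell calculation. The 16GB per-executor budget here is illustrative:

```shell
#!/usr/bin/env bash
# Derive an executor heap from a per-executor memory budget using the
# 75% rule of thumb described above. The budget value is illustrative.
budget_mb=16384                        # total memory one executor should consume
heap_mb=$(( budget_mb * 75 / 100 ))    # reserve 25% for JVM overhead, off-heap, Python
export SPARK_EXECUTOR_MEMORY="$(( heap_mb / 1024 ))g"
echo "SPARK_EXECUTOR_MEMORY=${SPARK_EXECUTOR_MEMORY}"   # 12g
```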

# spark-env.sh - Memory configuration for a 64GB, 16-core node
# Reserve 4GB for OS and other processes

# Worker-level limits
export SPARK_WORKER_CORES=14
export SPARK_WORKER_MEMORY="56g"

# Default executor settings (can be overridden per-application)
export SPARK_EXECUTOR_MEMORY="12g"
export SPARK_EXECUTOR_CORES=4

# Driver memory for cluster mode
export SPARK_DRIVER_MEMORY="4g"

# Off-heap memory for Tungsten and Python
export SPARK_EXECUTOR_OPTS="-XX:MaxDirectMemorySize=2g"

With this configuration on a 64GB node, each worker runs 3 executors (14 cores / 4 cores each, rounded down) with 12GB heap each. Total executor heap: 36GB, which fits comfortably under the 56GB worker limit and leaves room for off-heap usage, Python workers, and system overhead.
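A quick sanity check: executors per worker are bounded by both cores and memory, and the smaller bound wins. A sketch with the numbers above (variable names are illustrative, not Spark environment variables):

```shell
#!/usr/bin/env bash
# Compute how many executors fit on one worker given the limits above.
worker_cores=14
worker_memory_gb=56
executor_cores=4
executor_memory_gb=12

by_cores=$(( worker_cores / executor_cores ))            # 3 (integer division rounds down)
by_memory=$(( worker_memory_gb / executor_memory_gb ))   # 4
executors=$(( by_cores < by_memory ? by_cores : by_memory ))
echo "Executors per worker: ${executors}"                # 3
```

Here cores are the binding constraint, so the worker runs 3 executors, not the 4 that the memory limit alone would allow.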

Java & Runtime Variables

Spark runs on the JVM, and JVM configuration directly impacts performance. These variables control Java settings and, for PySpark users, Python interpreter selection.

JAVA_HOME points to your JDK installation. Spark requires this to locate the java binary. Always set this explicitly rather than relying on system defaults—it prevents surprises when system updates change the default Java version.

SPARK_JAVA_OPTS passes options to all Spark JVMs (driver and executors), though it has been deprecated since Spark 1.0 in favor of the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions properties in spark-defaults.conf. For executor-specific or driver-specific options, the examples below use SPARK_EXECUTOR_OPTS and SPARK_DRIVER_OPTS.

PYSPARK_PYTHON specifies the Python interpreter for executor processes. PYSPARK_DRIVER_PYTHON sets the driver’s interpreter. These must point to identical Python environments across your cluster, or you’ll encounter serialization failures.
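A pre-flight check along these lines can catch interpreter mismatches before job submission. This is a sketch; the `python3` fallbacks are illustrative defaults:

```shell
#!/usr/bin/env bash
# Verify the executor and driver interpreters report the same version.
PYSPARK_PYTHON="${PYSPARK_PYTHON:-python3}"
PYSPARK_DRIVER_PYTHON="${PYSPARK_DRIVER_PYTHON:-$PYSPARK_PYTHON}"

exec_ver=$("$PYSPARK_PYTHON" --version 2>&1)
drv_ver=$("$PYSPARK_DRIVER_PYTHON" --version 2>&1)

if [ "$exec_ver" != "$drv_ver" ]; then
    echo "ERROR: interpreter mismatch: executor='$exec_ver' driver='$drv_ver'" >&2
    exit 1
fi
echo "Python versions match: $exec_ver"
```

Version equality is necessary but not sufficient; installed packages must match too, which is why pointing both variables at the same shared virtual environment path is the safer pattern.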

# spark-env.sh - JVM and Python configuration

export JAVA_HOME="/usr/lib/jvm/java-11-openjdk"

# Common JVM options for all Spark processes
export SPARK_JAVA_OPTS="-Djava.io.tmpdir=/data/spark/tmp"

# Executor JVM tuning - G1GC for large heaps
export SPARK_EXECUTOR_OPTS="
    -XX:+UseG1GC
    -XX:G1HeapRegionSize=16m
    -XX:InitiatingHeapOccupancyPercent=35
    -XX:+ParallelRefProcEnabled
    -XX:+ExitOnOutOfMemoryError
"

# Driver JVM options
export SPARK_DRIVER_OPTS="
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=200
    -Dlog4j.configuration=file:${SPARK_CONF_DIR}/log4j-driver.properties
"

# Python configuration - use virtual environment
export PYSPARK_PYTHON="/opt/spark-venv/bin/python"
export PYSPARK_DRIVER_PYTHON="/opt/spark-venv/bin/python"

The G1 garbage collector works well for Spark’s memory access patterns. The -XX:+ExitOnOutOfMemoryError flag ensures executors die cleanly on OOM rather than hanging in an undefined state.

Cluster & Network Configuration

Network settings determine how Spark processes discover and communicate with each other. Misconfigurations here cause connection timeouts and cluster formation failures.

SPARK_MASTER_HOST sets the hostname or IP the master binds to and advertises. On multi-homed hosts, set this explicitly to the interface workers should connect to.

SPARK_MASTER_PORT defaults to 7077. Change it if you run multiple masters or have port conflicts.

SPARK_LOCAL_IP tells Spark which local IP address to use for binding. Critical on hosts with multiple network interfaces.

SPARK_PUBLIC_DNS sets the hostname Spark advertises to external clients. Use this when your internal and external hostnames differ.

# spark-env.sh - Multi-node cluster configuration

# Master node configuration (set only on master)
export SPARK_MASTER_HOST="spark-master.internal.example.com"
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080

# Worker configuration (set on all workers)
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=8081

# Network binding - hostname -I lists all addresses; this only works
# if the first one is the internal interface
export SPARK_LOCAL_IP=$(hostname -I | awk '{print $1}')

# Public DNS for external access (e.g., through load balancer)
export SPARK_PUBLIC_DNS="${HOSTNAME}.example.com"

# Increase connection timeout for large clusters
export SPARK_DAEMON_JAVA_OPTS="-Dspark.network.timeout=300s"

For Kubernetes or YARN deployments, most of these settings get managed by the cluster manager. You’ll still set SPARK_LOCAL_IP when pods have multiple interfaces.
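When the ordering of `hostname -I` isn't trustworthy, one approach is to select the address by subnet instead. This sketch assumes a 10.0.0.0/8 internal range; substitute your cluster's own prefix:

```shell
#!/usr/bin/env bash
# Pick the local IP on the internal subnet rather than trusting the
# ordering of `hostname -I`. The 10.x prefix is an assumption.
pick_internal_ip() {
    ip -4 -o addr show 2>/dev/null \
        | awk '$4 ~ /^10\./ {split($4, a, "/"); print a[1]; exit}'
}

internal_ip=$(pick_internal_ip)
export SPARK_LOCAL_IP="${internal_ip:-127.0.0.1}"   # fall back to loopback
echo "SPARK_LOCAL_IP=${SPARK_LOCAL_IP}"
```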

Logging & Debugging Variables

Proper logging configuration makes debugging production issues possible. These variables control where Spark writes logs and how processes identify themselves.

SPARK_LOG_DIR sets the directory for daemon logs (master, worker, history server). Ensure this directory exists and has appropriate permissions on all nodes.

SPARK_PID_DIR stores process ID files. Spark uses these to track running daemons and prevent duplicate starts.

SPARK_IDENT_STRING customizes the identifier in log filenames and process names. Useful when running multiple Spark installations on the same host.

#!/usr/bin/env bash
# spark-env.sh - Logging configuration

export SPARK_LOG_DIR="/var/log/spark"
export SPARK_PID_DIR="/var/run/spark"
export SPARK_IDENT_STRING="${USER}-production"

# Create directories with correct permissions
mkdir -p "${SPARK_LOG_DIR}" "${SPARK_PID_DIR}"
chmod 755 "${SPARK_LOG_DIR}" "${SPARK_PID_DIR}"

# Log rotation handled externally - keep Spark logs manageable
export SPARK_DAEMON_JAVA_OPTS="
    -Dlog4j.configuration=file:${SPARK_CONF_DIR}/log4j-daemon.properties
    -Dspark.log.maxSize=100m
    -Dspark.log.maxBackupIndex=10
"

# Event logging for history server (applications must also set
# spark.eventLog.enabled=true and point spark.eventLog.dir at this path)
export SPARK_HISTORY_OPTS="
    -Dspark.history.fs.logDirectory=hdfs:///spark-events
    -Dspark.history.retainedApplications=50
"

Best Practices & Troubleshooting

Environment variable issues often manifest as cryptic errors far from the actual cause. Here’s how to prevent and diagnose problems.

Precedence matters: For Spark properties, values set programmatically on SparkConf take the highest precedence, followed by spark-submit command-line flags, followed by spark-defaults.conf. Environment variables like JAVA_HOME sit outside this chain entirely and can't be overridden programmatically; they're consumed before your code runs.

Validate before deployment: Don’t discover configuration errors during a production job. Run validation at startup.

#!/usr/bin/env bash
# validate-spark-env.sh - Run before deploying Spark jobs

set -e

echo "Validating Spark environment..."

# Check required variables
required_vars=(
    "SPARK_HOME"
    "JAVA_HOME"
    "SPARK_EXECUTOR_MEMORY"
    "SPARK_DRIVER_MEMORY"
)

for var in "${required_vars[@]}"; do
    if [ -z "${!var}" ]; then
        echo "ERROR: ${var} is not set"
        exit 1
    fi
    echo "✓ ${var}=${!var}"
done

# Validate JAVA_HOME points to working Java
if ! "${JAVA_HOME}/bin/java" -version &>/dev/null; then
    echo "ERROR: JAVA_HOME does not contain valid Java installation"
    exit 1
fi
echo "✓ Java version: $("${JAVA_HOME}/bin/java" -version 2>&1 | head -1)"

# Check memory settings are parseable
for mem_var in SPARK_EXECUTOR_MEMORY SPARK_DRIVER_MEMORY; do
    value="${!mem_var}"
    if ! [[ "$value" =~ ^[0-9]+[gGmM]$ ]]; then
        echo "ERROR: ${mem_var}='${value}' is not valid (expected format: 12g, 512m)"
        exit 1
    fi
done

# Verify log directory is writable
if [ -n "${SPARK_LOG_DIR}" ]; then
    if ! [ -w "${SPARK_LOG_DIR}" ]; then
        echo "ERROR: SPARK_LOG_DIR '${SPARK_LOG_DIR}' is not writable"
        exit 1
    fi
    echo "✓ Log directory writable: ${SPARK_LOG_DIR}"
fi

# Test PySpark Python if configured
if [ -n "${PYSPARK_PYTHON}" ]; then
    if ! "${PYSPARK_PYTHON}" --version &>/dev/null; then
        echo "ERROR: PYSPARK_PYTHON '${PYSPARK_PYTHON}' is not executable"
        exit 1
    fi
    echo "✓ PySpark Python: ${PYSPARK_PYTHON}"
fi

echo "Environment validation passed"

Run this script as part of your deployment pipeline and on cluster startup. It catches the common issues—missing Java, invalid memory formats, unwritable directories—before they cause runtime failures.

Common pitfalls to avoid: Don’t set SPARK_EXECUTOR_MEMORY higher than physical memory. Don’t forget that executor memory excludes off-heap usage. Don’t assume environment variables propagate to remote executors in cluster mode—they don’t unless you configure them to.
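For that last pitfall, Spark provides an explicit mechanism: the spark.executorEnv.* property namespace forwards a variable to executors, and on YARN, spark.yarn.appMasterEnv.* covers the driver in cluster mode. A sketch, where MY_SERVICE_URL and my_job.py are hypothetical placeholders:

```shell
# Forward an environment variable to remote processes explicitly.
# MY_SERVICE_URL and my_job.py are hypothetical placeholders.
spark-submit \
    --conf spark.executorEnv.MY_SERVICE_URL="https://internal.example.com" \
    --conf spark.yarn.appMasterEnv.MY_SERVICE_URL="https://internal.example.com" \
    my_job.py
```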

Environment configuration isn’t glamorous, but it’s foundational. Get it right once, validate it continuously, and your Spark clusters will thank you with stability.
