Apache Spark - Spark History Server Setup

Key Insights

  • The Spark History Server provides critical visibility into completed application execution, making it essential for debugging failures, performance tuning, and maintaining audit trails in production environments.
  • Proper event log configuration at both the application and server level is crucial—misconfigured log directories are the most common source of History Server issues.
  • Cloud storage integration requires careful attention to authentication mechanisms and JAR dependencies, but offers superior durability and cost-effectiveness compared to HDFS for long-term log retention.

Introduction to Spark History Server

When a Spark application finishes execution, its web UI disappears along with valuable debugging information. The Spark History Server solves this problem by persisting application event logs and providing a web interface to view them after completion.

In production environments, the History Server isn’t optional—it’s essential infrastructure. Without it, you’re flying blind when investigating why last night’s batch job failed or why a completed application consumed three times the expected resources. The History Server provides access to the same detailed information available in the live Spark UI: job and stage timelines, executor metrics, SQL query plans, and storage details.

Beyond debugging, the History Server serves as an audit trail. Compliance requirements often mandate retention of execution records, and the History Server provides a queryable interface to that historical data.

Prerequisites and Architecture Overview

Before setting up the History Server, ensure you have these components in place:

  • Spark installation: Version 2.x or 3.x with consistent versions across your cluster (a few properties shown later, such as spark.eventLog.compression.codec, require 3.x)
  • Shared storage: A location accessible by both Spark applications and the History Server (HDFS, S3, GCS, Azure Blob, or a shared filesystem)
  • Network access: The History Server port must be reachable by users who need to view logs

The architecture is straightforward. Spark applications write event logs to a configured directory during execution. The History Server daemon runs separately, continuously scanning that directory for new or updated log files. When you access the History Server UI, it reads these event logs and reconstructs the application’s execution timeline.

┌─────────────────┐     writes      ┌──────────────────┐
│ Spark App 1     │ ──────────────► │                  │
└─────────────────┘                 │   Event Log      │
┌─────────────────┐     writes      │   Storage        │
│ Spark App 2     │ ──────────────► │  (HDFS/S3/etc)   │
└─────────────────┘                 │                  │
┌─────────────────┐     writes      │                  │
│ Spark App N     │ ──────────────► │                  │
└─────────────────┘                 └────────┬─────────┘
                                             │ reads
                                    ┌──────────────────┐
                                    │  History Server  │
                                    │    (Web UI)      │
                                    └──────────────────┘

Configuring Event Logging in Spark Applications

Applications must explicitly enable event logging. Add these properties to spark-defaults.conf on every node that submits Spark applications:

# Enable event logging
spark.eventLog.enabled=true

# Directory for event logs (must be accessible by History Server)
spark.eventLog.dir=hdfs:///spark-history

# Compress logs to reduce storage (recommended)
spark.eventLog.compress=true

# Compression codec (lz4 offers good balance of speed and ratio)
spark.eventLog.compression.codec=lz4

For applications submitted via spark-submit, you can also pass these as command-line arguments:

spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///spark-history \
  --conf spark.eventLog.compress=true \
  --class com.example.MyApp \
  my-app.jar
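Because Spark accepts both `key value` and `key=value` forms in spark-defaults.conf (it is parsed as a Java properties file), a quick preflight check can catch a missing or mis-set event-log property before jobs run. The file path, helper names, and sample content below are illustrative, not part of Spark itself:

```python
# Sanity-check spark-defaults.conf for the event-log settings described above.
# Helper names and the sample text are illustrative, not a Spark API.

REQUIRED = {
    "spark.eventLog.enabled": "true",
    "spark.eventLog.dir": None,  # any non-empty value is acceptable
}

def parse_defaults(text):
    """Parse spark-defaults.conf content into a dict, skipping comments.

    Spark accepts both 'key value' and 'key=value' separators.
    """
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=" in line:
            key, _, value = line.partition("=")
        else:
            key, _, value = line.partition(" ")
        conf[key.strip()] = value.strip()
    return conf

def missing_event_log_settings(conf):
    """Return the required keys that are absent or set to the wrong value."""
    problems = []
    for key, expected in REQUIRED.items():
        value = conf.get(key)
        if not value or (expected is not None and value != expected):
            problems.append(key)
    return problems

sample = """
# Enable event logging
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///spark-history
"""
print(missing_event_log_settings(parse_defaults(sample)))  # [] when all present
```

Running this against each submit node's config is a cheap way to avoid the most common failure mode: applications silently producing no event logs at all.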

Create the event log directory before running applications:

# For HDFS
hdfs dfs -mkdir -p /spark-history
hdfs dfs -chmod 1777 /spark-history

# For local filesystem (testing only)
mkdir -p /var/log/spark-history
chmod 1777 /var/log/spark-history

The sticky bit (1777) allows multiple users to write logs while preventing deletion of others’ files.
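To make the mode concrete, here is a small self-contained demonstration (pure Python, nothing Spark-specific) that applies mode 1777 to a scratch directory and confirms the sticky bit is set:

```python
import os
import stat
import tempfile

# Create a scratch directory and apply mode 1777 (world-writable + sticky bit).
path = tempfile.mkdtemp()
os.chmod(path, 0o1777)

mode = os.stat(path).st_mode
print(bool(mode & stat.S_ISVTX))   # True: sticky bit is present
print(oct(stat.S_IMODE(mode)))     # 0o1777
os.rmdir(path)
```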

History Server Installation and Configuration

The History Server is included with Spark—no separate installation required. Configuration happens in spark-defaults.conf on the machine running the History Server:

# Directory to scan for event logs (must match spark.eventLog.dir)
spark.history.fs.logDirectory=hdfs:///spark-history

# Port for the web UI
spark.history.ui.port=18080

# How often to scan for new logs (in seconds)
spark.history.fs.update.interval=10s

# Number of applications to retain in memory
spark.history.retainedApplications=50

# Maximum disk usage for local cache
spark.history.store.maxDiskUsage=10g

# Path for local cache of event logs
spark.history.store.path=/var/spark-history-cache

# Enable access control lists (optional)
spark.history.ui.acls.enable=false

For memory allocation, set environment variables in spark-env.sh. The daemon heap size is controlled by SPARK_DAEMON_MEMORY (default 1g); SPARK_HISTORY_OPTS carries additional JVM options such as GC tuning:

# Heap size for the History Server JVM
export SPARK_DAEMON_MEMORY=4g

# Additional JVM options for GC tuning
export SPARK_HISTORY_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200"

Starting and Managing the History Server

Spark provides convenience scripts for managing the History Server:

# Start the History Server
$SPARK_HOME/sbin/start-history-server.sh

# Stop the History Server
$SPARK_HOME/sbin/stop-history-server.sh

# Check if it's running
jps | grep HistoryServer

# View logs
tail -f $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.history.HistoryServer-*.out

For production deployments, use a systemd service unit:

# /etc/systemd/system/spark-history-server.service
[Unit]
Description=Apache Spark History Server
After=network.target

[Service]
Type=forking
User=spark
Group=spark
Environment="SPARK_HOME=/opt/spark"
Environment="JAVA_HOME=/usr/lib/jvm/java-11-openjdk"
ExecStart=/opt/spark/sbin/start-history-server.sh
ExecStop=/opt/spark/sbin/stop-history-server.sh
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable spark-history-server
sudo systemctl start spark-history-server
sudo systemctl status spark-history-server

Access the UI at http://<history-server-host>:18080.
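Besides the web pages, the History Server exposes a REST API under /api/v1; for example, GET /api/v1/applications returns a JSON array of applications with id, name, and attempts fields. A minimal sketch for extracting application IDs from that response follows — the hostname and sample payload are illustrative, but the endpoint and field names mirror the documented API:

```python
import json
from urllib.request import urlopen  # used for the live call, shown commented out

def list_applications(payload):
    """Extract (id, name) pairs from a /api/v1/applications response body."""
    return [(app["id"], app["name"]) for app in json.loads(payload)]

# Against a live server you would fetch the payload like this
# (hostname is a placeholder):
# payload = urlopen("http://history-host:18080/api/v1/applications").read()

# Illustrative sample shaped like the real response:
sample = json.dumps([
    {"id": "app-20240101120000-0001", "name": "MyApp",
     "attempts": [{"completed": True}]},
])
print(list_applications(sample))
```

The same endpoint is handy for scripting cleanup reports or wiring the History Server into dashboards without screen-scraping the UI.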

Cloud Storage Integration (S3/GCS/Azure Blob)

Cloud object storage is the preferred choice for event logs in cloud deployments. It’s durable, cost-effective, and doesn’t require managing HDFS infrastructure.

Amazon S3 Configuration

# spark-defaults.conf
spark.eventLog.dir=s3a://my-bucket/spark-history
spark.history.fs.logDirectory=s3a://my-bucket/spark-history

# S3A filesystem configuration
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider

For IAM role-based authentication (recommended), ensure your EC2 instance or EKS pod has an appropriate IAM role attached. For explicit credentials (not recommended for production):

spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY

Required JARs for S3 support:

# Download and place in $SPARK_HOME/jars/
hadoop-aws-3.3.4.jar
aws-java-sdk-bundle-1.12.262.jar
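These artifacts follow the standard Maven Central repository layout, so their download URLs can be derived mechanically. The versions shown must match your Hadoop build (and the SDK bundle must match the hadoop-aws version); the helper below is an illustrative sketch, not a Spark or Hadoop tool:

```python
def maven_central_url(group, artifact, version):
    """Build a Maven Central download URL from standard repo-layout coordinates."""
    group_path = group.replace(".", "/")
    return (f"https://repo1.maven.org/maven2/"
            f"{group_path}/{artifact}/{version}/{artifact}-{version}.jar")

# hadoop-aws must match your Hadoop version; the SDK bundle must match hadoop-aws.
print(maven_central_url("org.apache.hadoop", "hadoop-aws", "3.3.4"))
print(maven_central_url("com.amazonaws", "aws-java-sdk-bundle", "1.12.262"))
```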

Google Cloud Storage Configuration

spark.eventLog.dir=gs://my-bucket/spark-history
spark.history.fs.logDirectory=gs://my-bucket/spark-history

spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.gs.auth.type=SERVICE_ACCOUNT_JSON_KEYFILE
spark.hadoop.fs.gs.auth.service.account.json.keyfile=/path/to/service-account.json

Required JAR: gcs-connector-hadoop3-2.2.11-shaded.jar

Azure Blob Storage Configuration

spark.eventLog.dir=wasbs://container@account.blob.core.windows.net/spark-history
spark.history.fs.logDirectory=wasbs://container@account.blob.core.windows.net/spark-history

spark.hadoop.fs.azure.account.key.account.blob.core.windows.net=YOUR_STORAGE_KEY

Troubleshooting and Best Practices

Common Issues

History Server shows no applications: Verify the log directory paths match exactly between spark.eventLog.dir and spark.history.fs.logDirectory. Check permissions—the History Server user must have read access to the event logs.

# Check HDFS permissions
hdfs dfs -ls /spark-history

# Check S3 access
aws s3 ls s3://my-bucket/spark-history/

UI loads but applications fail to display: This often indicates corrupted or incomplete log files. Check the History Server logs for parsing errors:

grep -i "error\|exception" $SPARK_HOME/logs/spark-*-HistoryServer-*.out

Out of memory errors: Increase the History Server heap size and reduce spark.history.retainedApplications:

export SPARK_DAEMON_MEMORY=8g

Best Practices

Implement log retention policies: Event logs accumulate quickly. Configure automatic cleanup:

# Enable cleaner
spark.history.fs.cleaner.enabled=true

# Maximum age of log files
spark.history.fs.cleaner.maxAge=30d

# How often to run cleaner
spark.history.fs.cleaner.interval=1d
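To pick a sensible maxAge, a back-of-the-envelope storage estimate helps. The workload numbers below (applications per day, average compressed log size) are assumptions you should replace with figures from your own cluster:

```python
def retention_storage_gb(apps_per_day, avg_log_mb, retention_days):
    """Rough upper bound on event-log storage for a given cleaner maxAge."""
    return apps_per_day * avg_log_mb * retention_days / 1024

# Assumed workload: 200 apps/day, ~20 MB compressed per app, 30-day maxAge.
print(round(retention_storage_gb(200, 20, 30), 1))  # ~117.2 GB
```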

Secure the UI: In production, place the History Server behind a reverse proxy with authentication:

# nginx configuration snippet
location /spark-history/ {
    auth_basic "Spark History Server";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass http://localhost:18080/;
}

Monitor the History Server: Add health checks to your monitoring system:

# Simple health check
curl -s -o /dev/null -w "%{http_code}" http://localhost:18080/api/v1/applications

Separate storage from compute: Use dedicated cloud storage buckets for event logs rather than ephemeral cluster storage. This ensures logs survive cluster termination and allows centralized viewing across multiple clusters.

The Spark History Server is foundational infrastructure for any production Spark deployment. Invest the time to configure it properly, and you’ll have invaluable visibility into your application behavior when debugging issues at 2 AM.
