Apache Spark - Spark History Server Setup
Key Insights
- The Spark History Server provides critical visibility into completed application execution, making it essential for debugging failures, performance tuning, and maintaining audit trails in production environments.
- Proper event log configuration at both the application and server level is crucial—misconfigured log directories are the most common source of History Server issues.
- Cloud storage integration requires careful attention to authentication mechanisms and JAR dependencies, but offers superior durability and cost-effectiveness compared to HDFS for long-term log retention.
Introduction to Spark History Server
When a Spark application finishes execution, its web UI disappears along with valuable debugging information. The Spark History Server solves this problem by persisting application event logs and providing a web interface to view them after completion.
In production environments, the History Server isn’t optional—it’s essential infrastructure. Without it, you’re flying blind when investigating why last night’s batch job failed or why a completed application consumed three times the expected resources. The History Server provides access to the same detailed information available in the live Spark UI: job and stage timelines, executor metrics, SQL query plans, and storage details.
Beyond debugging, the History Server serves as an audit trail. Compliance requirements often mandate retention of execution records, and the History Server provides a queryable interface to that historical data.
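That queryable interface is the History Server's monitoring REST API. The sketch below (host and port are assumptions matching the default configuration shown later) builds the endpoints an audit or compliance script might fetch:

```shell
# Sketch: endpoints of the History Server's monitoring REST API.
# Host and port are assumptions; adjust for your deployment.
HOST="http://localhost:18080"

# List every application known to the History Server.
APPS="$HOST/api/v1/applications"
# Restrict to completed applications, capped at 50 entries.
COMPLETED="$HOST/api/v1/applications?status=completed&limit=50"

echo "$APPS"
echo "$COMPLETED"
# In practice you would fetch and archive the JSON, e.g.:
#   curl -s "$COMPLETED" > audit-$(date +%F).json
```

Each application entry includes its ID, name, and attempt start/end times, which is usually enough to satisfy a retention-of-execution-records requirement.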
Prerequisites and Architecture Overview
Before setting up the History Server, ensure you have these components in place:
- Spark installation: Version 2.x or 3.x with consistent versions across your cluster
- Shared storage: A location accessible by both Spark applications and the History Server (HDFS, S3, GCS, Azure Blob, or a shared filesystem)
- Network access: The History Server port must be reachable by users who need to view logs
The architecture is straightforward. Spark applications write event logs to a configured directory during execution. The History Server daemon runs separately, continuously scanning that directory for new or updated log files. When you access the History Server UI, it reads these event logs and reconstructs the application’s execution timeline.
┌─────────────────┐  writes   ┌──────────────────┐
│  Spark App 1    │ ────────► │                  │
└─────────────────┘           │    Event Log     │
┌─────────────────┐  writes   │     Storage      │
│  Spark App 2    │ ────────► │  (HDFS/S3/etc)   │
└─────────────────┘           │                  │
┌─────────────────┐  writes   │                  │
│  Spark App N    │ ────────► │                  │
└─────────────────┘           └────────┬─────────┘
                                       │ reads
                                       ▼
                              ┌──────────────────┐
                              │  History Server  │
                              │    (Web UI)      │
                              └──────────────────┘
Configuring Event Logging in Spark Applications
Applications must explicitly enable event logging. Add these properties to spark-defaults.conf on every node that submits Spark applications:
# Enable event logging
spark.eventLog.enabled=true
# Directory for event logs (must be accessible by History Server)
spark.eventLog.dir=hdfs:///spark-history
# Compress logs to reduce storage (recommended)
spark.eventLog.compress=true
# Compression codec (lz4 offers good balance of speed and ratio)
spark.eventLog.compression.codec=lz4
For applications submitted via spark-submit, you can also pass these as command-line arguments:
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///spark-history \
  --conf spark.eventLog.compress=true \
  --class com.example.MyApp \
  my-app.jar
Create the event log directory before running applications:
# For HDFS
hdfs dfs -mkdir -p /spark-history
hdfs dfs -chmod 1777 /spark-history
# For local filesystem (testing only)
mkdir -p /var/log/spark-history
chmod 1777 /var/log/spark-history
The sticky bit (1777) allows multiple users to write logs while preventing deletion of others’ files.
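A quick local sanity check, using a throwaway directory rather than the real log path, shows what mode 1777 looks like:

```shell
# Demonstrate mode 1777 on a throwaway directory. The leading 1 in the
# octal mode is the sticky bit, shown as 't' in the long listing.
DIR=$(mktemp -d)
chmod 1777 "$DIR"
stat -c '%a' "$DIR"       # octal mode, e.g. 1777
ls -ld "$DIR" | cut -c1-10  # permission string, e.g. drwxrwxrwt
rmdir "$DIR"
```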
History Server Installation and Configuration
The History Server is included with Spark—no separate installation required. Configuration happens in spark-defaults.conf on the machine running the History Server:
# Directory to scan for event logs (must match spark.eventLog.dir)
spark.history.fs.logDirectory=hdfs:///spark-history
# Port for the web UI
spark.history.ui.port=18080
# How often to scan for new logs
spark.history.fs.update.interval=10s
# Number of applications to retain in memory
spark.history.retainedApplications=50
# Maximum disk usage for local cache
spark.history.store.maxDiskUsage=10g
# Path for local cache of event logs
spark.history.store.path=/var/spark-history-cache
# Enable access control lists (optional)
spark.history.ui.acls.enable=false
For memory allocation, set environment variables in spark-env.sh:
# Memory for History Server JVM
export SPARK_HISTORY_OPTS="-Xmx4g -XX:+UseG1GC"
# Alternatively, for more detailed tuning
export SPARK_HISTORY_OPTS="-Xmx4g -Xms2g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
Starting and Managing the History Server
Spark provides convenience scripts for managing the History Server:
# Start the History Server
$SPARK_HOME/sbin/start-history-server.sh
# Stop the History Server
$SPARK_HOME/sbin/stop-history-server.sh
# Check if it's running
jps | grep HistoryServer
# View logs
tail -f $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.history.HistoryServer-*.out
For production deployments, use a systemd service unit:
# /etc/systemd/system/spark-history-server.service
[Unit]
Description=Apache Spark History Server
After=network.target
[Service]
Type=forking
User=spark
Group=spark
Environment="SPARK_HOME=/opt/spark"
Environment="JAVA_HOME=/usr/lib/jvm/java-11-openjdk"
ExecStart=/opt/spark/sbin/start-history-server.sh
ExecStop=/opt/spark/sbin/stop-history-server.sh
Restart=on-failure
RestartSec=30
[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable spark-history-server
sudo systemctl start spark-history-server
sudo systemctl status spark-history-server
Access the UI at http://<history-server-host>:18080.
Cloud Storage Integration (S3/GCS/Azure Blob)
Cloud object storage is the preferred choice for event logs in cloud deployments. It’s durable, cost-effective, and doesn’t require managing HDFS infrastructure.
Amazon S3 Configuration
# spark-defaults.conf
spark.eventLog.dir=s3a://my-bucket/spark-history
spark.history.fs.logDirectory=s3a://my-bucket/spark-history
# S3A filesystem configuration
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider
For IAM role-based authentication (recommended), ensure your EC2 instance or EKS pod has an appropriate IAM role attached. For explicit credentials (not recommended for production):
spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY
Required JARs for S3 support:
# Download and place in $SPARK_HOME/jars/
hadoop-aws-3.3.4.jar
aws-java-sdk-bundle-1.12.262.jar
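Maven Central's directory layout is predictable, so a small helper can construct the download URLs. The versions below are the ones listed above; treat them as assumptions that must match your Hadoop build:

```shell
# Build Maven Central download URLs for the S3A JARs listed above.
# Versions are assumptions and must match your Hadoop build.
maven_url() {  # maven_url <group-as-path> <artifact> <version>
  echo "https://repo1.maven.org/maven2/$1/$2/$3/$2-$3.jar"
}

HADOOP_AWS_URL=$(maven_url org/apache/hadoop hadoop-aws 3.3.4)
AWS_SDK_URL=$(maven_url com/amazonaws aws-java-sdk-bundle 1.12.262)

echo "$HADOOP_AWS_URL"
echo "$AWS_SDK_URL"
# Then fetch into the Spark classpath, e.g.:
#   curl -fLo "$SPARK_HOME/jars/$(basename "$HADOOP_AWS_URL")" "$HADOOP_AWS_URL"
```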
Google Cloud Storage Configuration
spark.eventLog.dir=gs://my-bucket/spark-history
spark.history.fs.logDirectory=gs://my-bucket/spark-history
spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.gs.auth.type=SERVICE_ACCOUNT_JSON_KEYFILE
spark.hadoop.fs.gs.auth.service.account.json.keyfile=/path/to/service-account.json
Required JAR: gcs-connector-hadoop3-2.2.11-shaded.jar
Azure Blob Storage Configuration
spark.eventLog.dir=wasbs://container@account.blob.core.windows.net/spark-history
spark.history.fs.logDirectory=wasbs://container@account.blob.core.windows.net/spark-history
spark.hadoop.fs.azure.account.key.account.blob.core.windows.net=YOUR_STORAGE_KEY
Required JARs: hadoop-azure and its azure-storage dependency, with versions matching your Hadoop build. For ADLS Gen2 accounts, prefer the abfss:// scheme (also provided by hadoop-azure) over the legacy wasbs:// driver.
Troubleshooting and Best Practices
Common Issues
History Server shows no applications: Verify the log directory paths match exactly between spark.eventLog.dir and spark.history.fs.logDirectory. Check permissions—the History Server user must have read access to the event logs.
# Check HDFS permissions
hdfs dfs -ls /spark-history
# Check S3 access
aws s3 ls s3://my-bucket/spark-history/
UI loads but applications fail to display: This often indicates corrupted or incomplete log files. Check the History Server logs for parsing errors:
grep -i "error\|exception" $SPARK_HOME/logs/spark-*-HistoryServer-*.out
Out of memory errors: Increase History Server heap size and reduce spark.history.retainedApplications:
export SPARK_HISTORY_OPTS="-Xmx8g"
Best Practices
Implement log retention policies: Event logs accumulate quickly. Configure automatic cleanup:
# Enable cleaner
spark.history.fs.cleaner.enabled=true
# Maximum age of log files
spark.history.fs.cleaner.maxAge=30d
# How often to run cleaner
spark.history.fs.cleaner.interval=1d
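For long-running applications (streaming jobs in particular), a single ever-growing event log can defeat the cleaner, which works at file granularity. Spark 3.0+ can roll event logs into bounded files; the values below are a sketch close to the defaults:

```properties
# Roll event logs once they reach a size threshold (Spark 3.0+)
spark.eventLog.rolling.enabled=true
spark.eventLog.rolling.maxFileSize=128m
# Let the History Server discard older rolled files per application
spark.history.fs.eventLog.rolling.maxFilesToRetain=10
```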
Secure the UI: In production, place the History Server behind a reverse proxy with authentication. If you serve it under a subpath, also set spark.ui.proxyBase (e.g. /spark-history) so links generated by the UI resolve correctly:
# nginx configuration snippet
location /spark-history/ {
    auth_basic "Spark History Server";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass http://localhost:18080/;
}
Monitor the History Server: Add health checks to your monitoring system:
# Simple health check
curl -s -o /dev/null -w "%{http_code}" http://localhost:18080/api/v1/applications
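That one-liner can be wrapped into a small script for cron or an alerting agent. The function name and default URL below are illustrative, not part of Spark:

```shell
# Hypothetical health-check helper: reports "healthy" only on HTTP 200 from
# the History Server's REST API. The default URL is illustrative.
check_history_server() {
  local url="${1:-http://localhost:18080/api/v1/applications}"
  local code
  # curl prints the HTTP status code; 000 means the server was unreachable.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)
  if [ "$code" = "200" ]; then
    echo "healthy"
  else
    echo "unhealthy (HTTP $code)"
  fi
}

check_history_server
```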
Separate storage from compute: Use dedicated cloud storage buckets for event logs rather than ephemeral cluster storage. This ensures logs survive cluster termination and allows centralized viewing across multiple clusters.
The Spark History Server is foundational infrastructure for any production Spark deployment. Invest the time to configure it properly, and you’ll have invaluable visibility into your application behavior when debugging issues at 2 AM.