Docker Setup for Apache Spark
Key Insights
- Docker provides the fastest path to a reproducible Spark development environment, eliminating the “works on my machine” problem that plagues distributed systems setup.
- A properly configured docker-compose setup can simulate a multi-node Spark cluster on a single machine, enabling realistic testing of distributed workloads before deploying to production clusters.
- Volume mounts and custom configuration files are essential for practical Spark development—without them, you lose data and settings every time containers restart.
Introduction to Spark on Docker
Setting up Apache Spark traditionally involves wrestling with Java versions, Scala dependencies, Hadoop configurations, and environment variables across multiple machines. Docker eliminates this friction entirely. You define your environment once, and it runs identically everywhere.
Containerizing Spark makes sense for three primary reasons. First, reproducibility—your development environment matches your teammate’s environment matches your CI pipeline. Second, isolation—Spark’s dependencies won’t conflict with other tools on your system. Third, simplified cluster simulation—you can run a multi-node cluster on your laptop without configuring network interfaces or SSH keys.
This setup works best for local development, integration testing, and learning Spark internals. For production workloads processing terabytes of data, you’ll eventually move to Kubernetes or managed services like EMR or Databricks. But for everything else, Docker gets you running in minutes.
Prerequisites and Environment Setup
Before diving in, verify your Docker installation and ensure your system can handle Spark’s memory requirements. Spark is memory-hungry by design—it keeps data in RAM for fast iterative processing.
# Verify Docker installation
docker --version
# Expected: Docker version 24.0.0 or higher
# Check Docker Compose (now integrated into Docker CLI)
docker compose version
# Expected: Docker Compose version v2.20.0 or higher
# Verify Docker daemon is running
docker info | grep "Server Version"
For system requirements, allocate at least 8GB of RAM to Docker. On macOS or Windows, adjust this in Docker Desktop settings under Resources. Each Spark worker needs 1-2GB minimum, and the master needs another 512MB-1GB.
# Check Docker's available resources
docker system info | grep -E "Total Memory|CPUs"
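To sanity-check the allocation before starting containers, a quick back-of-the-envelope calculation helps. The figures below are illustrative, matching the guidance above (roughly 1GB each for the master and driver, 1-2GB per worker):

```python
# Rough memory budget for a local Spark cluster (illustrative numbers only).
def cluster_memory_gb(workers: int, worker_gb: float = 2.0,
                      master_gb: float = 1.0, driver_gb: float = 1.0) -> float:
    """Return the minimum RAM (in GB) Docker should have available."""
    return master_gb + driver_gb + workers * worker_gb

# Two workers at 2 GB each, plus master and driver overhead:
print(cluster_memory_gb(2))  # 6.0 -> fits comfortably in an 8 GB allocation
```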
Spark containers communicate over Docker’s internal network. By default, Docker Compose creates a bridge network in which containers reach each other by service name. The master listens on port 7077 for worker connections and serves its web UI on 8080. Workers expose their own UIs on port 8081.
Single-Node Spark Container Setup
For quick experimentation, a single container with Spark installed is the fastest path to running code. The Bitnami Spark image is well-maintained and includes sensible defaults.
# Pull the Spark image
docker pull bitnami/spark:3.5.1
# Run a single Spark container with interactive shell
docker run -it --rm \
--name spark-standalone \
-p 8080:8080 \
-p 4040:4040 \
-v $(pwd)/data:/opt/spark-data \
-v $(pwd)/apps:/opt/spark-apps \
bitnami/spark:3.5.1 \
spark-shell
This command mounts two local directories: data for input/output files and apps for your Spark applications. The -p flags expose the Spark master UI (8080) and application UI (4040) to your host machine.
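It is worth creating the mount sources before the first `docker run`; if the host directories don’t exist, the Docker daemon creates them on demand, typically owned by root:

```shell
# Create the bind-mount sources up front so they are owned by your user
mkdir -p data apps
```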
For PySpark instead of the Scala shell:
docker run -it --rm \
--name spark-standalone \
-p 8080:8080 \
-p 4040:4040 \
-v $(pwd)/data:/opt/spark-data \
bitnami/spark:3.5.1 \
pyspark
Once inside the shell, verify Spark is working:
# In PySpark shell
df = spark.range(1000)
df.show(5)
df.count()
Multi-Node Cluster with Docker Compose
A single container doesn’t demonstrate Spark’s distributed nature. For realistic development, set up a cluster with one master and multiple workers. Create a docker-compose.yml file:
version: '3.8'

services:
  spark-master:
    image: bitnami/spark:3.5.1
    container_name: spark-master
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_MASTER_WEBUI_PORT=8080
    ports:
      - "8080:8080"
      - "7077:7077"
      - "4040:4040"  # driver UI when submitting in client mode from this container
    volumes:
      - ./data:/opt/spark-data
      - ./apps:/opt/spark-apps
      - ./conf:/opt/bitnami/spark/conf
      - spark-logs:/opt/bitnami/spark/logs
    networks:
      - spark-network

  spark-worker-1:
    image: bitnami/spark:3.5.1
    container_name: spark-worker-1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - "8081:8081"
    volumes:
      - ./data:/opt/spark-data
      - ./apps:/opt/spark-apps
      - spark-logs:/opt/bitnami/spark/logs
    depends_on:
      - spark-master
    networks:
      - spark-network

  spark-worker-2:
    image: bitnami/spark:3.5.1
    container_name: spark-worker-2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - "8082:8081"
    volumes:
      - ./data:/opt/spark-data
      - ./apps:/opt/spark-apps
      - spark-logs:/opt/bitnami/spark/logs
    depends_on:
      - spark-master
    networks:
      - spark-network

networks:
  spark-network:
    driver: bridge

volumes:
  spark-logs:
Start the cluster:
# Start all services
docker compose up -d
# Verify all containers are running
docker compose ps
# Check master logs
docker compose logs spark-master
# Watch worker registration
docker compose logs -f spark-worker-1
Navigate to http://localhost:8080 to see the Spark master UI. You should see both workers registered with their allocated memory and cores.
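Besides the web page, the standalone master serves machine-readable status at http://localhost:8080/json, which is convenient for scripted checks. The payload below is a trimmed, hypothetical sample of that endpoint’s fields, so treat this as a sketch rather than the exact schema:

```python
import json

# A response shaped like the standalone master's /json endpoint (trimmed to
# the fields used here; fetch the real one with curl or urllib).
sample = json.loads("""
{
  "url": "spark://spark-master:7077",
  "workers": [
    {"id": "worker-1", "host": "172.18.0.3", "cores": 2, "memory": 2048, "state": "ALIVE"},
    {"id": "worker-2", "host": "172.18.0.4", "cores": 2, "memory": 2048, "state": "ALIVE"}
  ],
  "status": "ALIVE"
}
""")

# Count registered workers and total cluster capacity
alive = [w for w in sample["workers"] if w["state"] == "ALIVE"]
print(f"{len(alive)} workers alive, "
      f"{sum(w['cores'] for w in alive)} cores, "
      f"{sum(w['memory'] for w in alive)} MB total")
```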
Submitting and Running Spark Applications
With the cluster running, submit applications using spark-submit. First, create a sample PySpark application in your apps directory:
# apps/word_count.py
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder \
        .appName("WordCount") \
        .getOrCreate()

    # Create sample data
    data = [
        "Apache Spark is a unified analytics engine",
        "Spark provides high-level APIs in Java Scala Python and R",
        "Spark powers a stack of libraries for SQL streaming and machine learning"
    ]

    rdd = spark.sparkContext.parallelize(data)

    word_counts = rdd \
        .flatMap(lambda line: line.lower().split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .sortBy(lambda x: x[1], ascending=False)

    print("\nWord Counts:")
    for word, count in word_counts.collect():
        print(f"  {word}: {count}")

    spark.stop()

if __name__ == "__main__":
    main()
Submit the application to the cluster:
# Submit via docker exec
docker exec spark-master spark-submit \
--master spark://spark-master:7077 \
--deploy-mode client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
/opt/spark-apps/word_count.py
While the job runs, access the application UI at http://localhost:4040 to monitor stages, tasks, and executors. In client mode the driver runs inside the spark-master container, so this only works if port 4040 is published on that service. The UI exists only while an application is running.
For interactive development, connect to the master and start a PySpark shell pointing at the cluster:
docker exec -it spark-master pyspark \
--master spark://spark-master:7077 \
--executor-memory 1g
Persistent Storage and Configuration
Docker containers are ephemeral by default. Without volume mounts, you lose everything when containers stop. The docker-compose file above already includes essential mounts, but let’s examine configuration in detail.
Create a custom Spark configuration file:
# conf/spark-defaults.conf
spark.driver.memory 1g
spark.executor.memory 1g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.eventLog.enabled true
spark.eventLog.dir /opt/bitnami/spark/logs
spark.history.fs.logDirectory /opt/bitnami/spark/logs
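The file format is simple: one whitespace-separated key/value pair per line, with `#` comments ignored. A small checker like the one below (an illustrative helper, not part of Spark) can catch typos in the file before you restart the cluster:

```python
# Minimal parser for spark-defaults.conf syntax: one whitespace-separated
# "key value" pair per line, blank lines and '#' comments skipped.
def parse_spark_defaults(text: str) -> dict:
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split(None, 1)  # split on the first run of whitespace
        conf[key] = value.strip()
    return conf

sample = """\
spark.eventLog.enabled           true
spark.eventLog.dir               /opt/bitnami/spark/logs
"""
conf = parse_spark_defaults(sample)
assert conf["spark.eventLog.enabled"] == "true"
```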
Add a history server to your docker-compose for viewing completed applications:
  spark-history:
    image: bitnami/spark:3.5.1
    container_name: spark-history
    environment:
      # Run the history server in the foreground so the container stays up
      - SPARK_NO_DAEMONIZE=true
    command: /opt/bitnami/spark/sbin/start-history-server.sh
    ports:
      - "18080:18080"
    volumes:
      - ./conf:/opt/bitnami/spark/conf
      - spark-logs:/opt/bitnami/spark/logs
    depends_on:
      - spark-master
    networks:
      - spark-network
For data persistence, organize your mounted volumes logically:
mkdir -p data/input data/output data/checkpoints apps conf
Troubleshooting and Production Considerations
The most common issue is memory. If you see java.lang.OutOfMemoryError or containers getting killed, increase Docker’s memory allocation or reduce SPARK_WORKER_MEMORY. Check actual usage:
docker stats --no-stream
Network issues manifest as workers failing to register. Verify the network exists and containers can communicate:
# Check the network (Compose prefixes the name with your project directory)
docker network ls
docker network inspect spark-docker_spark-network
# Test connectivity from worker to master (ping isn't installed in every image,
# so use bash's built-in /dev/tcp instead)
docker exec spark-worker-1 bash -c 'exec 3<>/dev/tcp/spark-master/7077 && echo "master reachable"'
If Spark UI shows executors as “dead,” check worker logs for connection timeouts:
docker compose logs spark-worker-1 | grep -i error
For production workloads, Docker Compose on a single machine won’t cut it. Consider Kubernetes with the Spark Operator when you need actual distributed processing across multiple physical nodes, dynamic scaling, and proper resource isolation. Managed services like AWS EMR, Google Dataproc, or Databricks handle the operational complexity for you.
However, this Docker setup remains valuable even with production clusters available. Use it for rapid prototyping, unit testing Spark jobs in CI pipelines, and learning Spark internals without cloud costs. The configuration patterns transfer directly to production—you’re just swapping the cluster manager.