Docker Setup for Apache Spark

Key Insights

  • Docker provides the fastest path to a reproducible Spark development environment, eliminating the “works on my machine” problem that plagues distributed systems setup.
  • A properly configured docker-compose setup can simulate a multi-node Spark cluster on a single machine, enabling realistic testing of distributed workloads before deploying to production clusters.
  • Volume mounts and custom configuration files are essential for practical Spark development—without them, you lose data and settings every time containers restart.

Introduction to Spark on Docker

Setting up Apache Spark traditionally involves wrestling with Java versions, Scala dependencies, Hadoop configurations, and environment variables across multiple machines. Docker eliminates this friction entirely. You define your environment once, and it runs identically everywhere.

Containerizing Spark makes sense for three primary reasons. First, reproducibility—your development environment matches your teammate’s environment matches your CI pipeline. Second, isolation—Spark’s dependencies won’t conflict with other tools on your system. Third, simplified cluster simulation—you can run a multi-node cluster on your laptop without configuring network interfaces or SSH keys.

This setup works best for local development, integration testing, and learning Spark internals. For production workloads processing terabytes of data, you’ll eventually move to Kubernetes or managed services like EMR or Databricks. But for everything else, Docker gets you running in minutes.

Prerequisites and Environment Setup

Before diving in, verify your Docker installation and ensure your system can handle Spark’s memory requirements. Spark is memory-hungry by design—it keeps data in RAM for fast iterative processing.

# Verify Docker installation
docker --version
# Expected: Docker version 24.0.0 or higher

# Check Docker Compose (now integrated into Docker CLI)
docker compose version
# Expected: Docker Compose version v2.20.0 or higher

# Verify Docker daemon is running
docker info | grep "Server Version"

For system requirements, allocate at least 8GB of RAM to Docker. On macOS or Windows, adjust this in Docker Desktop settings under Resources. Each Spark worker needs 1-2GB minimum, and the master needs another 512MB-1GB.

# Check Docker's available resources
docker system info | grep -E "Total Memory|CPUs"
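As a back-of-the-envelope check, the sizing rule above (1-2GB per worker, roughly 1GB for the master, plus some OS overhead) can be sketched in a few lines. The figures here are rough assumptions, not hard limits:

```python
# Rough sizing sketch: given Docker's memory budget, how many workers fit?
def max_workers(total_gb: float, worker_gb: float = 2.0, master_gb: float = 1.0,
                overhead_gb: float = 1.0) -> int:
    """Workers that fit after reserving memory for the master and OS overhead."""
    return max(0, int((total_gb - master_gb - overhead_gb) // worker_gb))

print(max_workers(8))  # → 3
print(max_workers(4))  # → 1
```

With the recommended 8GB allocation, three 2GB workers fit comfortably alongside the master.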

Spark containers communicate over Docker’s internal network. By default, docker-compose creates a bridge network where containers can reach each other by service name. The master listens on port 7077 for worker connections and 8080 for the web UI. Workers expose their own UIs on port 8081.
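Once the cluster is up, you can confirm those ports are actually reachable from the host with a quick stdlib-only sketch (the port numbers assume the default mappings described above):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the cluster running, each mapped port should accept connections:
for port in (7077, 8080, 8081):
    print(f"localhost:{port}", "open" if port_open("localhost", port) else "closed")
```

A "closed" result usually means the container isn't running or the port mapping is missing from your run command or compose file.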

Single-Node Spark Container Setup

For quick experimentation, a single container with Spark installed is the fastest path to running code. The Bitnami Spark image is well-maintained and includes sensible defaults.

# Pull the Spark image
docker pull bitnami/spark:3.5.1

# Run a single Spark container with interactive shell
docker run -it --rm \
  --name spark-standalone \
  -p 8080:8080 \
  -p 4040:4040 \
  -v $(pwd)/data:/opt/spark-data \
  -v $(pwd)/apps:/opt/spark-apps \
  bitnami/spark:3.5.1 \
  spark-shell

This command mounts two local directories: data for input/output files and apps for your Spark applications. The -p flags publish the application UI (4040) and port 8080. Note that spark-shell here runs in local mode, so nothing listens on 8080 unless you also start a standalone master—4040 is the UI you'll actually use.

For PySpark instead of the Scala shell:

docker run -it --rm \
  --name spark-standalone \
  -p 8080:8080 \
  -p 4040:4040 \
  -v $(pwd)/data:/opt/spark-data \
  bitnami/spark:3.5.1 \
  pyspark

Once inside the shell, verify Spark is working:

# In PySpark shell
df = spark.range(1000)
df.show(5)
df.count()

Multi-Node Cluster with Docker Compose

A single container doesn’t demonstrate Spark’s distributed nature. For realistic development, set up a cluster with one master and multiple workers. Create a docker-compose.yml file:

services:
  spark-master:
    image: bitnami/spark:3.5.1
    container_name: spark-master
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_MASTER_WEBUI_PORT=8080
    ports:
      - "8080:8080"
      - "7077:7077"
      - "4040:4040"  # driver UI when running spark-submit/pyspark from this container
    volumes:
      - ./data:/opt/spark-data
      - ./apps:/opt/spark-apps
      - ./conf:/opt/bitnami/spark/conf
      - spark-logs:/opt/bitnami/spark/logs
    networks:
      - spark-network

  spark-worker-1:
    image: bitnami/spark:3.5.1
    container_name: spark-worker-1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - "8081:8081"
    volumes:
      - ./data:/opt/spark-data
      - ./apps:/opt/spark-apps
      - spark-logs:/opt/bitnami/spark/logs
    depends_on:
      - spark-master
    networks:
      - spark-network

  spark-worker-2:
    image: bitnami/spark:3.5.1
    container_name: spark-worker-2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - "8082:8081"
    volumes:
      - ./data:/opt/spark-data
      - ./apps:/opt/spark-apps
      - spark-logs:/opt/bitnami/spark/logs
    depends_on:
      - spark-master
    networks:
      - spark-network

networks:
  spark-network:
    driver: bridge

volumes:
  spark-logs:

Start the cluster:

# Start all services
docker compose up -d

# Verify all containers are running
docker compose ps

# Check master logs
docker compose logs spark-master

# Watch worker registration
docker compose logs -f spark-worker-1

Navigate to http://localhost:8080 to see the Spark master UI. You should see both workers registered with their allocated memory and cores.
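Besides eyeballing the UI, the standalone master serves the same state as JSON at http://localhost:8080/json, which is handy for scripted checks. A small sketch for counting registered workers (the payload shown is abridged):

```python
import json
from urllib.request import urlopen

def alive_workers(payload: dict) -> int:
    """Count workers the master reports as ALIVE."""
    return sum(1 for w in payload.get("workers", []) if w.get("state") == "ALIVE")

def cluster_state(url: str = "http://localhost:8080/json") -> dict:
    """Fetch the standalone master's state from its JSON endpoint."""
    with urlopen(url) as resp:
        return json.load(resp)

# Abridged example of the payload shape the master returns:
sample = {"workers": [{"state": "ALIVE"}, {"state": "ALIVE"}]}
print(alive_workers(sample))  # → 2
```

With the two-worker compose file running, `alive_workers(cluster_state())` should report 2; anything less means a worker failed to register.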

Submitting and Running Spark Applications

With the cluster running, submit applications using spark-submit. First, create a sample PySpark application in your apps directory:

# apps/word_count.py
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder \
        .appName("WordCount") \
        .getOrCreate()
    
    # Create sample data
    data = [
        "Apache Spark is a unified analytics engine",
        "Spark provides high-level APIs in Java Scala Python and R",
        "Spark powers a stack of libraries for SQL streaming and machine learning"
    ]
    
    rdd = spark.sparkContext.parallelize(data)
    
    word_counts = rdd \
        .flatMap(lambda line: line.lower().split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .sortBy(lambda x: x[1], ascending=False)
    
    print("\nWord Counts:")
    for word, count in word_counts.collect():
        print(f"  {word}: {count}")
    
    spark.stop()

if __name__ == "__main__":
    main()

Submit the application to the cluster:

# Submit via docker exec
docker exec spark-master spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  /opt/spark-apps/word_count.py

While the job runs, access the application UI at http://localhost:4040 to monitor stages, tasks, and executors. This UI only exists while an application is running.

For interactive development, connect to the master and start a PySpark shell pointing at the cluster:

docker exec -it spark-master pyspark \
  --master spark://spark-master:7077 \
  --executor-memory 1g

Persistent Storage and Configuration

Docker containers are ephemeral by default. Without volume mounts, you lose everything when containers stop. The docker-compose file above already includes essential mounts, but let’s examine configuration in detail.

Create a custom Spark configuration file:

# conf/spark-defaults.conf
spark.driver.memory              1g
spark.executor.memory            1g
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.sql.adaptive.enabled       true
spark.sql.adaptive.coalescePartitions.enabled true
spark.eventLog.enabled           true
spark.eventLog.dir               /opt/bitnami/spark/logs
spark.history.fs.logDirectory    /opt/bitnami/spark/logs
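A typo in a key here is easy to miss, since Spark simply ignores settings it doesn't recognize. A small sketch for sanity-checking the file before restarting the cluster, assuming the standard whitespace-separated key/value format:

```python
def parse_spark_defaults(text: str) -> dict:
    """Parse spark-defaults.conf-style text (whitespace-separated pairs)."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1]
    return conf

sample = """
spark.driver.memory     1g
spark.eventLog.enabled  true
"""
settings = parse_spark_defaults(sample)
# every key should start with "spark." — a quick typo check
assert all(k.startswith("spark.") for k in settings)
print(settings)
```

Point it at your mounted conf/spark-defaults.conf and eyeball the resulting dict before a restart.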

Add a history server to your docker-compose for viewing completed applications:

  spark-history:
    image: bitnami/spark:3.5.1
    container_name: spark-history
    environment:
      # keep start-history-server.sh in the foreground so the container stays up
      - SPARK_NO_DAEMONIZE=true
    command: /opt/bitnami/spark/sbin/start-history-server.sh
    ports:
      - "18080:18080"
    volumes:
      - ./conf:/opt/bitnami/spark/conf
      - spark-logs:/opt/bitnami/spark/logs
    depends_on:
      - spark-master
    networks:
      - spark-network

For data persistence, organize your mounted volumes logically:

mkdir -p data/input data/output data/checkpoints apps conf

Troubleshooting and Production Considerations

The most common issue is memory. If you see java.lang.OutOfMemoryError or containers getting killed, increase Docker’s memory allocation or reduce SPARK_WORKER_MEMORY. Check actual usage:

docker stats --no-stream

Network issues manifest as workers failing to register. Verify the network exists and containers can communicate:

# Check the network (Compose prefixes it with the project directory name,
# so adjust spark-docker_ to match your directory)
docker network ls
docker network inspect spark-docker_spark-network

# Test connectivity from worker to master (ping may be missing from the
# image; getent at least verifies that the service name resolves)
docker exec spark-worker-1 getent hosts spark-master
docker exec spark-worker-1 ping -c 3 spark-master

If Spark UI shows executors as “dead,” check worker logs for connection timeouts:

docker compose logs spark-worker-1 | grep -i error

For production workloads, Docker Compose on a single machine won’t cut it. Consider Kubernetes with the Spark Operator when you need actual distributed processing across multiple physical nodes, dynamic scaling, and proper resource isolation. Managed services like AWS EMR, Google Dataproc, or Databricks handle the operational complexity for you.

However, this Docker setup remains valuable even with production clusters available. Use it for rapid prototyping, unit testing Spark jobs in CI pipelines, and learning Spark internals without cloud costs. The configuration patterns transfer directly to production—you’re just swapping the cluster manager.
