Apache Spark - Install on Local Machine

Key Insights

  • Apache Spark requires a Java JDK (version 8, 11, or 17) as a prerequisite—getting JAVA_HOME configured correctly resolves the vast majority of installation headaches
  • The standalone Spark installation gives you spark-shell, pyspark, and spark-submit out of the box, making it superior to pip-only installations for learning and development
  • Environment variable configuration is where most developers stumble; copy the exact export statements and you’ll have a working installation in under 10 minutes

Introduction & Prerequisites

Apache Spark is a distributed computing framework that processes large datasets across clusters. But here’s the thing—you don’t need a cluster to learn Spark or develop applications. A local installation on your laptop is the fastest way to prototype, test transformations, and understand Spark’s execution model before deploying to production clusters.

Running Spark locally means you get the full API surface, the Spark UI for debugging, and an environment that mirrors production behavior (minus the distributed overhead). Whether you’re building ETL pipelines, training ML models with MLlib, or processing streaming data, local development is where it starts.

Before we begin, ensure you have:

  • Java JDK 8, 11, or 17 (LTS versions recommended; Spark 3.5+ supports Java 17)
  • 4GB+ RAM (8GB recommended for comfortable development)
  • 2GB+ free disk space for Spark binaries and temporary files
  • Python 3.8+ (optional, required only for PySpark)
  • macOS, Linux, or Windows 10/11 with admin/sudo access
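Before moving on, you can sanity-check the Java prerequisite with a short script. This is a minimal sketch using only the Python standard library; it confirms that a `java` binary is on your PATH and prints its version banner (eyeball the major version yourself—the script doesn’t validate it):

```python
# Quick prerequisite check: is Java on PATH, and what version is it?
import shutil
import subprocess

def check_java():
    """Return the `java -version` banner, or None if Java isn't installed."""
    java = shutil.which("java")
    if java is None:
        return None
    # by convention, `java -version` prints its banner to stderr
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return result.stderr.strip() or result.stdout.strip()

banner = check_java()
print(banner if banner else "Java not found - install a JDK first")
```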

Installing Java JDK

Spark runs on the JVM, so Java is non-negotiable. I recommend OpenJDK 11 for maximum compatibility with Spark 3.x versions.

On macOS (using Homebrew):

brew install openjdk@11

# Symlink so the system Java wrappers can find it
# (on Intel Macs, replace /opt/homebrew with /usr/local)
sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk

On Ubuntu/Debian:

sudo apt update
sudo apt install openjdk-11-jdk

On Windows:

Download the installer from Adoptium and run it. Choose the MSI installer for automatic PATH configuration.

After installation, configure JAVA_HOME. This environment variable tells Spark (and other tools) where Java lives:

# Find your Java installation path
# macOS
/usr/libexec/java_home -V

# Linux
update-alternatives --list java

Verify everything works:

java -version
# Expected output: openjdk version "11.0.x" ...

echo $JAVA_HOME
# Should print your Java installation path, e.g., /usr/lib/jvm/java-11-openjdk-amd64

If JAVA_HOME is empty, you’ll need to set it in your shell profile (covered in “Configuring Environment Variables” below).

Downloading and Installing Spark

Head to the Apache Spark downloads page and select:

  1. Spark release: Choose the latest stable version (3.5.x as of writing)
  2. Package type: “Pre-built for Apache Hadoop 3.3 and later” works for most use cases

Download directly via command line:

# Download Spark 3.5.3 (adjust version as needed)
cd ~/Downloads
wget https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz

# Or use curl on macOS
curl -O https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz

Extract and move to a permanent location:

# Extract the archive
tar -xzf spark-3.5.3-bin-hadoop3.tgz

# Move to /opt (Linux/macOS) or a location of your choice
sudo mv spark-3.5.3-bin-hadoop3 /opt/spark

# Or keep it in your home directory (no sudo required)
mv spark-3.5.3-bin-hadoop3 ~/spark

I prefer /opt/spark on Linux/macOS because it’s the conventional location for third-party software and keeps your home directory clean.

On Windows:

Extract the downloaded .tgz file using 7-Zip or similar. Move the extracted folder to C:\spark. Avoid paths with spaces—they cause endless headaches with Spark scripts.

Configuring Environment Variables

This is where installations succeed or fail. You need to set SPARK_HOME and add Spark’s bin directory to your PATH.

For Bash (Linux, older macOS):

Add these lines to ~/.bashrc:

# Java configuration
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# Spark configuration
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

For Zsh (macOS default, modern Linux):

Add to ~/.zshrc:

# Java configuration (macOS with Homebrew)
export JAVA_HOME=$(/usr/libexec/java_home -v 11)

# Spark configuration
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Reload your shell configuration:

source ~/.bashrc  # or source ~/.zshrc

For Windows (PowerShell):

Set environment variables through System Properties, or use PowerShell:

# Run as Administrator
[Environment]::SetEnvironmentVariable("SPARK_HOME", "C:\spark", "Machine")
[Environment]::SetEnvironmentVariable("JAVA_HOME", "C:\Program Files\Eclipse Adoptium\jdk-11.0.20.8-hotspot", "Machine")

# Add to PATH
$currentPath = [Environment]::GetEnvironmentVariable("Path", "Machine")
[Environment]::SetEnvironmentVariable("Path", "$currentPath;C:\spark\bin", "Machine")

Restart your terminal after setting Windows environment variables.

Verifying the Installation

Time to confirm everything works. Start the Spark shell:

spark-shell

You should see Spark’s ASCII art logo and a Scala prompt. If you get errors about Java or missing classes, revisit your environment variables.

Run a quick test to ensure Spark is functional:

// Create a simple RDD
val numbers = sc.parallelize(1 to 100)
val sum = numbers.reduce(_ + _)
println(s"Sum of 1 to 100: $sum")
// Output: Sum of 1 to 100: 5050

// Test DataFrame API
val df = spark.range(10).toDF("number")
df.show()

// Exit the shell
:quit

While spark-shell is running, open your browser to http://localhost:4040. This is the Spark UI—your window into job execution, stages, storage, and environment configuration. Bookmark it; you’ll use it constantly for debugging.

Test PySpark as well:

pyspark
# In the PySpark shell
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.show()

# Test a transformation
df.filter(df.id > 1).show()

exit()

Optional: Installing PySpark via pip

If you only need PySpark and want to skip the full Spark installation, pip works:

pip install pyspark

This installs PySpark as a Python package with bundled Spark binaries. It’s convenient but has limitations:

Use pip install when:

  • You’re building Python-only applications
  • You need PySpark in a virtual environment or container
  • You want version pinning via requirements.txt

Use standalone installation when:

  • You need spark-shell (Scala REPL)
  • You want spark-submit for deploying applications
  • You’re learning Spark and want the full toolkit
  • You need to customize Spark configuration files

A basic PySpark script with pip-installed PySpark:

# test_pyspark.py
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("LocalTest") \
    .master("local[*]") \
    .getOrCreate()

data = [("Java", 20000), ("Python", 100000), ("Scala", 3000)]
df = spark.createDataFrame(data, ["language", "users"])

df.show()
df.groupBy("language").sum("users").show()

spark.stop()

Run it with:

python test_pyspark.py

Troubleshooting Common Issues

“JAVA_HOME is not set”

This is the most common error. Double-check your shell profile has the export statement and you’ve reloaded it:

# Verify JAVA_HOME is set
echo $JAVA_HOME

# If empty, check your .bashrc/.zshrc has the export
grep JAVA_HOME ~/.bashrc ~/.zshrc

# Reload
source ~/.bashrc

Java version mismatch

Spark 3.x requires Java 8, 11, or 17. Newer, unsupported versions (Java 18 and up) typically fail with cryptic reflection or illegal-access errors:

# Check your version
java -version

# If you have multiple Java versions, explicitly set JAVA_HOME to a compatible one
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Out of memory errors

Spark’s default memory allocation is conservative. Create or edit $SPARK_HOME/conf/spark-defaults.conf:

spark.driver.memory              4g
spark.executor.memory            4g
spark.driver.maxResultSize       2g

If spark-defaults.conf doesn’t exist, copy the template:

cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf

Permission denied on /opt/spark

Either use sudo for operations, or change ownership:

sudo chown -R $USER:$USER /opt/spark

Windows: “winutils.exe not found”

Spark on Windows needs Hadoop’s winutils.exe. Download it from a winutils repository, place it in C:\hadoop\bin, and set:

[Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\hadoop", "Machine")

Port 4040 already in use

Another Spark session is running, or the port is occupied. Either stop the other session or specify a different port:

spark-shell --conf spark.ui.port=4041

You now have a fully functional local Spark installation. Start building data pipelines, experiment with MLlib, or prototype your next big data application—all from your laptop.
