Running Spark Locally Without the Headaches
A minimal local Spark setup for developing and testing pipelines before deploying to a cluster.
Key Insights
- pip install pyspark bundles a standalone runtime; you don’t need Hadoop, YARN, or a cluster for development
- Reduce shuffle partitions from 200 to 4 locally and set driver memory based on your sample dataset size
- Write unit tests against small DataFrames locally, then deploy to your cluster with confidence
Spark’s documentation makes local development seem harder than it is. You don’t need Hadoop, YARN, or a cluster to develop and test Spark jobs.
Minimal Setup
pip install pyspark
That’s it. PySpark bundles a standalone Spark runtime.
Local SparkSession
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("dev")
    .config("spark.sql.shuffle.partitions", 4)
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)
Key configs for local development:
- local[*] uses all available cores
- Reduce shuffle partitions from the default 200; you’re not on a cluster
- Set driver memory based on your dataset size; it must take effect before the JVM starts, so configure it when creating the session
Testing Pattern
def transform_events(df):
    return (
        df.filter(df.event_type.isNotNull())
        .withColumn("date", df.timestamp.cast("date"))
        .groupBy("date", "event_type")
        .count()
    )
def test_transform_events(spark):
    input_df = spark.createDataFrame([
        ("2026-01-01 10:00:00", "click"),
        ("2026-01-01 11:00:00", "click"),
        ("2026-01-01 12:00:00", None),
    ], ["timestamp", "event_type"])

    result = transform_events(input_df)

    assert result.count() == 1
    assert result.first()["count"] == 2
Reading Local Files
df = spark.read.parquet("data/events/")
df = spark.read.option("header", True).csv("data/users.csv")
df = spark.read.json("data/logs/")
The Workflow
- Develop locally with sample data
- Write unit tests against small DataFrames
- Run integration tests against larger local datasets
- Deploy to your cluster with confidence
Local Spark development is fast feedback. Save the cluster for production runs.