Running Spark Locally Without the Headaches
A minimal local Spark setup for developing and testing pipelines before deploying to a cluster.
Key Insights
- pip install pyspark bundles a standalone runtime; you don’t need Hadoop, YARN, or a cluster for development
- Reduce shuffle partitions from 200 to 4 locally and set driver memory based on your sample dataset size
- Write unit tests against small DataFrames locally, then deploy to your cluster with confidence
Spark’s documentation makes local development seem harder than it is. You don’t need Hadoop, YARN, or a cluster to develop and test Spark jobs.
Minimal Setup
pip install pyspark
That’s it. PySpark bundles a standalone Spark runtime.
Local SparkSession
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("dev")
    .config("spark.sql.shuffle.partitions", 4)
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)
Key configs for local development:
- local[*] uses all available cores
- Reduce shuffle partitions from the default 200; you’re not on a cluster
- Set driver memory based on your dataset size; it must take effect before the JVM starts, so configure it when creating the session
Testing Pattern
def transform_events(df):
    return (
        df.filter(df.event_type.isNotNull())
        .withColumn("date", df.timestamp.cast("date"))
        .groupBy("date", "event_type")
        .count()
    )
def test_transform_events(spark):
    input_df = spark.createDataFrame([
        ("2026-01-01 10:00:00", "click"),
        ("2026-01-01 11:00:00", "click"),
        ("2026-01-01 12:00:00", None),
    ], ["timestamp", "event_type"])

    result = transform_events(input_df)

    assert result.count() == 1
    assert result.first()["count"] == 2
Reading Local Files
df = spark.read.parquet("data/events/")
df = spark.read.option("header", True).csv("data/users.csv")
df = spark.read.json("data/logs/")
The Workflow
- Develop locally with sample data
- Write unit tests against small DataFrames
- Run integration tests against larger local datasets
- Deploy to your cluster with confidence
Local Spark development is fast feedback. Save the cluster for production runs.