Running Spark Locally Without the Headaches

A minimal local Spark setup for developing and testing pipelines before deploying to a cluster.

Key Insights

  • pip install pyspark bundles a standalone runtime — you don’t need Hadoop, YARN, or a cluster for development
  • Reduce shuffle partitions from 200 to 4 locally and set driver memory based on your sample dataset size
  • Write unit tests against small DataFrames locally, then deploy to your cluster with confidence

Spark’s documentation makes local development seem harder than it is. You don’t need Hadoop, YARN, or a cluster to develop and test Spark jobs.

Minimal Setup

pip install pyspark

That’s it. PySpark bundles a standalone Spark runtime; the only other prerequisite is a Java runtime on your PATH.

Local SparkSession

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("dev")
    .config("spark.sql.shuffle.partitions", 4)
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)

Key configs for local development:

  • local[*] runs everything in a single local JVM, using all available cores
  • Reduce shuffle partitions from the default 200 — small local datasets don’t need 200 tasks per shuffle
  • Set driver memory based on your dataset size — in local mode, all processing happens in the driver process

Testing Pattern

def transform_events(df):
    """Daily counts per event type, ignoring rows with no event type."""
    return (
        df.filter(df.event_type.isNotNull())
        .withColumn("date", df.timestamp.cast("date"))
        .groupBy("date", "event_type")
        .count()
    )

def test_transform_events(spark):  # `spark` is provided by a pytest fixture
    input_df = spark.createDataFrame([
        ("2026-01-01 10:00:00", "click"),
        ("2026-01-01 11:00:00", "click"),
        ("2026-01-01 12:00:00", None),  # dropped by the null filter
    ], ["timestamp", "event_type"])

    result = transform_events(input_df)
    assert result.count() == 1           # one (date, event_type) group survives
    assert result.first()["count"] == 2  # both click rows land in it

Reading Local Files

df = spark.read.parquet("data/events/")
df = spark.read.option("header", True).csv("data/users.csv")
df = spark.read.json("data/logs/")

The Workflow

  1. Develop locally with sample data
  2. Write unit tests against small DataFrames
  3. Run integration tests against larger local datasets
  4. Deploy to your cluster with confidence

Local Spark development is fast feedback. Save the cluster for production runs.
