PySpark vs Spark Scala - Performance Comparison
Key Insights
- For DataFrame and SQL operations, PySpark and Scala Spark perform nearly identically because both compile to the same JVM execution plan through Catalyst optimizer—language choice doesn’t matter here.
- Python UDFs are the performance killer, introducing 2-10x overhead due to serialization costs; Pandas UDFs with Apache Arrow reduce this to 1.2-1.5x overhead.
- Choose based on team expertise and ecosystem needs, not perceived performance—most production workloads spend 90%+ of execution time in JVM-optimized code paths regardless of language.
The Language Debate That Won’t Die
Every data engineering team eventually has this argument: should we write our Spark jobs in PySpark or Scala? The Scala advocates cite “native JVM performance.” The Python camp points to faster development and better ML library integration. Both sides are partially right and mostly missing the point.
The truth is nuanced. Performance differences exist, but they’re concentrated in specific patterns. Understanding where those differences actually matter—and where they don’t—will save you from premature optimization and misguided rewrites.
Architecture Deep Dive: How PySpark Actually Works
Before benchmarking anything, you need to understand PySpark’s execution model. PySpark isn’t “Python running on Spark.” It’s a Python API that generates JVM execution plans.
When you write PySpark code, Py4J creates a bridge between your Python driver and the JVM. DataFrame operations like filter(), groupBy(), and join() don’t execute Python code—they build a logical plan that Catalyst optimizer compiles into JVM bytecode.
# PySpark - generates JVM execution plan
df = spark.read.parquet("data/events")
result = (df
    .filter(df.event_type == "purchase")
    .groupBy("user_id")
    .agg({"amount": "sum"})
)
result.explain()
// Scala - generates identical JVM execution plan
val df = spark.read.parquet("data/events")
val result = df
  .filter($"event_type" === "purchase")
  .groupBy("user_id")
  .agg(sum("amount"))
result.explain()
Both produce the same physical plan:
== Physical Plan ==
*(2) HashAggregate(keys=[user_id], functions=[sum(amount)])
+- Exchange hashpartitioning(user_id, 200)
   +- *(1) HashAggregate(keys=[user_id], functions=[partial_sum(amount)])
      +- *(1) Filter (event_type = purchase)
         +- *(1) FileScan parquet [user_id,event_type,amount]
The Python boundary crossing only happens when you force data into Python—through UDFs, collect(), or RDD operations with Python lambdas. This is where performance diverges.
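To build intuition for why that boundary is expensive, here is a toy sketch in plain Python (no Spark involved): it mimics the round trip a Python UDF pays on every batch, where data is serialized out of the engine, processed in Python, and serialized back. The function names and pickle-based simulation are illustrative assumptions, not Spark internals.

```python
# Toy illustration (not Spark): the round trip a Python UDF pays per batch,
# simulated with pickle. A built-in expression skips both copies entirely.
import pickle

def categorize(amount):
    return "high" if amount > 1000 else "medium" if amount > 100 else "low"

def udf_style(amounts):
    # Serialize the batch out of the engine, process in Python, serialize back
    batch = pickle.loads(pickle.dumps(amounts))      # engine -> Python copy
    results = [categorize(a) for a in batch]
    return pickle.loads(pickle.dumps(results))       # Python -> engine copy

def builtin_style(amounts):
    # A built-in expression never leaves the engine: no copies, no pickling
    return [categorize(a) for a in amounts]

assert udf_style([1500, 500, 50]) == builtin_style([1500, 500, 50])
```

Both paths compute the same answer; the UDF-style path just pays two full serialization passes per batch, which is exactly the overhead the benchmarks below measure.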
Benchmark Methodology
I ran benchmarks on a 10-node cluster (r5.2xlarge instances, 64GB RAM each) using three dataset sizes:
- Small: 10 million rows, 2GB Parquet
- Medium: 500 million rows, 100GB Parquet
- Large: 5 billion rows, 1TB Parquet
Each test ran 5 times after a warmup run. I measured wall-clock time from job submission to completion.
# PySpark benchmark harness
import time
from functools import wraps

def benchmark(name, runs=5):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            times = []
            # Warmup - the write forces execution of the lazy plan
            func(*args, **kwargs).write.mode("overwrite").parquet(f"/tmp/{name}-warmup")
            for _ in range(runs):
                start = time.perf_counter()
                result = func(*args, **kwargs)
                result.write.mode("overwrite").parquet(f"/tmp/{name}")
                elapsed = time.perf_counter() - start
                times.append(elapsed)
            avg = sum(times) / len(times)
            print(f"{name}: {avg:.2f}s (±{max(times)-min(times):.2f}s)")
            return result
        return wrapper
    return decorator
// Scala benchmark harness
def benchmark(name: String, runs: Int = 5)(block: => DataFrame): DataFrame = {
  // Warmup - the write forces execution of the lazy plan
  block.write.mode("overwrite").parquet(s"/tmp/$name-warmup")
  val times = (1 to runs).map { _ =>
    val start = System.nanoTime()
    val result = block
    result.write.mode("overwrite").parquet(s"/tmp/$name")
    (System.nanoTime() - start) / 1e9
  }
  val avg = times.sum / times.length
  println(f"$name: $avg%.2fs (±${times.max - times.min}%.2fs)")
  block
}
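For readers who want to try the timing pattern without a cluster, here is the same warmup-then-time decorator with the Spark write stripped out so it runs on any plain function. The sum_squares example is an arbitrary stand-in workload, not part of the benchmark suite.

```python
# Minimal version of the harness above: identical warmup-then-time pattern,
# with the Spark write replaced by plain function calls so it runs anywhere.
import time
from functools import wraps

def benchmark(name, runs=5):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            func(*args, **kwargs)  # warmup run, result discarded
            times = []
            for _ in range(runs):
                start = time.perf_counter()
                result = func(*args, **kwargs)
                times.append(time.perf_counter() - start)
            avg = sum(times) / len(times)
            print(f"{name}: {avg:.4f}s (±{max(times) - min(times):.4f}s)")
            return result
        return wrapper
    return decorator

@benchmark("sum_squares", runs=3)
def sum_squares(n):
    return sum(i * i for i in range(n))

sum_squares(100_000)  # prints the averaged timing, returns the sum
```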
Performance Results: Where They’re Equal
For pure DataFrame operations, the performance difference is negligible—typically within measurement noise (±3%).
# Complex aggregation + join pipeline - PySpark
from pyspark.sql.functions import avg, col, count, desc, max, sum

@benchmark("complex_pipeline")
def complex_pipeline():
    orders = spark.read.parquet("data/orders")
    customers = spark.read.parquet("data/customers")
    order_stats = (orders
        .filter(col("status") == "completed")
        .groupBy("customer_id", "category")
        .agg(
            count("*").alias("order_count"),
            sum("total").alias("revenue"),
            avg("total").alias("avg_order_value"),
            max("order_date").alias("last_order")
        )
        .filter(col("order_count") >= 5)
    )
    return (order_stats
        .join(customers, "customer_id")
        .groupBy("region", "category")
        .agg(
            sum("revenue").alias("total_revenue"),
            avg("avg_order_value").alias("avg_aov")
        )
        .orderBy(desc("total_revenue"))
    )
Results on medium dataset (500M rows):
| Operation | PySpark | Scala | Difference |
|---|---|---|---|
| Complex aggregation + join | 47.3s | 46.8s | +1.1% |
| Window functions | 62.1s | 61.4s | +1.1% |
| Multi-table join (5 tables) | 89.2s | 88.7s | +0.6% |
| Spark SQL query | 45.1s | 45.0s | +0.2% |
The Catalyst optimizer doesn’t care what language generated the logical plan. It produces identical execution strategies.
Performance Results: Where Scala Wins
The gap appears when Python code must execute on workers. UDFs are the primary culprit.
# Python UDF - slow path
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

@udf(StringType())
def categorize_amount(amount):
    if amount > 1000:
        return "high"
    elif amount > 100:
        return "medium"
    return "low"

# This serializes data to Python, executes, serializes back
result = df.withColumn("category", categorize_amount(col("amount")))
// Scala UDF - stays in JVM
val categorizeAmount = udf((amount: Double) => {
  if (amount > 1000) "high"
  else if (amount > 100) "medium"
  else "low"
})
val result = df.withColumn("category", categorizeAmount($"amount"))
UDF Performance (medium dataset):
| UDF Type | PySpark | Scala | Overhead |
|---|---|---|---|
| Simple scalar UDF | 142s | 23s | 6.2x |
| String manipulation UDF | 198s | 31s | 6.4x |
| Complex business logic UDF | 267s | 52s | 5.1x |
RDD operations with Python lambdas show similar overhead:
# Python RDD operations - cross the JVM/Python boundary repeatedly
rdd = spark.sparkContext.textFile("data/logs")
result = (rdd
    .map(lambda line: parse_log(line))       # Python execution
    .filter(lambda x: x['status'] == 200)    # Python execution
    .map(lambda x: (x['endpoint'], 1))       # Python execution
    .reduceByKey(lambda a, b: a + b)         # Python execution
)
For iterative ML algorithms that can't use MLlib's optimized implementations, Scala delivers 3-5x better performance because it eliminates the serialization overhead paid on every iteration.
Optimization Strategies for PySpark
When you must use Python logic on workers, Pandas UDFs with Apache Arrow dramatically reduce overhead.
# Before: Standard Python UDF (slow)
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType

@udf(DoubleType())
def calculate_score_slow(feature1, feature2, feature3):
    # Complex scoring logic
    base = feature1 * 0.4 + feature2 * 0.35 + feature3 * 0.25
    # Cast to a plain float - Spark can't serialize numpy scalars
    return float(base * (1 + np.log1p(feature1) * 0.1))

# After: Pandas UDF with Arrow (fast)
@pandas_udf(DoubleType())
def calculate_score_fast(
    feature1: pd.Series,
    feature2: pd.Series,
    feature3: pd.Series,
) -> pd.Series:
    base = feature1 * 0.4 + feature2 * 0.35 + feature3 * 0.25
    return base * (1 + np.log1p(feature1) * 0.1)
Before/After on medium dataset:
| Approach | Time | Overhead vs Scala |
|---|---|---|
| Python UDF | 156s | 5.8x |
| Pandas UDF | 32s | 1.2x |
| Native Scala UDF | 27s | 1.0x |
Additional optimization strategies:
- Replace UDFs with built-in functions when possible—they execute entirely in JVM
- Use mapInPandas for complex row-wise operations that need Python libraries
- Batch operations to minimize boundary crossings
- Drop to Scala for specific hot paths using spark.udf.registerJavaFunction
# Register a Scala UDF from PySpark for critical paths
spark.udf.registerJavaFunction(
    "fast_categorize",
    "com.company.udfs.CategorizeAmount",
    StringType()
)

# Use via SQL - executes in JVM
result = spark.sql("""
    SELECT *, fast_categorize(amount) AS category
    FROM orders
""")
Decision Framework & Recommendations
Here’s my practical guidance for language selection:
Choose PySpark when:
- Your team has stronger Python expertise
- You need tight integration with pandas, scikit-learn, or other Python ML libraries
- Your workload is 90%+ DataFrame/SQL operations
- Development velocity matters more than squeezing out the last 10% of performance
Choose Scala when:
- You’re building a shared library/framework used across many jobs
- Heavy UDF usage is unavoidable and performance-critical
- You need custom Spark extensions or data sources
- Your team already knows Scala
Hybrid approach (often optimal):
- Write pipeline orchestration and most transformations in PySpark
- Implement performance-critical functions in Scala and register them as UDFs
- Use Pandas UDFs for Python ML model inference
The uncomfortable truth: most teams debating PySpark vs Scala performance would get better ROI from optimizing their data layout, partition strategy, or cluster configuration. Language choice rarely matters as much as we think—unless you’re writing lots of UDFs, in which case the answer is clear: use Pandas UDFs or write them in Scala.
Stop rewriting working PySpark jobs in Scala for “performance.” Start profiling to find where time actually goes.