GroupBy in PySpark vs Pandas vs SQL - Comparison
Key Insights
- Pandas groupby excels for datasets under 10GB on a single machine, offering the most intuitive syntax and fastest iteration for exploratory analysis
- PySpark’s lazy evaluation and distributed processing make it the clear choice for datasets exceeding memory limits, but the overhead isn’t worth it for small data
- SQL remains the most portable and often most readable option, especially when your data already lives in a database—don’t move data unnecessarily
Introduction
The groupby operation is fundamental to data analysis. Whether you’re calculating revenue by region, counting users by signup date, or computing average order values by customer segment, you’re grouping and aggregating data. It’s one of the first operations analysts learn and one of the most frequently used in production pipelines.
Pandas, PySpark, and SQL each handle groupby differently, and choosing the wrong tool costs you either development time or compute resources. I’ve seen teams struggle with Pandas on datasets that should have been in Spark, and I’ve watched engineers spin up Spark clusters for 50MB CSV files. Both mistakes are expensive.
This article gives you concrete syntax comparisons, performance characteristics, and a decision framework so you can pick the right tool without guessing.
Syntax Comparison
Let’s start with the basics. Here’s how you’d sum sales by region in each technology:
Pandas:
import pandas as pd
df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South'],
    'sales': [100, 200, 150, 300, 250]
})
result = df.groupby('region')['sales'].sum()
PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum
spark = SparkSession.builder.appName("groupby_demo").getOrCreate()
df = spark.createDataFrame([
    ('North', 100), ('South', 200), ('North', 150),
    ('East', 300), ('South', 250)
], ['region', 'sales'])
result = df.groupBy('region').agg(spark_sum('sales').alias('total_sales'))
SQL:
SELECT region, SUM(sales) AS total_sales
FROM sales_table
GROUP BY region;
The SQL syntax is the most declarative and readable. Pandas is concise but relies on method chaining that can get unwieldy. PySpark is verbose but explicit about what’s happening.
Notice PySpark uses groupBy (camelCase) while Pandas uses groupby (lowercase). This trips up developers constantly when switching between the two.
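One more pandas detail worth knowing: `groupby` on a single column returns a Series indexed by the group key, while PySpark and SQL return a table. If you want pandas output shaped like the other two, pass `as_index=False` (or call `.reset_index()`). A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South'],
    'sales': [100, 200, 150, 300, 250]
})

# Default: a Series indexed by region
series_result = df.groupby('region')['sales'].sum()

# as_index=False: a flat DataFrame, matching the SQL/PySpark tabular shape
table_result = df.groupby('region', as_index=False)['sales'].sum()
```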
Aggregation Functions
All three support standard aggregations, but the syntax for combining multiple operations differs significantly.
Pandas - Multiple Aggregations:
result = df.groupby('region').agg(
    total_sales=('sales', 'sum'),
    avg_sales=('sales', 'mean'),
    transaction_count=('sales', 'count'),
    max_sale=('sales', 'max')
)
PySpark - Multiple Aggregations:
from pyspark.sql.functions import sum as spark_sum, avg, count, max as spark_max
result = df.groupBy('region').agg(
    spark_sum('sales').alias('total_sales'),
    avg('sales').alias('avg_sales'),
    count('sales').alias('transaction_count'),
    spark_max('sales').alias('max_sale')
)
SQL - Multiple Aggregations:
SELECT
    region,
    SUM(sales) AS total_sales,
    AVG(sales) AS avg_sales,
    COUNT(sales) AS transaction_count,
    MAX(sales) AS max_sale
FROM sales_table
GROUP BY region;
Custom aggregations reveal bigger differences. Pandas makes this straightforward:
def range_func(x):
    return x.max() - x.min()

result = df.groupby('region')['sales'].agg(range_func)
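The same calculation can be written inline with named aggregation and a lambda, with no helper function needed — a small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South'],
    'sales': [100, 200, 150, 300, 250]
})

# Named aggregation with an inline lambda, equivalent to range_func
result = df.groupby('region')['sales'].agg(
    sales_range=lambda s: s.max() - s.min()
)
```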
PySpark requires User Defined Functions (UDFs), which are slower because they serialize data between JVM and Python:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
@udf(returnType=DoubleType())
def range_udf(values):
    return float(max(values) - min(values))

# UDFs on grouped data require collect_list first
from pyspark.sql.functions import collect_list

result = df.groupBy('region').agg(
    range_udf(collect_list('sales')).alias('sales_range')
)
In SQL, custom aggregations typically require database-specific syntax or creating stored procedures. Most databases don’t support arbitrary Python functions inline.
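That said, many "custom" aggregations reduce to expressions over built-in aggregates, which every database supports without UDFs or stored procedures. The range calculation above is one such case. A sketch using Python's stdlib sqlite3 to keep it runnable (table and column names match the examples above):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sales_table (region TEXT, sales INTEGER)')
conn.executemany(
    'INSERT INTO sales_table VALUES (?, ?)',
    [('North', 100), ('South', 200), ('North', 150),
     ('East', 300), ('South', 250)]
)

# Range expressed as a combination of built-in aggregates -- no UDF needed
rows = conn.execute(
    'SELECT region, MAX(sales) - MIN(sales) AS sales_range '
    'FROM sales_table GROUP BY region ORDER BY region'
).fetchall()
```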
Performance and Scalability
Here’s where the tools diverge dramatically. Pandas loads everything into memory on a single machine. PySpark distributes data across a cluster. SQL pushes computation to the database engine.
Pandas Performance Characteristics:
import time
# Generate test data
large_df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'] * 2_500_000,
    'sales': range(10_000_000)
})

start = time.time()
result = large_df.groupby('region')['sales'].sum()
print(f"Pandas groupby: {time.time() - start:.2f}s")
# Typical output: Pandas groupby: 0.45s for 10M rows
PySpark Execution Plan:
# Understanding what PySpark actually does
result = df.groupBy('region').agg(spark_sum('sales'))
result.explain(True)
Output shows the physical plan:
== Physical Plan ==
*(2) HashAggregate(keys=[region#0], functions=[sum(sales#1)])
+- Exchange hashpartitioning(region#0, 200)
   +- *(1) HashAggregate(keys=[region#0], functions=[partial_sum(sales#1)])
      +- *(1) Scan ExistingRDD[region#0,sales#1]
The Exchange step is a shuffle—data moves between partitions. This is expensive on large datasets but enables processing data that doesn’t fit in memory.
Rule of thumb: Use Pandas for data under 10GB. Use PySpark when data exceeds single-machine memory or when you need to integrate with existing Spark infrastructure. Use SQL when data lives in a database and you want to avoid moving it.
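The 10GB rule is easier to apply if you measure the DataFrame's actual memory footprint rather than the file size on disk — in-memory size, especially with string columns, can be several times larger than the source file. A quick check using pandas' own accounting:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'] * 250_000,
    'sales': range(1_000_000)
})

# deep=True counts the actual Python string objects, not just pointers
size_bytes = df.memory_usage(deep=True).sum()
size_gb = size_bytes / 1024**3
```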
Advanced GroupBy Operations
Real analysis rarely stops at simple aggregations. Here’s how each technology handles more complex requirements.
Multi-Column Grouping:
# Pandas
result = df.groupby(['region', 'product_category'])['sales'].sum()
# PySpark
result = df.groupBy('region', 'product_category').agg(spark_sum('sales'))
-- SQL
SELECT region, product_category, SUM(sales)
FROM sales_table
GROUP BY region, product_category;
Filtering Groups (HAVING equivalent):
# Pandas - keep rows from groups whose total sales exceed 1000
# Note: filter returns the original rows, not one aggregated row per group
result = df.groupby('region').filter(lambda x: x['sales'].sum() > 1000)
# PySpark
from pyspark.sql.functions import sum as spark_sum
result = df.groupBy('region').agg(
    spark_sum('sales').alias('total_sales')
).filter('total_sales > 1000')
-- SQL
SELECT region, SUM(sales) AS total_sales
FROM sales_table
GROUP BY region
HAVING SUM(sales) > 1000;
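SQL's HAVING returns one aggregated row per qualifying group. To get that exact shape in pandas, aggregate first and then filter the aggregated result — a sketch with a smaller threshold so the sample data produces a match:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South'],
    'sales': [100, 200, 150, 300, 250]
})

# Aggregate first, then filter the aggregated rows -- the HAVING pattern
totals = df.groupby('region', as_index=False)['sales'].sum()
having = totals[totals['sales'] > 400]
```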
Transform - Broadcast Aggregation Back to Original Rows:
This pattern is common: calculate a group statistic and add it as a column to every row.
# Pandas transform
df['region_total'] = df.groupby('region')['sales'].transform('sum')
df['pct_of_region'] = df['sales'] / df['region_total']
# PySpark window function equivalent
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as spark_sum, col
window_spec = Window.partitionBy('region')
df_with_total = df.withColumn(
    'region_total', spark_sum('sales').over(window_spec)
).withColumn(
    'pct_of_region', col('sales') / col('region_total')
)
-- SQL window function
SELECT
    *,
    SUM(sales) OVER (PARTITION BY region) AS region_total,
    sales * 1.0 / SUM(sales) OVER (PARTITION BY region) AS pct_of_region
FROM sales_table;
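One nice property of the transform pattern is that it is easy to sanity-check: each group's percentages must sum to exactly 1. A quick verification in pandas:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South'],
    'sales': [100, 200, 150, 300, 250]
})

df['region_total'] = df.groupby('region')['sales'].transform('sum')
df['pct_of_region'] = df['sales'] / df['region_total']

# Within each region, the percentage shares should sum to 1
check = df.groupby('region')['pct_of_region'].sum()
```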
Practical Considerations
Handling Null Values:
Each technology treats nulls differently in groupby operations:
# Pandas - nulls excluded from groups by default
df_with_nulls = pd.DataFrame({
    'region': ['North', None, 'North', 'South'],
    'sales': [100, 200, 150, 250]
})
# Include nulls in grouping
result = df_with_nulls.groupby('region', dropna=False)['sales'].sum()
# PySpark - nulls form their own group by default
df_spark = spark.createDataFrame([
    ('North', 100), (None, 200), ('North', 150), ('South', 250)
], ['region', 'sales'])
result = df_spark.groupBy('region').agg(spark_sum('sales'))
# Row with null region appears as separate group
-- SQL - nulls grouped together (behavior varies by database)
SELECT region, SUM(sales)
FROM sales_table
GROUP BY region;
-- NULL region typically appears as one group
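You can verify the SQL behavior directly — SQLite, for example, groups NULLs together, and wrapping the key in COALESCE gives the NULL group a readable label. A sketch using stdlib sqlite3 with the same sample data:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sales_table (region TEXT, sales INTEGER)')
conn.executemany(
    'INSERT INTO sales_table VALUES (?, ?)',
    [('North', 100), (None, 200), ('North', 150), ('South', 250)]
)

# NULLs form a single group; COALESCE labels it explicitly in the output
rows = conn.execute(
    "SELECT COALESCE(region, 'UNKNOWN') AS region, SUM(sales) "
    'FROM sales_table GROUP BY region ORDER BY 1'
).fetchall()
```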
Converting Between Formats:
# Pandas to PySpark
spark_df = spark.createDataFrame(pandas_df)
# PySpark to Pandas (careful with large datasets!)
pandas_df = spark_df.toPandas()
# SQL result to Pandas
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://...')
pandas_df = pd.read_sql('SELECT * FROM sales_table', engine)
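If you do pull Spark results into pandas regularly, Arrow-based transfer (available in Spark 3.x) can make toPandas substantially faster by avoiding row-by-row serialization. This is a config fragment, not standalone code — it assumes an active SparkSession named spark:

```python
# Enable Arrow-based columnar transfer for toPandas / createDataFrame
# (Spark 3.x; falls back to the slow path if Arrow is unavailable)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
```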
Decision Framework and Conclusion
| Factor | Pandas | PySpark | SQL |
|---|---|---|---|
| Data Size | < 10GB | > 10GB or distributed | Any (database handles it) |
| Setup Complexity | Minimal | High (cluster required) | Minimal (if DB exists) |
| Custom Aggregations | Excellent | Possible but slow | Limited |
| Debugging | Easy | Harder (distributed) | Medium |
| Best For | Exploration, prototyping | Production pipelines, big data | Data already in DB |
Choose Pandas when you’re exploring data interactively, building prototypes, or working with datasets that fit in memory. The development speed advantage is significant.
Choose PySpark when your data exceeds memory limits, you need to process data in a distributed environment, or you’re building production pipelines that must scale.
Choose SQL when your data lives in a database and you want to minimize data movement. SQL is also the most portable skill—it works everywhere.
Don’t overcomplicate this. Start with the simplest tool that handles your data size, and only add complexity when you hit real limitations.