Pandas: GroupBy with DataFrames
Key Insights
- GroupBy follows the split-apply-combine pattern: split data into groups, apply a function to each group, then combine results back into a single structure
- Use agg() for reducing groups to single values, transform() for broadcasting results back to the original shape, and apply() for complex operations that don’t fit either pattern
- Converting groupby columns to categorical dtype before grouping can dramatically improve performance on large datasets with repeated values
Introduction to GroupBy Operations
The GroupBy operation is one of the most powerful features in pandas, yet many developers underutilize it or misuse it entirely. At its core, GroupBy implements the split-apply-combine paradigm: you split your data into groups based on some criteria, apply a function to each group independently, then combine the results into a new data structure.
This pattern appears constantly in real-world data analysis. You need sales totals by region. Average response times by endpoint. Customer counts by signup month. Any time you’re asking “what’s the X for each Y,” you’re thinking in GroupBy terms.
The alternative—writing loops to filter and aggregate manually—is slower, more error-prone, and harder to read. Master GroupBy and you’ll write cleaner, faster pandas code.
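To make that contrast concrete, here is a minimal sketch (on a small toy frame, not the article's full dataset) of the manual loop next to its one-line GroupBy equivalent:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'sales': [250, 150, 300, 275]
})

# Manual approach: loop over unique keys, filter, aggregate
totals = {}
for region in df['region'].unique():
    totals[region] = df.loc[df['region'] == region, 'sales'].sum()

# GroupBy approach: one expression, same result
grouped = df.groupby('region')['sales'].sum()
```

The loop scans the full frame once per key, while GroupBy does a single pass; the readability gap only widens as the aggregation logic grows.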
Basic GroupBy Syntax and Single-Column Grouping
Let’s start with a practical example. You have sales data and need to analyze it by region.
import pandas as pd
import numpy as np
# Sample sales data
df = pd.DataFrame({
'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North', 'South'],
'product': ['Widget', 'Widget', 'Gadget', 'Widget', 'Gadget', 'Gadget', 'Widget', 'Widget'],
'sales': [250, 150, 300, 275, 225, 180, 320, 195],
'quantity': [10, 8, 12, 11, 9, 7, 13, 10]
})
# Basic groupby with single aggregation
region_totals = df.groupby('region')['sales'].sum()
print(region_totals)
# region
# East 455
# North 870
# South 570
# Name: sales, dtype: int64
# Multiple aggregations on single column
region_stats = df.groupby('region')['sales'].agg(['sum', 'mean', 'count'])
print(region_stats)
# sum mean count
# region
# East 455 227.50 2
# North 870 290.00 3
# South 570 190.00 3
Notice that groupby() returns a GroupBy object, not a DataFrame. The aggregation happens when you call a method like sum() or agg(). This lazy evaluation means pandas doesn’t do unnecessary work if you chain operations.
The result has the groupby column as the index by default. If you want it as a regular column, pass as_index=False or call reset_index() afterward.
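For instance, with a small toy frame, the two approaches yield the same flat table:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'sales': [250, 150, 300, 275]
})

# Default: 'region' becomes the index of the result
by_region = df.groupby('region')['sales'].sum()

# Option 1: keep 'region' as a regular column
flat = df.groupby('region', as_index=False)['sales'].sum()

# Option 2: aggregate first, then flatten the index
flat2 = df.groupby('region')['sales'].sum().reset_index()
```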
Multi-Column Grouping
Real analysis often requires grouping by multiple dimensions simultaneously. Pass a list of column names to create hierarchical groups.
# Group by region AND product
multi_group = df.groupby(['region', 'product'])['sales'].sum()
print(multi_group)
# region product
# East Gadget 180
# Widget 275
# North Gadget 300
# Widget 570
# South Gadget 225
# Widget 345
# Name: sales, dtype: int64
# Get a DataFrame instead of Series
multi_group_df = df.groupby(['region', 'product'], as_index=False)['sales'].sum()
print(multi_group_df)
# region product sales
# 0 East Gadget 180
# 1 East Widget 275
# 2 North Gadget 300
# 3 North Widget 570
# 4 South Gadget 225
# 5 South Widget 345
# Unstack to create a pivot-like view
pivoted = df.groupby(['region', 'product'])['sales'].sum().unstack(fill_value=0)
print(pivoted)
# product Gadget Widget
# region
# East 180 275
# North 300 570
# South 225 345
The multi-index result can be powerful for hierarchical analysis, but it’s often easier to work with a flat DataFrame. Use as_index=False when you need to continue manipulating the results.
Aggregation Functions: agg, transform, and apply
Understanding the difference between these three methods is crucial. They serve fundamentally different purposes.
agg() reduces each group to a single row. Use it when you want summary statistics.
# Multiple aggregations on multiple columns
summary = df.groupby('region').agg({
'sales': ['sum', 'mean', 'std'],
'quantity': ['sum', 'mean']
})
print(summary)
# sales quantity
# sum mean std sum mean
# region
# East 455 227.50 67.175144 18 9.000000
# North 870 290.00 36.055513 35 11.666667
# South 570 190.00 37.749172 27 9.000000
transform() returns a result with the same shape as the input. Each row gets the aggregated value of its group. This is perfect for adding group-level statistics back to individual rows.
# Add group mean to each row
df['region_avg_sales'] = df.groupby('region')['sales'].transform('mean')
print(df[['region', 'sales', 'region_avg_sales']])
# region sales region_avg_sales
# 0 North 250 290.00
# 1 South 150 190.00
# 2 North 300 290.00
# 3 East 275 227.50
# 4 South 225 190.00
# 5 East 180 227.50
# 6 North 320 290.00
# 7 South 195 190.00
# Calculate percentage of group total
df['pct_of_region'] = df['sales'] / df.groupby('region')['sales'].transform('sum') * 100
apply() is the most flexible but slowest. Use it when you need operations that don’t fit agg() or transform().
# Custom function that returns multiple rows per group
def top_n_sales(group, n=1):
return group.nlargest(n, 'sales')
top_by_region = df.groupby('region').apply(top_n_sales, include_groups=False)  # include_groups requires pandas >= 2.2
print(top_by_region)
Named Aggregations and Custom Functions
Pandas 0.25 introduced named aggregations, which produce cleaner column names and allow mixing built-in and custom functions.
# Named aggregation syntax
result = df.groupby('region').agg(
total_sales=('sales', 'sum'),
avg_sales=('sales', 'mean'),
total_quantity=('quantity', 'sum'),
transaction_count=('sales', 'count')
)
print(result)
# total_sales avg_sales total_quantity transaction_count
# region
# East 455 227.50 18 2
# North 870 290.00 35 3
# South 570 190.00 27 3
# Custom functions with named aggregations
def coefficient_of_variation(x):
return x.std() / x.mean() * 100
result = df.groupby('region').agg(
total_sales=('sales', 'sum'),
sales_cv=('sales', coefficient_of_variation),
sales_range=('sales', lambda x: x.max() - x.min())
)
print(result)
Named aggregations are clearer than the dictionary syntax and should be your default choice for complex aggregations.
Filtering Groups and Handling Missing Data
Sometimes you need to filter out entire groups based on group-level criteria. The filter() method handles this elegantly.
# Keep only regions with more than 2 transactions
filtered = df.groupby('region').filter(lambda x: len(x) > 2)
print(filtered)
# region product sales quantity
# 0 North Widget 250 10
# 1 South Widget 150 8
# 2 North Gadget 300 12
# 4 South Gadget 225 9
# 6 North Widget 320 13
# 7 South Widget 195 10
# Keep groups where average sales exceed threshold
high_performers = df.groupby('region').filter(lambda x: x['sales'].mean() > 200)
For missing data: by default, GroupBy drops rows whose value in a grouping column is NaN. Control this with the dropna parameter.
# Sample data with missing values
df_with_nan = df.copy()
df_with_nan.loc[0, 'region'] = np.nan
# Default: excludes NaN groups
print(df_with_nan.groupby('region')['sales'].sum())
# Include NaN as a group
print(df_with_nan.groupby('region', dropna=False)['sales'].sum())
Performance Tips and Common Pitfalls
GroupBy performance matters when you’re working with millions of rows. Here are concrete optimizations.
Convert to categorical before grouping. This is the single biggest performance win for columns with repeated values.
import time
# Create larger dataset for benchmarking
large_df = pd.DataFrame({
'region': np.random.choice(['North', 'South', 'East', 'West'], size=1_000_000),
'sales': np.random.randint(100, 1000, size=1_000_000)
})
# Without categorical
start = time.time()
result1 = large_df.groupby('region')['sales'].sum()
print(f"String groupby: {time.time() - start:.3f}s")
# With categorical
large_df['region_cat'] = large_df['region'].astype('category')
start = time.time()
result2 = large_df.groupby('region_cat')['sales'].sum()
print(f"Categorical groupby: {time.time() - start:.3f}s")
# Typically 2-5x faster
Avoid apply() when possible. It’s the slowest GroupBy method because it can’t be vectorized. If you can express your operation with agg() or transform(), do so.
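As an illustration, a per-group demeaning step written with apply() can usually be re-expressed with transform(), which lets the arithmetic run as one vectorized operation over the whole column:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South'],
    'sales': [250, 150, 300, 190]
})

# Slow: apply() calls a Python function once per group
slow = df.groupby('region')['sales'].apply(lambda s: s - s.mean())

# Fast: transform() broadcasts each group's mean back to its rows,
# so the subtraction is a single vectorized column operation
fast = df['sales'] - df.groupby('region')['sales'].transform('mean')
```

Both produce the same demeaned values; only the second avoids per-group Python overhead.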
Use built-in aggregation functions. String names like 'sum' and 'mean' are optimized in Cython. Lambda functions force slower Python execution.
# Fast: uses optimized Cython code
df.groupby('region')['sales'].agg('sum')
# Slower: Python function call overhead
df.groupby('region')['sales'].agg(lambda x: x.sum())
Reset index strategically. If you’re chaining operations, keeping the grouped index can be useful. If you’re done aggregating, reset_index() or as_index=False gives you a cleaner DataFrame to work with.
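A small sketch of both situations, on a toy frame: keep the grouped index while chaining label-based steps, and flatten it when you want to merge results back onto the raw rows (the region_total column name here is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'sales': [250, 150, 300, 275]
})

# Keeping the index: chain label-based operations directly
top = df.groupby('region')['sales'].sum().sort_values(ascending=False)

# Done aggregating: flatten, rename, and merge back onto the raw rows
totals = df.groupby('region', as_index=False)['sales'].sum()
totals = totals.rename(columns={'sales': 'region_total'})
enriched = df.merge(totals, on='region')
```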
GroupBy is fundamental to pandas data analysis. Learn these patterns, understand when to use each method, and your data manipulation code will be cleaner, faster, and more maintainable.