How to GroupBy and Aggregate in Pandas
Key Insights
- GroupBy follows the split-apply-combine paradigm: split your data into groups, apply a function to each group, then combine the results back into a single structure.
- Use `.agg()` with dictionaries or named aggregations when you need different functions for different columns, and reserve `.apply()` for complex logic that built-in methods can’t handle.
- Choose `transform()` over `aggregate()` when you need results broadcast back to the original DataFrame shape, such as calculating group-relative values or filling missing data with group statistics.
Introduction to GroupBy Operations
Pandas GroupBy is one of the most powerful features for data analysis, yet many developers underutilize it or struggle with its syntax. At its core, GroupBy implements the split-apply-combine pattern: you split your data into groups based on some criteria, apply a function to each group independently, and combine the results into a new data structure.
You’ll reach for GroupBy whenever you need to answer questions like “What’s the average sales per region?” or “How many transactions did each customer make?” These are aggregation problems, and GroupBy handles them elegantly.
Let’s work with a realistic dataset throughout this article:
```python
import pandas as pd
import numpy as np

# Sample sales data
df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East', 'West', 'East', 'West'],
    'product': ['A', 'A', 'B', 'B', 'A', 'B', 'B', 'A'],
    'salesperson': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie', 'Charlie', 'Alice', 'Bob'],
    'revenue': [1200, 1500, 800, 950, 1100, 1300, 750, 1400],
    'units': [10, 12, 8, 9, 11, 13, 7, 14],
    'returns': [1, 0, 2, 1, 0, 1, 1, 0]
})
```
Basic GroupBy Syntax
The fundamental GroupBy operation takes a column name (or list of columns) and returns a DataFrameGroupBy object. This object is lazy—it doesn’t compute anything until you call an aggregation method.
```python
# Group by a single column
grouped = df.groupby('region')

# Apply aggregation
revenue_by_region = df.groupby('region')['revenue'].sum()
print(revenue_by_region)
# region
# East    3850
# West    5150
# Name: revenue, dtype: int64

# Multiple aggregations on the same column
revenue_stats = df.groupby('region')['revenue'].agg(['sum', 'mean', 'count'])
print(revenue_stats)
#          sum     mean  count
# region
# East    3850   962.50      4
# West    5150  1287.50      4
```
Grouping by multiple columns creates a hierarchical index:
```python
# Group by multiple columns
by_region_product = df.groupby(['region', 'product'])['revenue'].sum()
print(by_region_product)
# region  product
# East    A          2300
#         B          1550
# West    A          2900
#         B          2250
# Name: revenue, dtype: int64

# Access all numeric columns after groupby
# (numeric_only=True avoids summing the string salesperson column)
full_summary = df.groupby(['region', 'product']).sum(numeric_only=True)
print(full_summary)
```
You can also iterate over groups when you need to inspect or process them individually:
```python
# Iterate over groups
for name, group in df.groupby('region'):
    print(f"Region: {name}")
    print(group)
    print()

# Access a specific group
east_data = df.groupby('region').get_group('East')
```
Built-in Aggregation Functions
Pandas provides optimized implementations for common aggregations. These are faster than equivalent custom functions and should be your first choice:
| Function | Description |
|---|---|
| `sum()` | Sum of values |
| `mean()` | Arithmetic mean |
| `median()` | Median value |
| `count()` | Count of non-null values |
| `size()` | Count including nulls |
| `min()` / `max()` | Minimum / Maximum |
| `std()` / `var()` | Standard deviation / Variance |
| `first()` / `last()` | First / Last value |
| `nunique()` | Count of unique values |
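As a quick illustration of the `count()` vs `size()` distinction and of `nunique()`, here is a minimal sketch on a copy of the sample data with one revenue value made null (`demo` and `per_region` are names introduced just for this example):

```python
import pandas as pd
import numpy as np

# A small copy of the sample data with one null revenue value
demo = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East', 'West', 'East', 'West'],
    'salesperson': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie', 'Charlie', 'Alice', 'Bob'],
    'revenue': [np.nan, 1500, 800, 950, 1100, 1300, 750, 1400],
})

per_region = demo.groupby('region').agg(
    rows=('revenue', 'size'),           # rows per group, nulls included
    non_null=('revenue', 'count'),      # non-null values only
    sellers=('salesperson', 'nunique')  # distinct salespeople per region
)
print(per_region)
#         rows  non_null  sellers
# region
# East       4         3        2
# West       4         4        2
```

`size` counts the null row in East, while `count` skips it; that gap is a quick way to spot groups with missing data.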
```python
# Chain multiple aggregations
summary = df.groupby('region').agg({
    'revenue': ['sum', 'mean', 'std'],
    'units': ['sum', 'mean'],
    'returns': 'sum'
})
print(summary)
#        revenue                      units       returns
#            sum     mean        std    sum  mean     sum
# region
# East      3850   962.50 221.265301     36   9.0       4
# West      5150  1287.50 239.356777     48  12.0       2
```
Custom Aggregations with agg()
The .agg() method becomes essential when you need different aggregations for different columns or want to rename the output columns.
The dictionary syntax lets you specify which function to apply to each column:
```python
# Different functions for different columns
result = df.groupby('region').agg({
    'revenue': 'sum',
    'units': 'mean',
    'returns': 'max'
})
print(result)
#         revenue  units  returns
# region
# East       3850    9.0        2
# West       5150   12.0        1
```
For cleaner output with custom column names, use named aggregations:
```python
# Named aggregations (pandas 0.25+)
result = df.groupby('region').agg(
    total_revenue=pd.NamedAgg(column='revenue', aggfunc='sum'),
    avg_units=pd.NamedAgg(column='units', aggfunc='mean'),
    total_returns=pd.NamedAgg(column='returns', aggfunc='sum'),
    transaction_count=pd.NamedAgg(column='revenue', aggfunc='count')
)
print(result)
#         total_revenue  avg_units  total_returns  transaction_count
# region
# East             3850        9.0              4                  4
# West             5150       12.0              2                  4

# Shorthand syntax (equivalent)
result = df.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_units=('units', 'mean'),
    total_returns=('returns', 'sum')
)
```
Applying Custom Functions
When built-in aggregations aren’t enough, you can pass custom functions to .agg() or use .apply() for more complex logic.
Lambda functions work well for simple calculations:
```python
# Lambda for custom aggregation
result = df.groupby('region').agg({
    'revenue': lambda x: x.sum() / 1000,   # Revenue in thousands
    'units': lambda x: x.max() - x.min(),  # Range of units
    'returns': lambda x: (x > 0).sum()     # Count of rows with returns
})
print(result)

# Multiple custom functions
result = df.groupby('region')['revenue'].agg([
    ('total', 'sum'),
    ('average', 'mean'),
    ('range', lambda x: x.max() - x.min())
])
```
For complex logic involving multiple columns, use .apply() with a custom function:
```python
# Custom function that uses multiple columns
def revenue_per_unit_stats(group):
    rpu = group['revenue'] / group['units']
    return pd.Series({
        'avg_revenue_per_unit': rpu.mean(),
        'min_revenue_per_unit': rpu.min(),
        'max_revenue_per_unit': rpu.max()
    })

result = df.groupby('region').apply(revenue_per_unit_stats)
print(result)

# Calculate return rate as a percentage of units
def calculate_metrics(group):
    return pd.Series({
        'total_revenue': group['revenue'].sum(),
        'return_rate': (group['returns'].sum() / group['units'].sum()) * 100,
        'avg_order_value': group['revenue'].mean()
    })

metrics = df.groupby('region').apply(calculate_metrics)
```
Performance note: .apply() is slower than built-in methods because it can’t be vectorized. Use it only when necessary.
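When the per-group logic boils down to row-wise arithmetic, the `.apply()` version can often be rewritten with built-ins: compute the derived value once as a column, then aggregate it. A sketch of that rewrite for the revenue-per-unit example above (the `rpu` column name is introduced here for illustration):

```python
import pandas as pd

# Same sample data, reduced to the columns this example needs
sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East', 'West', 'East', 'West'],
    'revenue': [1200, 1500, 800, 950, 1100, 1300, 750, 1400],
    'units': [10, 12, 8, 9, 11, 13, 7, 14],
})

# Vectorized alternative: precompute the ratio, then use named
# aggregations instead of a Python-level apply()
result = (
    sales.assign(rpu=sales['revenue'] / sales['units'])
         .groupby('region')
         .agg(
             avg_revenue_per_unit=('rpu', 'mean'),
             min_revenue_per_unit=('rpu', 'min'),
             max_revenue_per_unit=('rpu', 'max'),
         )
)
print(result)
```

This produces the same statistics as the `revenue_per_unit_stats` function but stays on the optimized aggregation path.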
Transform vs Aggregate
This distinction trips up many developers. Aggregation reduces your data—you get one row per group. Transform returns data in the same shape as the input, broadcasting group-level results back to each row.
Use transform() when you need to add group statistics back to your original DataFrame:
```python
# Add group mean as a new column
df['region_avg_revenue'] = df.groupby('region')['revenue'].transform('mean')
print(df[['region', 'revenue', 'region_avg_revenue']])
#   region  revenue  region_avg_revenue
# 0   East     1200              962.50
# 1   West     1500             1287.50
# 2   East      800              962.50
# ...

# Calculate percentage of group total
df['pct_of_region'] = df['revenue'] / df.groupby('region')['revenue'].transform('sum') * 100

# Normalize within groups (z-score)
df['revenue_zscore'] = df.groupby('region')['revenue'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Fill missing values with group mean
df_with_nulls = df.copy()
df_with_nulls.loc[0, 'revenue'] = np.nan
df_with_nulls['revenue_filled'] = df_with_nulls.groupby('region')['revenue'].transform(
    lambda x: x.fillna(x.mean())
)
```
Practical Tips and Performance
Reset the index when you want a flat DataFrame instead of a hierarchical index:
```python
# Grouped result with MultiIndex
result = df.groupby(['region', 'product'])['revenue'].sum()
print(result.index)  # MultiIndex

# Reset to a flat DataFrame
result_flat = df.groupby(['region', 'product'])['revenue'].sum().reset_index()

# Or use as_index=False
result_flat = df.groupby(['region', 'product'], as_index=False)['revenue'].sum()
```
Handle NaN values explicitly:
```python
# By default, groupby excludes NaN keys
df_with_nan = df.copy()
df_with_nan.loc[0, 'region'] = np.nan

# Include NaN as a group (pandas 1.1+)
result = df_with_nan.groupby('region', dropna=False)['revenue'].sum()
```
Performance considerations for large datasets:
```python
# Use categorical dtype for groupby columns
df['region'] = df['region'].astype('category')
df['product'] = df['product'].astype('category')

# Avoid apply() when possible - use built-in methods
# Slow:
df.groupby('region')['revenue'].apply(lambda x: x.sum())
# Fast:
df.groupby('region')['revenue'].sum()

# Use observed=True with categoricals to skip empty combinations
df.groupby(['region', 'product'], observed=True)['revenue'].sum()

# For very large datasets, consider sort=False if order doesn't matter
df.groupby('region', sort=False)['revenue'].sum()
```
GroupBy operations are fundamental to data analysis in Pandas. Master the distinction between agg(), apply(), and transform(), prefer built-in aggregation functions for performance, and use named aggregations for readable code. These patterns will handle the vast majority of your aggregation needs.