Pandas - GroupBy and Aggregate (agg)
Key Insights
- GroupBy splits data into groups based on specified columns, applies functions to each group independently, and combines results into a structured output—understanding this split-apply-combine pattern is fundamental to efficient data aggregation in Pandas.
- The agg() function provides flexible aggregation with multiple operations per column, named aggregations for clean output, and support for both built-in functions and custom lambdas, eliminating the need for verbose manual operations.
- Performance optimization requires strategic choices: use built-in aggregation methods over custom functions when possible, leverage transform() for broadcasting results back to the original DataFrame shape, and apply filter() to exclude entire groups based on conditions.
Understanding the GroupBy Mechanism
GroupBy operations follow a split-apply-combine pattern. Pandas splits your DataFrame into groups based on one or more keys, applies a function to each group, and combines the results.
import pandas as pd
import numpy as np
# Sample sales data
data = {
'region': ['North', 'North', 'South', 'South', 'East', 'East'],
'product': ['A', 'B', 'A', 'B', 'A', 'B'],
'sales': [100, 150, 200, 120, 180, 90],
'quantity': [10, 15, 20, 12, 18, 9]
}
df = pd.DataFrame(data)
# Basic groupby with single aggregation
result = df.groupby('region')['sales'].sum()
print(result)
# region
# East 270
# North 250
# South 320
The GroupBy object is lazy—it doesn’t compute anything until you call an aggregation method. This allows chaining operations efficiently.
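To see this laziness directly, you can inspect the intermediate object. A minimal self-contained check (re-creating the sample data from above): creating the GroupBy object does no work, and the aggregation only runs when a method like sum() is called.

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'sales': [100, 150, 200, 120, 180, 90],
})

# Creating the GroupBy object performs no computation yet
grouped = df.groupby('region')
print(type(grouped).__name__)  # DataFrameGroupBy

# The aggregation only executes here
print(grouped['sales'].sum())
```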
# Multiple columns as grouping keys
grouped = df.groupby(['region', 'product'])
print(grouped.size())
# region product
# East A 1
# B 1
# North A 1
# B 1
# South A 1
# B 1
Single and Multiple Aggregations with agg()
The agg() method accepts various input formats. For single aggregations, pass a function name or callable.
# Single aggregation function
result = df.groupby('region').agg('sum')
print(result)
# sales quantity
# region
# East 270 27
# North 250 25
# South 320 32
# Multiple aggregation functions
result = df.groupby('region')['sales'].agg(['sum', 'mean', 'count'])
print(result)
# sum mean count
# region
# East 270 135.0 2
# North 250 125.0 2
# South 320 160.0 2
Apply different aggregations to different columns using a dictionary.
# Different functions for different columns
agg_dict = {
'sales': ['sum', 'mean'],
'quantity': ['min', 'max']
}
result = df.groupby('region').agg(agg_dict)
print(result)
# sales quantity
# sum mean min max
# region
# East 270 135.0 9 18
# North 250 125.0 10 15
# South 320 160.0 12 20
Named Aggregations for Clean Output
Named aggregations provide cleaner column names and better readability. Use pd.NamedAgg or the tuple syntax.
# Named aggregations (Pandas 0.25+)
result = df.groupby('region').agg(
total_sales=('sales', 'sum'),
avg_sales=('sales', 'mean'),
max_quantity=('quantity', 'max'),
min_quantity=('quantity', 'min')
)
print(result)
# total_sales avg_sales max_quantity min_quantity
# region
# East 270 135.0 18 9
# North 250 125.0 15 10
# South 320 160.0 20 12
# Access results directly
print(result['total_sales'])
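The tuple syntax above is shorthand for pd.NamedAgg, which the text mentions but the example does not show. A self-contained sketch of the explicit form, producing the same result:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'sales': [100, 150, 200, 120, 180, 90],
})

# pd.NamedAgg(column=..., aggfunc=...) is the explicit equivalent
# of the (column, function) tuple syntax
result = df.groupby('region').agg(
    total_sales=pd.NamedAgg(column='sales', aggfunc='sum'),
    avg_sales=pd.NamedAgg(column='sales', aggfunc='mean'),
)
print(result)
```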
This approach eliminates MultiIndex columns and makes downstream processing simpler.
Custom Aggregation Functions
Use lambda functions or custom callables for complex aggregations.
# Custom aggregation with lambda
result = df.groupby('region').agg(
sales_range=('sales', lambda x: x.max() - x.min()),
weighted_avg=('sales', lambda x: np.average(x, weights=df.loc[x.index, 'quantity']))
)
print(result)
# sales_range weighted_avg
# region
# East 90 150.0
# North 50 130.0
# South 80 170.0
# Reusable custom function
def coefficient_of_variation(x):
    return x.std() / x.mean() if x.mean() != 0 else 0
result = df.groupby('region').agg(
sales_cv=('sales', coefficient_of_variation),
qty_cv=('quantity', coefficient_of_variation)
)
print(result)
Custom functions execute slower than built-in methods. Use NumPy functions when possible for better performance.
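As a quick sanity check of the equivalence (on the same sample data): a string name dispatches to the optimized built-in path, while a lambda runs Python-level code per group, yet both yield identical results.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'sales': [100, 150, 200, 120, 180, 90],
})

# Same result, different execution paths: the string name uses the
# optimized built-in; the lambda runs Python code for every group
fast = df.groupby('region')['sales'].agg('mean')
slow = df.groupby('region')['sales'].agg(lambda x: np.mean(x))
print(fast.equals(slow))  # True
```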
Working with Multiple GroupBy Levels
Hierarchical grouping creates MultiIndex results. Reset or manipulate indexes as needed.
# Multi-level groupby
result = df.groupby(['region', 'product']).agg(
total_sales=('sales', 'sum'),
total_qty=('quantity', 'sum')
)
print(result)
# total_sales total_qty
# region product
# East A 180 18
# B 90 9
# North A 100 10
# B 150 15
# South A 200 20
# B 120 12
# Flatten MultiIndex
result_flat = result.reset_index()
print(result_flat)
# region product total_sales total_qty
# 0 East A 180 18
# 1 East B 90 9
# 2 North A 100 10
# 3 North B 150 15
# 4 South A 200 20
# 5 South B 120 12
# Access specific level
result_by_product = result.groupby(level='product').sum()
print(result_by_product)
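Besides re-grouping, a cross-section is often all you need: xs() pulls all rows for one value of a MultiIndex level without any further aggregation. A self-contained sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'sales': [100, 150, 200, 120, 180, 90],
})

result = df.groupby(['region', 'product']).agg(total_sales=('sales', 'sum'))

# xs() selects a cross-section from one MultiIndex level
product_a = result.xs('A', level='product')
print(product_a)
```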
Transform vs Aggregate
While agg() reduces groups to summary statistics, transform() broadcasts results back to the original DataFrame shape.
# Add group statistics to original DataFrame
df['region_avg_sales'] = df.groupby('region')['sales'].transform('mean')
df['pct_of_region_sales'] = df['sales'] / df['region_avg_sales'] * 100
print(df)
# region product sales quantity region_avg_sales pct_of_region_sales
# 0 North A 100 10 125.0 80.000000
# 1 North B 150 15 125.0 120.000000
# 2 South A 200 20 160.0 125.000000
# 3 South B 120 12 160.0 75.000000
# 4 East A 180 18 135.0 133.333333
# 5 East B 90 9 135.0 66.666667
Transform maintains the original index and row count, making it ideal for feature engineering.
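A common feature-engineering use of this property is within-group standardization. A minimal self-contained sketch: each sales value is expressed as a z-score relative to its own region.

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'sales': [100, 150, 200, 120, 180, 90],
})

# Within-group z-score: transform() broadcasts each group's mean and
# std back to the original rows, so the arithmetic aligns by index
group = df.groupby('region')['sales']
df['sales_z'] = (df['sales'] - group.transform('mean')) / group.transform('std')
print(df)
```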
Filtering Groups
Use filter() to exclude entire groups based on conditions.
# Keep only regions with total sales > 260
filtered = df.groupby('region').filter(lambda x: x['sales'].sum() > 260)
print(filtered)
# region product sales quantity
# 2 South A 200 20
# 3 South B 120 12
# 4 East A 180 18
# 5 East B 90 9
# Combine filter with aggregation
result = (df.groupby('region')
.filter(lambda x: x['sales'].sum() > 260)
.groupby('region')
.agg(total_sales=('sales', 'sum')))
print(result)
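For large data, an equivalent approach avoids the Python-level lambda that filter() calls per group: broadcast the group total with transform() and index with a boolean mask. A self-contained sketch reproducing the same filtered rows:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'sales': [100, 150, 200, 120, 180, 90],
})

# Equivalent to filter(): broadcast each group's total to its rows,
# then keep rows whose group total exceeds the threshold
mask = df.groupby('region')['sales'].transform('sum') > 260
filtered = df[mask]
print(filtered)
```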
Performance Considerations
Built-in aggregation methods are optimized in C and significantly faster than custom functions.
# Efficient: built-in function (%timeit is an IPython/Jupyter magic)
%timeit df.groupby('region')['sales'].sum()
# Less efficient: Python-level lambda
%timeit df.groupby('region')['sales'].agg(lambda x: x.sum())
# For multiple operations, compute once
result = df.groupby('region')['sales'].agg([
'sum', 'mean', 'std', 'min', 'max'
])
# Avoid repeated groupby operations
grouped = df.groupby('region')
sales_sum = grouped['sales'].sum()
sales_mean = grouped['sales'].mean()
For large datasets, use observed=True when grouping categorical data to exclude unobserved categories.
df['region'] = df['region'].astype('category')
result = df.groupby('region', observed=True).agg({'sales': 'sum'})
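Another small lever worth benchmarking on your own data: groupby sorts group keys by default, and sort=False skips that step, leaving groups in first-occurrence order. A self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'sales': [100, 150, 200, 120, 180, 90],
})

# sort=False skips sorting the group keys; groups appear in the
# order they first occur in the data
result = df.groupby('region', sort=False)['sales'].sum()
print(list(result.index))  # ['North', 'South', 'East']
```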
Practical Example: Sales Analysis
# Comprehensive sales analysis
sales_data = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=100),
'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
'product': np.random.choice(['A', 'B', 'C'], 100),
'sales': np.random.randint(50, 500, 100),
'cost': np.random.randint(20, 200, 100)
})
sales_data['month'] = sales_data['date'].dt.to_period('M')
# Multi-dimensional analysis
analysis = sales_data.groupby(['month', 'region']).agg(
total_revenue=('sales', 'sum'),
total_cost=('cost', 'sum'),
avg_transaction=('sales', 'mean'),
transaction_count=('sales', 'count'),
revenue_std=('sales', 'std')
)
analysis['profit_margin'] = ((analysis['total_revenue'] - analysis['total_cost']) /
analysis['total_revenue'] * 100)
analysis['cv'] = analysis['revenue_std'] / analysis['avg_transaction']
print(analysis.head(10))
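For reporting, a MultiIndexed result like the analysis above often reads better as a grid. One way is unstack(), which pivots an index level into columns; shown here on the small fixed sample so the output is deterministic:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'sales': [100, 150, 200, 120, 180, 90],
})

result = df.groupby(['region', 'product']).agg(total_sales=('sales', 'sum'))

# unstack() pivots the 'product' level into columns: one row per region
grid = result['total_sales'].unstack('product')
print(grid)
```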
This pattern handles real-world analytics requirements efficiently, combining multiple aggregation types with calculated fields.