Pandas GroupBy - Complete Guide with Examples

Key Insights

• GroupBy operations split data into groups, apply functions, and combine results—understanding this split-apply-combine pattern is essential for efficient data analysis
• Aggregation, transformation, and filtration are three distinct GroupBy operations that serve different purposes and return different shaped outputs
• Custom aggregation functions and named aggregations provide flexibility beyond built-in methods while maintaining readable, maintainable code

Understanding the Split-Apply-Combine Pattern

Pandas GroupBy implements the split-apply-combine strategy for data analysis. The operation splits your DataFrame into groups based on specified criteria, applies a function to each group independently, and combines the results into a new data structure.

import pandas as pd
import numpy as np

# Sample sales data
data = {
    'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North', 'South'],
    'product': ['A', 'B', 'A', 'B', 'A', 'A', 'B', 'B'],
    'sales': [100, 150, 120, 200, 130, 180, 140, 160],
    'units': [10, 15, 12, 20, 13, 18, 14, 16]
}
df = pd.DataFrame(data)

# Basic groupby operation
grouped = df.groupby('region')
print(grouped.groups)
# {'East': [3, 5], 'North': [0, 2, 6], 'South': [1, 4, 7]}

The grouped object is a DataFrameGroupBy instance that hasn’t computed anything yet—it’s lazy. Computation happens when you apply an aggregation or transformation method.
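You can see this laziness directly: creating the GroupBy object is cheap, and nothing is computed until you request an aggregation. A minimal sketch (using a small frame made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North'],
    'sales': [100, 150, 120],
})

# Creating the GroupBy object does no computation
grouped = df.groupby('region')
print(type(grouped).__name__)  # DataFrameGroupBy

# The actual work happens here, when an aggregation is requested
totals = grouped['sales'].sum()
print(totals['North'])  # 220
```

The same `grouped` object can be reused for several different aggregations without re-splitting the data.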

Aggregation Operations

Aggregation reduces each group to a single row. Common aggregation functions include sum(), mean(), count(), min(), max(), and std().

# Single column aggregation
region_sales = df.groupby('region')['sales'].sum()
print(region_sales)
# region
# East     380
# North    360
# South    440

# Multiple aggregations on one column
sales_stats = df.groupby('region')['sales'].agg(['sum', 'mean', 'count'])
print(sales_stats)
#         sum        mean  count
# region                        
# East    380  190.000000      2
# North   360  120.000000      3
# South   440  146.666667      3

# Different aggregations for different columns
multi_agg = df.groupby('region').agg({
    'sales': ['sum', 'mean'],
    'units': ['sum', 'max']
})
print(multi_agg)

Named aggregations provide cleaner column names and improve code readability:

result = df.groupby('region').agg(
    total_sales=('sales', 'sum'),
    avg_sales=('sales', 'mean'),
    total_units=('units', 'sum'),
    max_units=('units', 'max')
)
print(result)
#         total_sales   avg_sales  total_units  max_units
# region                                                 
# East            380  190.000000           38         20
# North           360  120.000000           36         14
# South           440  146.666667           44         16

Multi-Level Grouping

Group by multiple columns to create hierarchical indexes for more granular analysis:

# Group by region and product
multi_group = df.groupby(['region', 'product'])['sales'].sum()
print(multi_group)
# region  product
# East    A          180
#         B          200
# North   A          220
#         B          140
# South   A          130
#         B          310

# Reset index for flat DataFrame
flat_result = multi_group.reset_index()
print(flat_result)
#   region product  sales
# 0   East       A    180
# 1   East       B    200
# 2  North       A    220
# 3  North       B    140
# 4  South       A    130
# 5  South       B    310

# Access specific group
north_data = df.groupby('region').get_group('North')
print(north_data)

Transformation Operations

Transformations return an object with the same shape as the input, making them useful for operations like standardization or filling missing values with group-specific statistics:

# Calculate z-scores within each region
df['sales_zscore'] = df.groupby('region')['sales'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Fill missing values with group mean
df_with_nulls = df.copy()
df_with_nulls.loc[2, 'sales'] = np.nan

df_with_nulls['sales_filled'] = df_with_nulls.groupby('region')['sales'].transform(
    lambda x: x.fillna(x.mean())
)

# Calculate percentage of group total
df['pct_of_region'] = df.groupby('region')['sales'].transform(
    lambda x: x / x.sum() * 100
)
print(df[['region', 'sales', 'pct_of_region']])

Filtration Operations

Filter groups based on group-level conditions rather than row-level conditions:

# Keep only regions with total sales > 400
high_sales_regions = df.groupby('region').filter(lambda x: x['sales'].sum() > 400)
print(high_sales_regions)

# Keep groups with more than 2 records
large_groups = df.groupby('region').filter(lambda x: len(x) > 2)

# Combine conditions
filtered = df.groupby('region').filter(
    lambda x: (x['sales'].sum() > 350) and (x['sales'].mean() > 120)
)

Custom Aggregation Functions

Define custom functions for complex aggregations that built-in methods don’t cover:

def sales_range(series):
    return series.max() - series.min()

def weighted_avg(group):
    return np.average(group['sales'], weights=group['units'])

# Apply custom function
custom_agg = df.groupby('region').agg({
    'sales': [sales_range, 'sum'],
    'units': 'sum'
})

# Apply function to entire group
# (on pandas >= 2.2, pass include_groups=False to silence the
# deprecation warning about apply operating on grouping columns)
weighted_avgs = df.groupby('region').apply(weighted_avg)
print(weighted_avgs)

# Multiple custom calculations
def group_stats(group):
    return pd.Series({
        'total_revenue': group['sales'].sum(),
        'avg_price': group['sales'].sum() / group['units'].sum(),
        'record_count': len(group)
    })

detailed_stats = df.groupby('region').apply(group_stats)
print(detailed_stats)

Iterating Over Groups

Direct iteration provides maximum flexibility for complex operations:

for name, group in df.groupby('region'):
    print(f"\n{name}:")
    print(group)
    print(f"Total sales: {group['sales'].sum()}")

# Store groups in dictionary
groups_dict = {name: group for name, group in df.groupby('region')}
north_df = groups_dict['North']

# Process groups conditionally
results = []
for region, group in df.groupby('region'):
    if group['sales'].mean() > 140:
        results.append({
            'region': region,
            'top_product': group.loc[group['sales'].idxmax(), 'product'],
            'top_sales': group['sales'].max()
        })

results_df = pd.DataFrame(results)

Performance Optimization

GroupBy operations can be memory-intensive. Use these techniques for better performance:

# Use categorical data for grouping columns
df['region'] = df['region'].astype('category')
df['product'] = df['product'].astype('category')

# Avoid apply() when possible - use built-in methods
# Slow
slow_result = df.groupby('region').apply(lambda x: x['sales'].sum())

# Fast
fast_result = df.groupby('region')['sales'].sum()

# Use observed=True for categorical grouping to exclude unobserved categories
df.groupby('region', observed=True)['sales'].sum()

# Group keys are sorted in the output by default; pass sort=False
# to skip that sort when group order doesn't matter
df.groupby('region', sort=False)['sales'].sum()

Handling Edge Cases

A few situations deserve explicit handling: empty results, single-group inputs, and missing values in the grouping column.

# Empty groups
empty_df = df[df['sales'] > 1000]  # No matches
result = empty_df.groupby('region')['sales'].sum()  # Returns empty Series

# Single group
single_region = df[df['region'] == 'North'].groupby('region')['sales'].mean()

# Missing values in grouping column
df_with_na = df.copy()
df_with_na.loc[0, 'region'] = np.nan
# By default, NaN values are excluded from groups
result = df_with_na.groupby('region', dropna=True)['sales'].sum()
# Include NaN as a group
result_with_na = df_with_na.groupby('region', dropna=False)['sales'].sum()

GroupBy operations form the backbone of data analysis workflows in Pandas. Master these patterns—aggregation for summaries, transformation for feature engineering, and filtration for subset selection—and you’ll handle most analytical tasks efficiently. Always profile your code when working with large datasets, as the choice between apply(), built-in methods, and custom functions significantly impacts performance.
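One rough way to do that profiling is with `timeit`. A sketch on synthetic data (the absolute numbers will vary by machine and pandas version; the point is the relative comparison between `apply()` and the built-in aggregation):

```python
import timeit

import numpy as np
import pandas as pd

# A larger synthetic frame so the difference is measurable
rng = np.random.default_rng(0)
big = pd.DataFrame({
    'region': rng.choice(['North', 'South', 'East', 'West'], size=100_000),
    'sales': rng.integers(1, 500, size=100_000),
})

# Per-group sum via apply() with a Python-level lambda
t_apply = timeit.timeit(
    lambda: big.groupby('region').apply(lambda g: g['sales'].sum()),
    number=5,
)

# Same result via the vectorized built-in aggregation
t_builtin = timeit.timeit(
    lambda: big.groupby('region')['sales'].sum(),
    number=5,
)

print(f"apply:    {t_apply:.4f}s")
print(f"built-in: {t_builtin:.4f}s")
```

Both paths produce the same per-region totals; only the execution strategy differs, which is exactly why profiling, rather than guessing, should drive the choice.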
