How to GroupBy and Aggregate in Pandas

Key Insights

  • GroupBy follows the split-apply-combine paradigm: split your data into groups, apply a function to each group, then combine the results back into a single structure.
  • Use .agg() with dictionaries or named aggregations when you need different functions for different columns, and reserve .apply() for complex logic that built-in methods can’t handle.
  • Choose transform() over aggregate() when you need results broadcast back to the original DataFrame shape, such as calculating group-relative values or filling missing data with group statistics.
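All three insights above fit in a few lines. Here is a minimal sketch on a toy key/value frame (purely illustrative, separate from the sales data used in this article):

```python
import pandas as pd

# Toy frame, just to show the paradigm
toy = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 3.0, 10.0]})

# Split + apply + combine: aggregation collapses to one row per group
totals = toy.groupby('key')['val'].sum()
print(totals.tolist())  # [4.0, 10.0]

# transform() broadcasts the group result back to the original 3-row shape
toy['group_mean'] = toy.groupby('key')['val'].transform('mean')
print(toy['group_mean'].tolist())  # [2.0, 2.0, 10.0]
```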

Introduction to GroupBy Operations

Pandas GroupBy is one of the most powerful features for data analysis, yet many developers underutilize it or struggle with its syntax. At its core, GroupBy implements the split-apply-combine pattern: you split your data into groups based on some criteria, apply a function to each group independently, and combine the results into a new data structure.

You’ll reach for GroupBy whenever you need to answer questions like “What’s the average sales per region?” or “How many transactions did each customer make?” These are aggregation problems, and GroupBy handles them elegantly.

Let’s work with a realistic dataset throughout this article:

import pandas as pd
import numpy as np

# Sample sales data
df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East', 'West', 'East', 'West'],
    'product': ['A', 'A', 'B', 'B', 'A', 'B', 'B', 'A'],
    'salesperson': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie', 'Charlie', 'Alice', 'Bob'],
    'revenue': [1200, 1500, 800, 950, 1100, 1300, 750, 1400],
    'units': [10, 12, 8, 9, 11, 13, 7, 14],
    'returns': [1, 0, 2, 1, 0, 1, 1, 0]
})

Basic GroupBy Syntax

The fundamental GroupBy operation takes a column name (or list of columns) and returns a DataFrameGroupBy object. This object is lazy—it doesn’t compute anything until you call an aggregation method.

# Group by a single column
grouped = df.groupby('region')

# Apply aggregation
revenue_by_region = df.groupby('region')['revenue'].sum()
print(revenue_by_region)
# region
# East    3850
# West    5150
# Name: revenue, dtype: int64

# Multiple aggregations on the same column
revenue_stats = df.groupby('region')['revenue'].agg(['sum', 'mean', 'count'])
print(revenue_stats)
#         sum     mean  count
# region                      
# East   3850   962.50      4
# West   5150  1287.50      4

Grouping by multiple columns creates a hierarchical index:

# Group by multiple columns
by_region_product = df.groupby(['region', 'product'])['revenue'].sum()
print(by_region_product)
# region  product
# East    A          2300
#         B          1550
# West    A          2900
#         B          2250
# Name: revenue, dtype: int64

# Aggregate every numeric column after groupby
# (numeric_only=True skips the string 'salesperson' column, which would
# otherwise be concatenated by sum)
full_summary = df.groupby(['region', 'product']).sum(numeric_only=True)
print(full_summary)

You can also iterate over groups when you need to inspect or process them individually:

# Iterate over groups
for name, group in df.groupby('region'):
    print(f"Region: {name}")
    print(group)
    print()

# Access a specific group
east_data = df.groupby('region').get_group('East')

Built-in Aggregation Functions

Pandas provides optimized implementations for common aggregations. These are faster than equivalent custom functions and should be your first choice:

Function          Description
sum()             Sum of values
mean()            Arithmetic mean
median()          Median value
count()           Count of non-null values
size()            Group size (row count, including nulls)
min() / max()     Minimum / maximum
std() / var()     Sample standard deviation / variance
first() / last()  First / last non-null value
nunique()         Count of unique values

# Chain multiple aggregations
summary = df.groupby('region').agg({
    'revenue': ['sum', 'mean', 'std'],
    'units': ['sum', 'mean'],
    'returns': 'sum'
})
print(summary)
#        revenue                      units        returns
#            sum     mean         std   sum  mean     sum
# region                                                  
# East      3850   962.50  221.265301    36   9.0       4
# West      5150  1287.50  239.356777    48  12.0       2
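One distinction from the table above is worth verifying: count() skips nulls, while size() counts every row in the group. A quick check on a throwaway frame with one missing revenue value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West'],
    'revenue': [1200.0, 1500.0, np.nan, 950.0],
})

# count() skips the NaN revenue in East; size() still counts the row
print(df.groupby('region')['revenue'].count().tolist())  # [1, 2]
print(df.groupby('region').size().tolist())              # [2, 2]
```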

Custom Aggregations with agg()

The .agg() method becomes essential when you need different aggregations for different columns or want to rename the output columns.

The dictionary syntax lets you specify which function to apply to each column:

# Different functions for different columns
result = df.groupby('region').agg({
    'revenue': 'sum',
    'units': 'mean',
    'returns': 'max'
})
print(result)
#         revenue  units  returns
# region                         
# East       3850    9.0        2
# West       5150   12.0        1

For cleaner output with custom column names, use named aggregations:

# Named aggregations (pandas 0.25+)
result = df.groupby('region').agg(
    total_revenue=pd.NamedAgg(column='revenue', aggfunc='sum'),
    avg_units=pd.NamedAgg(column='units', aggfunc='mean'),
    total_returns=pd.NamedAgg(column='returns', aggfunc='sum'),
    transaction_count=pd.NamedAgg(column='revenue', aggfunc='count')
)
print(result)
#         total_revenue  avg_units  total_returns  transaction_count
# region                                                             
# East             3850        9.0              4                  4
# West             5150       12.0              2                  4

# Shorthand syntax (equivalent)
result = df.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_units=('units', 'mean'),
    total_returns=('returns', 'sum')
)

Applying Custom Functions

When built-in aggregations aren’t enough, you can pass custom functions to .agg() or use .apply() for more complex logic.

Lambda functions work well for simple calculations:

# Lambda for custom aggregation
result = df.groupby('region').agg({
    'revenue': lambda x: x.sum() / 1000,  # Revenue in thousands
    'units': lambda x: x.max() - x.min(),  # Range of units
    'returns': lambda x: (x > 0).sum()     # Count of rows with returns
})
print(result)

# Multiple custom functions
result = df.groupby('region')['revenue'].agg([
    ('total', 'sum'),
    ('average', 'mean'),
    ('range', lambda x: x.max() - x.min())
])

For complex logic involving multiple columns, use .apply() with a custom function. (In pandas 2.2+, pass include_groups=False to .apply() to silence the deprecation warning about operating on the grouping columns.)

# Custom function that uses multiple columns
def revenue_per_unit_stats(group):
    rpu = group['revenue'] / group['units']
    return pd.Series({
        'avg_revenue_per_unit': rpu.mean(),
        'min_revenue_per_unit': rpu.min(),
        'max_revenue_per_unit': rpu.max()
    })

result = df.groupby('region').apply(revenue_per_unit_stats)
print(result)

# Calculate return rate as percentage of units
def calculate_metrics(group):
    return pd.Series({
        'total_revenue': group['revenue'].sum(),
        'return_rate': (group['returns'].sum() / group['units'].sum()) * 100,
        'avg_order_value': group['revenue'].mean()
    })

metrics = df.groupby('region').apply(calculate_metrics)

Performance note: .apply() is slower than built-in methods because it can’t be vectorized. Use it only when necessary.
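To see the gap on your own machine, here is a rough benchmark sketch using timeit on synthetic data (timings are machine-dependent, so no numbers promised; the two calls produce identical results):

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 1M rows, 100 groups
rng = np.random.default_rng(0)
big = pd.DataFrame({
    'key': rng.integers(0, 100, 1_000_000),
    'val': rng.random(1_000_000),
})

slow = timeit.timeit(lambda: big.groupby('key')['val'].apply(lambda x: x.sum()), number=5)
fast = timeit.timeit(lambda: big.groupby('key')['val'].sum(), number=5)
print(f"apply: {slow:.3f}s  built-in: {fast:.3f}s")
```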

Transform vs Aggregate

This distinction trips up many developers. Aggregation reduces your data—you get one row per group. Transform returns data in the same shape as the input, broadcasting group-level results back to each row.
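The shape difference is easy to verify (re-creating a slice of the sample frame here so the snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East', 'West', 'East', 'West'],
    'revenue': [1200, 1500, 800, 950, 1100, 1300, 750, 1400],
})

agg_result = df.groupby('region')['revenue'].mean()             # one row per group
tra_result = df.groupby('region')['revenue'].transform('mean')  # one row per input row

print(len(agg_result), len(tra_result))  # 2 8
```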

Use transform() when you need to add group statistics back to your original DataFrame:

# Add group mean as a new column
df['region_avg_revenue'] = df.groupby('region')['revenue'].transform('mean')
print(df[['region', 'revenue', 'region_avg_revenue']])
#   region  revenue  region_avg_revenue
# 0   East     1200              962.50
# 1   West     1500             1287.50
# 2   East      800              962.50
# ...

# Calculate percentage of group total
df['pct_of_region'] = df['revenue'] / df.groupby('region')['revenue'].transform('sum') * 100

# Normalize within groups (z-score)
df['revenue_zscore'] = df.groupby('region')['revenue'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Fill missing values with group mean
df_with_nulls = df.copy()
# Cast to float first: assigning NaN into an int64 column is deprecated in pandas 2.x
df_with_nulls['revenue'] = df_with_nulls['revenue'].astype('float64')
df_with_nulls.loc[0, 'revenue'] = np.nan
df_with_nulls['revenue_filled'] = df_with_nulls.groupby('region')['revenue'].transform(
    lambda x: x.fillna(x.mean())
)

Practical Tips and Performance

Reset the index when you want a flat DataFrame instead of a hierarchical index:

# Grouped result with MultiIndex
result = df.groupby(['region', 'product'])['revenue'].sum()
print(result.index)  # MultiIndex

# Reset to flat DataFrame
result_flat = df.groupby(['region', 'product'])['revenue'].sum().reset_index()
# Or use as_index=False
result_flat = df.groupby(['region', 'product'], as_index=False)['revenue'].sum()

Handle NaN values explicitly:

# By default, groupby excludes NaN keys
df_with_nan = df.copy()
df_with_nan.loc[0, 'region'] = np.nan

# Include NaN as a group (pandas 1.1+)
result = df_with_nan.groupby('region', dropna=False)['revenue'].sum()

Performance considerations for large datasets:

# Use categorical dtype for groupby columns
df['region'] = df['region'].astype('category')
df['product'] = df['product'].astype('category')

# Avoid apply() when possible - use built-in methods
# Slow:
df.groupby('region')['revenue'].apply(lambda x: x.sum())
# Fast:
df.groupby('region')['revenue'].sum()

# Use observed=True with categoricals to skip empty combinations
df.groupby(['region', 'product'], observed=True)['revenue'].sum()

# For very large datasets, consider sort=False if order doesn't matter
df.groupby('region', sort=False)['revenue'].sum()

GroupBy operations are fundamental to data analysis in Pandas. Master the distinction between agg(), apply(), and transform(), prefer built-in aggregation functions for performance, and use named aggregations for readable code. These patterns will handle the vast majority of your aggregation needs.
