How to GroupBy and Apply Custom Function in Pandas

Key Insights

  • Use apply() when you need access to the entire group DataFrame and want flexible output shapes, but prefer transform() when you need results aligned back to the original index
  • Custom aggregation functions in agg() receive a Series and must return a scalar, making them ideal for statistical calculations that aren’t built into pandas
  • Vectorized operations within custom functions can be 10-100x faster than row-by-row iteration—always operate on the group as a whole rather than looping through rows

Pandas GroupBy is one of the most powerful features for data analysis, but the real magic happens when you move beyond built-in aggregations like sum() and mean(). Custom functions let you implement business-specific logic, complex statistical calculations, and domain-driven transformations that no library could anticipate. This article shows you exactly how to write and apply these functions effectively.

GroupBy Basics Refresher

Before diving into custom functions, let’s establish the foundation. The GroupBy operation splits your data into groups based on one or more columns, applies a function to each group, and combines the results.

import pandas as pd
import numpy as np

# Sample sales data
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['A', 'B', 'A', 'B', 'A', 'A'],
    'revenue': [100, 150, 200, 120, 180, 90],
    'units': [10, 15, 25, 12, 20, 8]
})

# Standard aggregations
summary = df.groupby('region').agg({
    'revenue': ['sum', 'mean'],
    'units': ['sum', 'count']
})
print(summary)

This handles common cases, but what happens when you need weighted averages, custom statistical measures, or business rules that don’t fit into standard functions? That’s where custom functions come in.

Using apply() with Custom Functions

The apply() method is your most flexible tool. It passes the entire group DataFrame (or Series) to your function, giving you complete control over the computation.

Here’s a practical example: calculating weighted average price per region, where each sale is weighted by units sold.

def weighted_average_price(group):
    """Calculate revenue-weighted average price per unit."""
    total_revenue = group['revenue'].sum()
    total_units = group['units'].sum()
    weighted_avg = total_revenue / total_units
    
    return pd.Series({
        'weighted_avg_price': weighted_avg,
        'total_revenue': total_revenue,
        'total_units': total_units
    })

result = df.groupby('region').apply(weighted_average_price)
print(result)

Your function receives a DataFrame containing all rows for that group. You can perform any calculation, access multiple columns, and return either a scalar, Series, or DataFrame.
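When your function returns a scalar, apply() collapses each group to a single value and the result is a Series indexed by the group keys. A minimal sketch on the same sample data (selecting only the columns the function needs, which also keeps the grouping column out of the group):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['A', 'B', 'A', 'B', 'A', 'A'],
    'revenue': [100, 150, 200, 120, 180, 90],
    'units': [10, 15, 25, 12, 20, 8]
})

# Returning a scalar: one value per group, so the result is a Series
def revenue_per_unit(group):
    return group['revenue'].sum() / group['units'].sum()

result = df.groupby('region')[['revenue', 'units']].apply(revenue_per_unit)
print(result)
```

Compare this with the weighted_average_price example above, which returns a Series per group and therefore produces a DataFrame with one column per Series entry.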

A more complex example—finding the best-performing product in each region:

def best_product(group):
    """Return the row with highest revenue in the group."""
    best_idx = group['revenue'].idxmax()
    return group.loc[best_idx]

best_by_region = df.groupby('region').apply(best_product)
print(best_by_region)

Important note: Starting with pandas 2.2, apply() emits a DeprecationWarning when the grouping columns are included in the data passed to your function. Pass include_groups=False when your function doesn’t need them; excluding the grouping columns becomes the default behavior in pandas 3.0.

Using transform() for Same-Shape Output

While apply() can return any shape, transform() specifically returns data aligned to the original DataFrame’s index. This is essential when you want to add group-level calculations back to your original rows.

# Normalize revenue within each region (z-score normalization)
def normalize_within_group(series):
    return (series - series.mean()) / series.std()

df['revenue_normalized'] = df.groupby('region')['revenue'].transform(normalize_within_group)
print(df)

Each row now has a normalized revenue score relative to its region. The North region’s values are compared only to other North values.

Another common use case—calculating each row’s percentage of group total:

df['pct_of_region'] = df.groupby('region')['revenue'].transform(
    lambda x: x / x.sum() * 100
)
print(df[['region', 'revenue', 'pct_of_region']])

The key difference: transform() must return a result with the same length as the input. Use it when you need to broadcast group statistics back to individual rows.
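Broadcasting a group statistic is often as simple as passing a built-in function name to transform(), which avoids a Python-level function entirely and stays fast. A small sketch of the pattern:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'revenue': [100, 150, 200, 120, 180, 90],
})

# Each row gets its region's mean revenue, aligned to the original index
df['region_avg'] = df.groupby('region')['revenue'].transform('mean')

# Group statistics broadcast back to rows enable row-level comparisons
df['vs_region_avg'] = df['revenue'] - df['region_avg']
print(df)
```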

Using agg() with Custom Aggregations

The agg() method is optimized for aggregation functions—those that take a Series and return a single scalar value. It’s more efficient than apply() for this specific use case and supports applying multiple functions simultaneously.

def coefficient_of_variation(series):
    """Calculate CV: standard deviation relative to mean."""
    return series.std() / series.mean() * 100

def range_spread(series):
    """Calculate the range of values."""
    return series.max() - series.min()

# Apply multiple custom aggregations
result = df.groupby('region')['revenue'].agg([
    ('cv', coefficient_of_variation),
    ('spread', range_spread),
    ('median', 'median')  # Can mix custom and built-in
])
print(result)

For applying different functions to different columns, use a dictionary:

result = df.groupby('region').agg({
    'revenue': [('total', 'sum'), ('cv', coefficient_of_variation)],
    'units': [('avg', 'mean'), ('spread', range_spread)]
})
print(result)

Named aggregations provide cleaner syntax and flatter column names:

result = df.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    revenue_cv=('revenue', coefficient_of_variation),
    avg_units=('units', 'mean'),
    unit_spread=('units', range_spread)
)
print(result)

Lambda Functions for Quick Operations

For simple, one-off calculations, lambda functions keep your code concise. They’re particularly useful for operations that don’t warrant a named function.

Getting the top 2 products by revenue in each region:

top_2_per_region = df.groupby('region').apply(
    lambda g: g.nlargest(2, 'revenue'),
    include_groups=False
)
print(top_2_per_region)

Calculating interquartile range:

iqr_by_region = df.groupby('region')['revenue'].agg(
    lambda x: x.quantile(0.75) - x.quantile(0.25)
)
print(iqr_by_region)

Filtering groups based on a condition—keeping only regions with more than 2 sales:

filtered = df.groupby('region').filter(lambda g: len(g) > 2)
print(filtered)

Word of caution: Lambda functions are convenient but can become unreadable. If your lambda spans multiple lines or requires explanation, extract it into a named function.
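For instance, the IQR lambda above is borderline; giving it a name documents intent and makes it reusable across agg() calls:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'revenue': [100, 150, 200, 120, 180, 90],
})

def iqr(series):
    """Interquartile range: spread of the middle 50% of values."""
    return series.quantile(0.75) - series.quantile(0.25)

iqr_by_region = df.groupby('region')['revenue'].agg(iqr)
print(iqr_by_region)
```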

Performance Considerations

Custom functions in GroupBy can become performance bottlenecks, especially with large datasets. Here’s how to keep them fast.

The slow way—iterating within your function:

# DON'T DO THIS
def slow_custom_calc(group):
    total = 0
    for idx, row in group.iterrows():
        total += row['revenue'] * row['units']
    return total

# This is painfully slow on large datasets
# %timeit df.groupby('region').apply(slow_custom_calc)

The fast way—vectorized operations:

# DO THIS INSTEAD
def fast_custom_calc(group):
    return (group['revenue'] * group['units']).sum()

# 10-100x faster
# %timeit df.groupby('region').apply(fast_custom_calc)

The difference becomes dramatic with larger datasets:

# Create larger dataset for benchmarking
large_df = pd.DataFrame({
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100000),
    'revenue': np.random.uniform(50, 500, 100000),
    'units': np.random.randint(1, 50, 100000)
})

# Vectorized approach
def vectorized_weighted_avg(group):
    return (group['revenue'] * group['units']).sum() / group['units'].sum()

result = large_df.groupby('region').apply(vectorized_weighted_avg, include_groups=False)
print(result)

Other performance tips:

  1. Use built-in aggregations when possible. They’re implemented in Cython and significantly faster than Python functions.

  2. Prefer agg() over apply() for aggregations. The agg() method has optimizations that apply() doesn’t.

  3. Consider numba for complex calculations. If you have computationally intensive custom functions, numba’s JIT compilation can provide substantial speedups.

  4. Avoid returning DataFrames from apply() when a Series suffices. DataFrame construction has overhead.

# If you only need one value per group, return a scalar or Series
def efficient_summary(group):
    return pd.Series({
        'metric1': group['revenue'].sum(),
        'metric2': group['units'].mean()
    })

# Not a DataFrame with one row
def inefficient_summary(group):
    return pd.DataFrame({
        'metric1': [group['revenue'].sum()],
        'metric2': [group['units'].mean()]
    })
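Tips 1 and 2 are easy to check on the sample data: all three spellings below produce identical results, but only the string name dispatches to pandas’ optimized Cython path. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'revenue': [100, 150, 200, 120, 180, 90],
})

# Three equivalent results; the string name uses the fast built-in path
via_builtin = df.groupby('region')['revenue'].agg('sum')
via_custom = df.groupby('region')['revenue'].agg(lambda x: x.sum())
via_apply = df.groupby('region')['revenue'].apply(lambda x: x.sum())

# All three agree; prefer the built-in on large data
print(via_builtin.equals(via_custom) and via_builtin.equals(via_apply))
```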

Custom functions in pandas GroupBy operations unlock analytical possibilities that built-in methods can’t touch. Master apply() for flexibility, transform() for index-aligned results, and agg() for efficient aggregations. Keep your functions vectorized, and you’ll handle datasets of any size without breaking a sweat.
