How to Use GroupBy in Pandas

Key Insights

  • GroupBy implements the split-apply-combine pattern: split your data into groups, apply a function to each group, then combine results back into a single structure
  • Use .agg() for complex aggregations—it lets you apply different functions to different columns and even multiple functions to the same column in one operation
  • Convert grouping columns to categorical dtype before grouping large datasets; this single change can cut memory usage by 80% or more and speed up operations significantly

Introduction to GroupBy

Pandas GroupBy is one of those features that separates beginners from practitioners. Once you internalize it, you’ll find yourself reaching for it constantly—summarizing sales by region, calculating averages by category, normalizing data within groups.

The underlying concept is called split-apply-combine. You split your DataFrame into groups based on some criteria, apply a function to each group independently, then combine the results. This pattern appears everywhere in data analysis, and pandas makes it remarkably efficient.

When should you use groupby? Any time you’re asking questions like “what’s the average X per Y?” or “how does Z break down by category?” If you’re writing loops to filter data and calculate statistics, there’s almost certainly a groupby solution that’s cleaner and faster.
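To make that contrast concrete, here is a minimal sketch (on a made-up two-column frame) of the loop you might write by hand next to the one-line groupby that replaces it:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B'],
    'value': [10, 20, 30, 40],
})

# The loop version: filter each category by hand, then average
loop_result = {}
for cat in df['category'].unique():
    loop_result[cat] = df.loc[df['category'] == cat, 'value'].mean()

# The groupby version: one line, same numbers
gb_result = df.groupby('category')['value'].mean()

print(loop_result)          # {'A': 20.0, 'B': 30.0}
print(gb_result.to_dict())  # {'A': 20.0, 'B': 30.0}
```

Beyond brevity, the groupby version scans the data once instead of once per category, which is where the speed difference comes from on real datasets.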

Basic GroupBy Syntax

The simplest groupby operation takes a column name and applies an aggregation function:

import pandas as pd

# Sample sales data
df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South', 'North', 'South'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget', 'Widget', 'Gadget'],
    'revenue': [100, 150, 200, 180, 120, 220],
    'units': [10, 15, 8, 9, 12, 11]
})

# Group by region, sum the numeric columns
revenue_by_region = df.groupby('region').sum(numeric_only=True)
print(revenue_by_region)

Output:

        revenue  units
region                
North       420     30
South       550     35

Notice that pandas sums every numeric column; numeric_only=True tells it to skip the product strings (without it, recent pandas versions would "sum" them by concatenation). The grouping column becomes the index. If you want it as a regular column, add as_index=False:

revenue_by_region = df.groupby('region', as_index=False).sum(numeric_only=True)

Grouping by multiple columns works the same way—pass a list:

# Group by both region and product
breakdown = df.groupby(['region', 'product']).mean()
print(breakdown)

Output:

                 revenue  units
region product                 
North  Gadget      200.0    8.0
       Widget      110.0   11.0
South  Gadget      200.0   10.0
       Widget      150.0   15.0

This creates a hierarchical index (MultiIndex). To flatten it:

breakdown = df.groupby(['region', 'product'], as_index=False).mean()

Common Aggregation Functions

Pandas provides a solid set of built-in aggregations. Here’s what you’ll use most often:

Function          Description
sum()             Sum of values
mean()            Average
count()           Number of non-null values
size()            Number of rows (includes nulls)
min() / max()     Minimum / maximum
std() / var()     Standard deviation / variance
first() / last()  First / last value in group
nunique()         Number of unique values
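The count()/size() distinction is the one that most often surprises people; a small sketch with a deliberately missing value (hypothetical team/score frame):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'team': ['X', 'X', 'Y', 'Y'],
    'score': [1.0, np.nan, 3.0, 4.0],
})

g = df.groupby('team')['score']
print(g.count())  # X: 1, Y: 2 -- the NaN is excluded
print(g.size())   # X: 2, Y: 2 -- every row counts
```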

Chain them directly after groupby:

# Multiple aggregations on a larger dataset
sales = pd.DataFrame({
    'store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
    'revenue': [1000, 1200, 1100, 800, 850, 900, 1500, 1400, 1600],
    'transactions': [50, 60, 55, 40, 42, 45, 75, 70, 80]
})

# Quick summary by store
summary = sales.groupby('store').agg({
    'revenue': ['sum', 'mean'],
    'transactions': ['sum', 'mean', 'max']
})
print(summary)

Output:

      revenue         transactions              
          sum    mean          sum       mean max
store                                           
A        3300  1100.0          165  55.000000  60
B        2550   850.0          127  42.333333  45
C        4500  1500.0          225  75.000000  80

Custom Aggregations with agg()

The .agg() method is where groupby gets powerful. You can apply different functions to different columns, use multiple functions per column, and define custom aggregations.

# Named aggregations (cleaner column names)
result = sales.groupby('store').agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean'),
    total_transactions=('transactions', 'sum'),
    peak_transactions=('transactions', 'max')
)
print(result)

Output:

       total_revenue  avg_revenue  total_transactions  peak_transactions
store                                                                    
A               3300       1100.0                 165                 60
B               2550        850.0                 127                 45
C               4500       1500.0                 225                 80

For custom logic, use lambda functions or define your own:

# Custom aggregation: range (max - min)
def revenue_range(x):
    return x.max() - x.min()

result = sales.groupby('store').agg(
    revenue_range=('revenue', revenue_range),
    revenue_cv=('revenue', lambda x: x.std() / x.mean()),  # coefficient of variation
    transaction_median=('transactions', 'median')
)
print(result)

One gotcha: lambda functions are slower than built-in aggregations because pandas can’t optimize them. For large datasets, prefer built-in functions when possible.
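One way to sidestep that cost is to compute the pieces with built-in aggregations and combine them afterwards. A sketch of the coefficient-of-variation example done that way (reusing a trimmed-down sales frame):

```python
import pandas as pd

# Trimmed-down version of the sales frame from above
sales = pd.DataFrame({
    'store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'revenue': [1000, 1200, 1100, 800, 850, 900, 1500, 1400, 1600],
})

# 'std' and 'mean' run in optimized code paths; dividing the two
# resulting columns afterwards avoids a per-group Python lambda
stats = sales.groupby('store')['revenue'].agg(['std', 'mean'])
stats['cv'] = stats['std'] / stats['mean']
print(stats['cv'])
```

The result matches the lambda version exactly; the difference only shows up as speed once the group count gets large.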

Transform and Filter Operations

Sometimes you don’t want aggregated results—you want values mapped back to the original rows. That’s what transform() does.

# Add a column showing each store's average revenue
sales['store_avg_revenue'] = sales.groupby('store')['revenue'].transform('mean')

# Normalize revenue within each store (z-score)
sales['revenue_zscore'] = sales.groupby('store')['revenue'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Calculate percentage of store total
sales['pct_of_store'] = sales.groupby('store')['revenue'].transform(
    lambda x: x / x.sum() * 100
)
print(sales)

Transform is invaluable for feature engineering—creating group-level statistics that you need at the row level.

The filter() method removes entire groups based on a condition:

# Keep only stores with total revenue > 3000
high_performers = sales.groupby('store').filter(lambda x: x['revenue'].sum() > 3000)
print(high_performers)

This keeps all rows from stores A and C, but removes all rows from store B.
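When performance matters, the same filtering can usually be done with transform() and a boolean mask, which stays in optimized code paths instead of calling a Python lambda once per group. A sketch on a small stand-in frame:

```python
import pandas as pd

sales = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B', 'C', 'C'],
    'revenue': [1600, 1700, 1200, 1300, 2200, 2300],
})

# transform('sum') broadcasts each store's total back to its rows,
# so a plain boolean mask does the same job as filter()
totals = sales.groupby('store')['revenue'].transform('sum')
high_performers = sales[totals > 3000]
print(high_performers['store'].unique())  # ['A' 'C']
```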

Iterating Over Groups

Sometimes you need to process each group individually. GroupBy objects are iterable:

for store_name, store_data in sales.groupby('store'):
    print(f"\n=== Store {store_name} ===")
    print(f"Total Revenue: ${store_data['revenue'].sum():,}")
    print(f"Months: {store_data['month'].tolist()}")

You can also access specific groups directly:

# Get a specific group
store_a = sales.groupby('store').get_group('A')
print(store_a)

For more complex scenarios, use apply() with a custom function:

def analyze_store(group):
    return pd.Series({
        'revenue_trend': 'up' if group['revenue'].iloc[-1] > group['revenue'].iloc[0] else 'down',
        'best_month': group.loc[group['revenue'].idxmax(), 'month'],
        'consistency': 'high' if group['revenue'].std() < 100 else 'low'
    })

# On pandas 2.2+, passing include_groups=False to apply() silences a
# deprecation warning about operating on the grouping column
analysis = sales.groupby('store').apply(analyze_store)
print(analysis)

Performance Tips and Common Pitfalls

GroupBy performance matters when you’re dealing with millions of rows. Here are the optimizations that actually make a difference.

Convert to categorical dtype. This is the single biggest win for repeated groupby operations:

import numpy as np

# Create a larger dataset for benchmarking
np.random.seed(42)
large_df = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], size=1_000_000),
    'subcategory': np.random.choice(['X', 'Y', 'Z'], size=1_000_000),
    'value': np.random.randn(1_000_000)
})

# Before: string columns
print(f"Memory usage (strings): {large_df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

# After: categorical columns
large_df['category'] = large_df['category'].astype('category')
large_df['subcategory'] = large_df['subcategory'].astype('category')
print(f"Memory usage (categorical): {large_df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

# Groupby is now significantly faster
result = large_df.groupby(['category', 'subcategory'])['value'].mean()

On my machine, this reduces memory from ~70MB to ~10MB and speeds up groupby operations by 3-4x.

Avoid chained operations when possible. Each groupby creates overhead:

# Slower: multiple groupby calls
mean_vals = large_df.groupby('category')['value'].mean()
sum_vals = large_df.groupby('category')['value'].sum()

# Faster: single groupby with multiple aggregations
result = large_df.groupby('category')['value'].agg(['mean', 'sum'])

Use observed=True with categorical columns. By default, groupby includes all category levels, even unused ones:

large_df['category'] = pd.Categorical(large_df['category'], categories=['A', 'B', 'C', 'D', 'E', 'F'])
# This creates a group for F even though no row contains it
result = large_df.groupby('category')['value'].sum()

# This only includes categories that actually appear
result = large_df.groupby('category', observed=True)['value'].sum()
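A quick way to see what observed changes, on a toy frame with one unused category level (hypothetical grade/score names):

```python
import pandas as pd

df = pd.DataFrame({
    'grade': pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']),
    'score': [1, 2, 3],
})

# observed=False (the historical default) keeps the empty 'c' group
all_levels = df.groupby('grade', observed=False)['score'].sum()
print(len(all_levels))  # 3 groups: a, b, c

# observed=True drops levels with no rows
seen_only = df.groupby('grade', observed=True)['score'].sum()
print(len(seen_only))   # 2 groups: a, b
```

With many categorical columns in one groupby, the unobserved combinations multiply, so observed=True can be the difference between a small result and an enormous, mostly empty one.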

Reset index judiciously. If you’re doing further operations, keeping the grouped index can be more efficient than resetting it repeatedly.

Watch out for the SettingWithCopyWarning. When assigning transform results back to the DataFrame, make sure you’re working with the original DataFrame, not a view:

# Potentially problematic: chained indexing assigns to a temporary copy
large_df[large_df['category'] == 'A']['normalized'] = large_df.groupby('category')['value'].transform('mean')

# Safe: a single .loc assignment on the original DataFrame
large_df.loc[large_df['category'] == 'A', 'normalized'] = large_df.groupby('category')['value'].transform('mean')

GroupBy is foundational to pandas data analysis. Master these patterns—basic aggregation, custom agg functions, transform for row-level results, and the performance optimizations—and you’ll handle most data summarization tasks efficiently. The key is recognizing when a problem fits the split-apply-combine pattern, which, once you start looking, is surprisingly often.
