Pandas: GroupBy with DataFrames
Key Insights
- GroupBy follows the split-apply-combine pattern: split data into groups, apply a function to each group, then combine results back into a single structure
- Use agg() for reducing groups to single values, transform() for broadcasting results back to the original shape, and apply() for complex operations that don’t fit either pattern
- Converting groupby columns to categorical dtype before grouping can dramatically improve performance on large datasets with repeated values
Introduction to GroupBy Operations
The GroupBy operation is one of the most powerful features in pandas, yet many developers underutilize it or misuse it entirely. At its core, GroupBy implements the split-apply-combine paradigm: you split your data into groups based on some criteria, apply a function to each group independently, then combine the results into a new data structure.
This pattern appears constantly in real-world data analysis. You need sales totals by region. Average response times by endpoint. Customer counts by signup month. Any time you’re asking “what’s the X for each Y,” you’re thinking in GroupBy terms.
The alternative—writing loops to filter and aggregate manually—is slower, more error-prone, and harder to read. Master GroupBy and you’ll write cleaner, faster pandas code.
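To make that contrast concrete, here is a minimal sketch (on a small toy frame, not the article's full dataset) of the manual loop next to its one-line GroupBy equivalent:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'sales': [250, 150, 300, 275]
})

# Manual approach: loop over unique keys, filter, aggregate
totals = {}
for region in df['region'].unique():
    totals[region] = df.loc[df['region'] == region, 'sales'].sum()

# GroupBy approach: one expression, same result
grouped = df.groupby('region')['sales'].sum()
```

The loop scans the full frame once per key, while GroupBy does a single pass; the readability gap only widens as the aggregation logic grows.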
Basic GroupBy Syntax and Single-Column Grouping
Let’s start with a practical example. You have sales data and need to analyze it by region.
import pandas as pd
import numpy as np
# Sample sales data
df = pd.DataFrame({
'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North', 'South'],
'product': ['Widget', 'Widget', 'Gadget', 'Widget', 'Gadget', 'Gadget', 'Widget', 'Widget'],
'sales': [250, 150, 300, 275, 225, 180, 320, 195],
'quantity': [10, 8, 12, 11, 9, 7, 13, 10]
})
# Basic groupby with single aggregation
region_totals = df.groupby('region')['sales'].sum()
print(region_totals)
# region
# East 455
# North 870
# South 570
# Name: sales, dtype: int64
# Multiple aggregations on single column
region_stats = df.groupby('region')['sales'].agg(['sum', 'mean', 'count'])
print(region_stats)
# sum mean count
# region
# East 455 227.50 2
# North 870 290.00 3
# South 570 190.00 3
Notice that groupby() returns a GroupBy object, not a DataFrame. The aggregation happens when you call a method like sum() or agg(). This lazy evaluation means pandas doesn’t do unnecessary work if you chain operations.
The result has the groupby column as the index by default. If you want it as a regular column, pass as_index=False or call reset_index() afterward.
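For instance, with a small toy frame, the two approaches yield the same flat table:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'sales': [250, 150, 300, 275]
})

# Default: 'region' becomes the index of the result
by_region = df.groupby('region')['sales'].sum()

# Option 1: keep 'region' as a regular column
flat = df.groupby('region', as_index=False)['sales'].sum()

# Option 2: aggregate first, then flatten the index
flat2 = df.groupby('region')['sales'].sum().reset_index()
```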
Multi-Column Grouping
Real analysis often requires grouping by multiple dimensions simultaneously. Pass a list of column names to create hierarchical groups.
# Group by region AND product
multi_group = df.groupby(['region', 'product'])['sales'].sum()
print(multi_group)
# region product
# East Gadget 180
# Widget 275
# North Gadget 300
# Widget 570
# South Gadget 225
# Widget 345
# Name: sales, dtype: int64
# Get a DataFrame instead of Series
multi_group_df = df.groupby(['region', 'product'], as_index=False)['sales'].sum()
print(multi_group_df)
# region product sales
# 0 East Gadget 180
# 1 East Widget 275
# 2 North Gadget 300
# 3 North Widget 570
# 4 South Gadget 225
# 5 South Widget 345
# Unstack to create a pivot-like view
pivoted = df.groupby(['region', 'product'])['sales'].sum().unstack(fill_value=0)
print(pivoted)
# product Gadget Widget
# region
# East 180 275
# North 300 570
# South 225 345
The multi-index result can be powerful for hierarchical analysis, but it’s often easier to work with a flat DataFrame. Use as_index=False when you need to continue manipulating the results.
Aggregation Functions: agg, transform, and apply
Understanding the difference between these three methods is crucial. They serve fundamentally different purposes.
agg() reduces each group to a single row. Use it when you want summary statistics.
# Multiple aggregations on multiple columns
summary = df.groupby('region').agg({
'sales': ['sum', 'mean', 'std'],
'quantity': ['sum', 'mean']
})
print(summary)
# sales quantity
# sum mean std sum mean
# region
# East 455 227.50 67.175144 18 9.000000
# North 870 290.00 36.055513 35 11.666667
# South 570 190.00 37.749172 27 9.000000
transform() returns a result with the same shape as the input. Each row gets the aggregated value of its group. This is perfect for adding group-level statistics back to individual rows.
# Add group mean to each row
df['region_avg_sales'] = df.groupby('region')['sales'].transform('mean')
print(df[['region', 'sales', 'region_avg_sales']])
# region sales region_avg_sales
# 0 North 250 290.00
# 1 South 150 190.00
# 2 North 300 290.00
# 3 East 275 227.50
# 4 South 225 190.00
# 5 East 180 227.50
# 6 North 320 290.00
# 7 South 195 190.00
# Calculate percentage of group total
df['pct_of_region'] = df['sales'] / df.groupby('region')['sales'].transform('sum') * 100
apply() is the most flexible but slowest. Use it when you need operations that don’t fit agg() or transform().
# Custom function that returns multiple rows per group
def top_n_sales(group, n=1):
return group.nlargest(n, 'sales')
top_by_region = df.groupby('region').apply(top_n_sales, include_groups=False)  # include_groups requires pandas >= 2.2
print(top_by_region)
Named Aggregations and Custom Functions
Pandas 0.25 introduced named aggregations, which produce cleaner column names and allow mixing built-in and custom functions.
# Named aggregation syntax
result = df.groupby('region').agg(
total_sales=('sales', 'sum'),
avg_sales=('sales', 'mean'),
total_quantity=('quantity', 'sum'),
transaction_count=('sales', 'count')
)
print(result)
# total_sales avg_sales total_quantity transaction_count
# region
# East 455 227.50 18 2
# North 870 290.00 35 3
# South 570 190.00 27 3
# Custom functions with named aggregations
def coefficient_of_variation(x):
return x.std() / x.mean() * 100
result = df.groupby('region').agg(
total_sales=('sales', 'sum'),
sales_cv=('sales', coefficient_of_variation),
sales_range=('sales', lambda x: x.max() - x.min())
)
print(result)
Named aggregations are clearer than the dictionary syntax and should be your default choice for complex aggregations.
Filtering Groups and Handling Missing Data
Sometimes you need to filter out entire groups based on group-level criteria. The filter() method handles this elegantly.
# Keep only regions with more than 2 transactions
filtered = df.groupby('region').filter(lambda x: len(x) > 2)
print(filtered)
# region product sales quantity
# 0 North Widget 250 10
# 1 South Widget 150 8
# 2 North Gadget 300 12
# 4 South Gadget 225 9
# 6 North Widget 320 13
# 7 South Widget 195 10
# Keep groups where average sales exceed threshold
high_performers = df.groupby('region').filter(lambda x: x['sales'].mean() > 200)
For missing data: by default, GroupBy drops rows whose value in a grouping column is NaN. Control this with the dropna parameter.
# Sample data with missing values
df_with_nan = df.copy()
df_with_nan.loc[0, 'region'] = np.nan
# Default: excludes NaN groups
print(df_with_nan.groupby('region')['sales'].sum())
# Include NaN as a group
print(df_with_nan.groupby('region', dropna=False)['sales'].sum())
Performance Tips and Common Pitfalls
GroupBy performance matters when you’re working with millions of rows. Here are concrete optimizations.
Convert to categorical before grouping. This is the single biggest performance win for columns with repeated values.
import time
# Create larger dataset for benchmarking
large_df = pd.DataFrame({
'region': np.random.choice(['North', 'South', 'East', 'West'], size=1_000_000),
'sales': np.random.randint(100, 1000, size=1_000_000)
})
# Without categorical
start = time.time()
result1 = large_df.groupby('region')['sales'].sum()
print(f"String groupby: {time.time() - start:.3f}s")
# With categorical
large_df['region_cat'] = large_df['region'].astype('category')
start = time.time()
result2 = large_df.groupby('region_cat')['sales'].sum()
print(f"Categorical groupby: {time.time() - start:.3f}s")
# Typically 2-5x faster
Avoid apply() when possible. It’s the slowest GroupBy method because it can’t be vectorized. If you can express your operation with agg() or transform(), do so.
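As an illustration, a per-group demeaning step written with apply() can usually be re-expressed with transform(), which lets the arithmetic run as one vectorized operation over the whole column:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South'],
    'sales': [250, 150, 300, 190]
})

# Slow: apply() calls a Python function once per group
slow = df.groupby('region')['sales'].apply(lambda s: s - s.mean())

# Fast: transform() broadcasts each group's mean back to its rows,
# so the subtraction is a single vectorized column operation
fast = df['sales'] - df.groupby('region')['sales'].transform('mean')
```

Both produce the same demeaned values; only the second avoids per-group Python overhead.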
Use built-in aggregation functions. String names like 'sum' and 'mean' are optimized in Cython. Lambda functions force slower Python execution.
# Fast: uses optimized Cython code
df.groupby('region')['sales'].agg('sum')
# Slower: Python function call overhead
df.groupby('region')['sales'].agg(lambda x: x.sum())
Reset index strategically. If you’re chaining operations, keeping the grouped index can be useful. If you’re done aggregating, reset_index() or as_index=False gives you a cleaner DataFrame to work with.
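A small sketch of both situations, on a toy frame: keep the grouped index while chaining label-based steps, and flatten it when you want to merge results back onto the raw rows (the region_total column name here is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'sales': [250, 150, 300, 275]
})

# Keeping the index: chain label-based operations directly
top = df.groupby('region')['sales'].sum().sort_values(ascending=False)

# Done aggregating: flatten, rename, and merge back onto the raw rows
totals = df.groupby('region', as_index=False)['sales'].sum()
totals = totals.rename(columns={'sales': 'region_total'})
enriched = df.merge(totals, on='region')
```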
GroupBy is fundamental to pandas data analysis. Learn these patterns, understand when to use each method, and your data manipulation code will be cleaner, faster, and more maintainable.