Pandas - GroupBy Multiple Columns
Key Insights
• GroupBy with multiple columns creates hierarchical indexes that enable multi-dimensional data aggregation, essential for analyzing data across multiple categorical dimensions simultaneously.
• Understanding the difference between stacked (MultiIndex) and unstacked (pivoted) groupby results is critical for choosing the right data structure for your analysis pipeline.
• Performance optimization through categorical data types and proper column ordering can reduce groupby operation time by 50-80% on large datasets.
Basic Multi-Column GroupBy Syntax
GroupBy operations with multiple columns in pandas create hierarchical groupings where data is first grouped by the first column, then within each group, further subdivided by the second column, and so on.
import pandas as pd
import numpy as np
# Sample sales data
data = {
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2'],
    'revenue': [100, 150, 120, 180, 110, 200],
    'units': [10, 15, 12, 18, 11, 20]
}
df = pd.DataFrame(data)
# Group by multiple columns
grouped = df.groupby(['region', 'product'])
print(grouped[['revenue', 'units']].sum())  # select numeric columns; 'quarter' holds strings
Output:
                revenue  units
region product
North  A            210     21
       B            150     15
South  A            120     12
       B            380     38
The resulting MultiIndex structure organizes data hierarchically, making it straightforward to access specific group combinations or perform operations at different aggregation levels.
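For example, the aggregated frame above can be sliced by a tuple key, or re-aggregated at a single index level. A self-contained sketch (it rebuilds the sample data so it runs standalone):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'revenue': [100, 150, 120, 180, 110, 200],
    'units': [10, 15, 12, 18, 11, 20],
})

totals = df.groupby(['region', 'product']).sum()

# Access one specific group combination via a tuple key
north_a = totals.loc[('North', 'A')]  # revenue 210, units 21

# Re-aggregate at the outer level only
by_region = totals.groupby(level='region').sum()
```

The `level=` argument lets you collapse one dimension of the hierarchy without re-running the full groupby from the raw rows.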
Multiple Aggregation Functions
Apply different aggregation functions to different columns or multiple functions to the same column using agg().
# Different aggregations per column
result = df.groupby(['region', 'product']).agg({
    'revenue': ['sum', 'mean', 'std'],
    'units': ['sum', 'max'],
    'quarter': 'count'
})
print(result)
# Named aggregations (pandas >= 0.25.0)
result_named = df.groupby(['region', 'product']).agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean'),
    total_units=('units', 'sum'),
    transaction_count=('quarter', 'count')
)
print(result_named)
Named aggregations produce cleaner column names and eliminate the MultiIndex column structure, making downstream processing simpler.
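The difference in column structure is easy to verify. A minimal sketch comparing the two styles:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'revenue': [100, 150, 120],
})

# Dict-style agg -> MultiIndex columns like ('revenue', 'sum')
dict_style = df.groupby('region').agg({'revenue': ['sum', 'mean']})

# Named agg -> flat, descriptive column names
named = df.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean'),
)
```

With named aggregation, `named['total_revenue']` works directly; the dict-style result needs tuple indexing like `dict_style[('revenue', 'sum')]`.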
Flattening and Reshaping GroupBy Results
Multi-column groupby operations create MultiIndex structures. Choose between keeping them hierarchical or flattening based on your use case.
# Reset index to convert MultiIndex to columns
flat_result = df.groupby(['region', 'product']).sum(numeric_only=True).reset_index()
print(flat_result)
# Unstack to pivot one level of the index
pivoted = df.groupby(['region', 'product'])['revenue'].sum().unstack()
print(pivoted)
# Unstack with fill_value for missing combinations
pivoted_filled = df.groupby(['region', 'product'])['revenue'].sum().unstack(fill_value=0)
print(pivoted_filled)
# Flatten MultiIndex columns after complex aggregations
multi_agg = df.groupby(['region', 'product']).agg({
    'revenue': ['sum', 'mean'],
    'units': 'sum'
})
# Flatten column MultiIndex
multi_agg.columns = ['_'.join(col).strip() for col in multi_agg.columns.values]
multi_agg = multi_agg.reset_index()
print(multi_agg)
Filtering Groups Based on Aggregate Conditions
Filter entire groups based on aggregate conditions using filter(), which differs from row-level filtering.
# Keep only groups where total revenue exceeds 200
high_revenue_groups = df.groupby(['region', 'product']).filter(
    lambda x: x['revenue'].sum() > 200
)
print(high_revenue_groups)
# Multiple filter conditions
filtered = df.groupby(['region', 'product']).filter(
    lambda x: (x['revenue'].sum() > 150) and (x['units'].mean() > 15)
)
print(filtered)
# Alternative: filter after aggregation
agg_result = df.groupby(['region', 'product']).sum(numeric_only=True)
filtered_agg = agg_result[agg_result['revenue'] > 200]
print(filtered_agg)
The filter() method returns the original DataFrame rows that belong to qualifying groups, while post-aggregation filtering returns the aggregated results themselves.
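The distinction shows up directly in the shapes: `filter()` keeps the original rows and columns, while post-aggregation filtering returns one row per qualifying group. A sketch (rebuilding the sample data so it runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'revenue': [100, 150, 120, 180, 110, 200],
})

# filter(): original rows from qualifying groups, original columns intact
rows = df.groupby(['region', 'product']).filter(lambda g: g['revenue'].sum() > 200)

# Post-aggregation filtering: one aggregated row per qualifying group
agg = df.groupby(['region', 'product'])['revenue'].sum()
groups = agg[agg > 200]
```

Here two groups qualify (North/A with 210 and South/B with 380), so `rows` has the four underlying transactions while `groups` has just two aggregated entries.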
Advanced Grouping with Custom Functions
Apply custom transformation and aggregation logic beyond built-in functions.
# Custom aggregation function
def revenue_per_unit(group):
    return group['revenue'].sum() / group['units'].sum()
custom_agg = df.groupby(['region', 'product']).apply(revenue_per_unit)
print(custom_agg)
# Multiple custom calculations
def group_metrics(group):
    return pd.Series({
        'revenue_per_unit': group['revenue'].sum() / group['units'].sum(),
        'max_transaction': group['revenue'].max(),
        'transactions': len(group)
    })
detailed_metrics = df.groupby(['region', 'product']).apply(group_metrics)
print(detailed_metrics)
# Transform: broadcast group-level statistics back to original rows
df['group_avg_revenue'] = df.groupby(['region', 'product'])['revenue'].transform('mean')
df['pct_of_group_total'] = df['revenue'] / df.groupby(['region', 'product'])['revenue'].transform('sum')
print(df)
Performance Optimization Strategies
Optimize groupby operations on large datasets through strategic data type choices and operation ordering.
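Besides dtype choices, "operation ordering" means filtering rows and selecting columns before the groupby, so the grouping machinery touches less data. A sketch with synthetic data (the column names mirror the examples above; the sizes are arbitrary):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    'region': rng.choice(['North', 'South', 'East', 'West'], size=n),
    'product': rng.choice(list('ABCD'), size=n),
    'revenue': rng.integers(50, 500, size=n),
})

# Filter rows and select only the needed columns *before* grouping,
# so the groupby operates on a smaller frame
slim = df.loc[df['revenue'] > 100, ['region', 'product', 'revenue']]
result = slim.groupby(['region', 'product'])['revenue'].sum()
```

The same principle applies to expensive per-row computations: do them after aggregation when possible, not before.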
# Convert to categorical for repeated groupby operations
df_optimized = df.copy()
df_optimized['region'] = df_optimized['region'].astype('category')
df_optimized['product'] = df_optimized['product'].astype('category')
df_optimized['quarter'] = df_optimized['quarter'].astype('category')
# Benchmark difference
import time
large_df = pd.concat([df] * 10000, ignore_index=True)
start = time.time()
result1 = large_df.groupby(['region', 'product'])[['revenue', 'units']].sum()
print(f"Standard: {time.time() - start:.4f}s")
large_df_cat = large_df.copy()
large_df_cat['region'] = large_df_cat['region'].astype('category')
large_df_cat['product'] = large_df_cat['product'].astype('category')
start = time.time()
result2 = large_df_cat.groupby(['region', 'product'])[['revenue', 'units']].sum()
print(f"Categorical: {time.time() - start:.4f}s")
# Use observed=True to skip empty category combinations
result_observed = large_df_cat.groupby(
    ['region', 'product'],
    observed=True
)[['revenue', 'units']].sum()
Grouping with Time-Based Columns
Combine categorical and temporal grouping for time-series analysis.
# Time-series grouping example
ts_data = {
    'date': pd.date_range('2024-01-01', periods=12, freq='ME'),  # 'ME' = month end ('M' before pandas 2.2)
    'region': ['North', 'South'] * 6,
    'product': ['A', 'A', 'B', 'B'] * 3,
    'revenue': np.random.randint(100, 300, 12)
}
ts_df = pd.DataFrame(ts_data)
# Group by region, product, and quarter
ts_df['quarter'] = ts_df['date'].dt.to_period('Q')
quarterly = ts_df.groupby(['region', 'product', 'quarter']).agg({
    'revenue': ['sum', 'mean'],
    'date': 'count'
})
print(quarterly)
# Resample with groupby
ts_df.set_index('date', inplace=True)
quarterly_by_region = ts_df.groupby('region').resample('QE')['revenue'].sum()  # 'QE' = quarter end ('Q' before pandas 2.2)
print(quarterly_by_region)
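An alternative that avoids the set_index/resample round-trip is pd.Grouper, which folds the time bucketing directly into the groupby keys. A sketch using the quarter-start ('QS') and month-start ('MS') aliases, which are valid across pandas versions (the revenue values are synthetic):

```python
import pandas as pd
import numpy as np

ts_df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=12, freq='MS'),
    'region': ['North', 'South'] * 6,
    'revenue': np.arange(12) * 10 + 100,
})

# pd.Grouper bundles the temporal bucketing into the groupby keys,
# so the DataFrame keeps its default integer index throughout
quarterly = ts_df.groupby(['region', pd.Grouper(key='date', freq='QS')])['revenue'].sum()
```

This form also composes naturally with more grouping columns and with named aggregations.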
Handling Missing Values in Groups
Control how NaN values affect groupby operations and results.
# Data with missing values
data_nan = {
    'region': ['North', 'North', 'South', None, 'North'],
    'product': ['A', 'B', 'A', 'B', None],
    'revenue': [100, 150, np.nan, 180, 110]
}
df_nan = pd.DataFrame(data_nan)
# Default (dropna=True): rows with NaN in any grouping column are silently dropped
result_default = df_nan.groupby(['region', 'product']).sum()
print(result_default)
# dropna=False to include NaN as a group key (pandas >= 1.1.0)
result_with_nan = df_nan.groupby(['region', 'product'], dropna=False).sum()
print(result_with_nan)
# Handle NaN in aggregated columns
result_mean = df_nan.groupby(['region', 'product'])['revenue'].mean() # Ignores NaN
result_count = df_nan.groupby(['region', 'product'])['revenue'].count() # Excludes NaN
result_size = df_nan.groupby(['region', 'product']).size() # Includes NaN
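When missing keys should be kept but labeled explicitly, another option is to fill them before grouping, so they survive the default dropna=True without extra flags. A sketch reusing the df_nan data above (the 'Unknown' label is an arbitrary choice):

```python
import pandas as pd
import numpy as np

df_nan = pd.DataFrame({
    'region': ['North', 'North', 'South', None, 'North'],
    'product': ['A', 'B', 'A', 'B', None],
    'revenue': [100, 150, np.nan, 180, 110],
})

# Replace missing grouping keys with an explicit label before grouping
filled = df_nan.fillna({'region': 'Unknown', 'product': 'Unknown'})
result = filled.groupby(['region', 'product'])['revenue'].sum()
```

Unlike dropna=False, this gives the missing-key groups a readable name that survives reset_index, plotting, and export.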
Multi-column groupby operations form the foundation of dimensional data analysis in pandas. Master these patterns to efficiently aggregate, transform, and analyze complex datasets across multiple categorical dimensions.