How to GroupBy Multiple Columns in Pandas


Key Insights

  • Multi-column groupby creates hierarchical groupings that let you analyze data across multiple dimensions simultaneously—essential for any real-world data analysis beyond toy examples.
  • Named aggregations with the agg() method give you precise control over output column names and which functions apply to which columns, eliminating the messy MultiIndex columns that plague beginners.
  • Always consider resetting your index after groupby operations unless you specifically need hierarchical indexing—flat DataFrames are easier to work with and play nicer with other tools.

Introduction

Single-column groupby operations are fine for tutorials, but real data analysis rarely works that way. You need to group sales by region and product category. You need to analyze user behavior by subscription tier and signup month. You need to aggregate metrics by department and quarter.

Multi-column groupby is where Pandas becomes genuinely useful for business analytics. Once you understand how to group by multiple columns and apply sophisticated aggregations, you can answer complex questions with just a few lines of code.

This article covers everything you need to know: the syntax, aggregation patterns, handling the resulting MultiIndex, and performance considerations that matter when your DataFrames grow beyond a few thousand rows.

Basic Syntax for Multi-Column GroupBy

The syntax is straightforward—pass a list of column names instead of a single string:

import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['A', 'B', 'A', 'B', 'A', 'A'],
    'sales': [100, 150, 200, 120, 180, 90],
    'quantity': [10, 15, 20, 12, 18, 9]
})

# Group by multiple columns
grouped = df.groupby(['region', 'product'])

# Apply aggregation
result = grouped['sales'].sum()
print(result)

Output:

region  product
North   A          280
        B          150
South   A          290
        B          120
Name: sales, dtype: int64

Notice the hierarchical index structure. Pandas creates a MultiIndex where each unique combination of region and product becomes a single group. The order matters—['region', 'product'] groups by region first, then by product within each region.
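To see the effect of ordering, here is the same aggregation with the list reversed, so product becomes the outer level and region the inner one (a quick sketch using the same sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['A', 'B', 'A', 'B', 'A', 'A'],
    'sales': [100, 150, 200, 120, 180, 90],
})

# Product is now the outer index level, region the inner level
result = df.groupby(['product', 'region'])['sales'].sum()
print(result)
# A-North: 280, A-South: 290, B-North: 150, B-South: 120
```

The totals are identical; only the index hierarchy changes, which matters when you slice by level later.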

You can aggregate multiple columns at once:

result = df.groupby(['region', 'product'])[['sales', 'quantity']].mean()
print(result)

Output:

                 sales  quantity
region product                  
North  A         140.0      14.0
       B         150.0      15.0
South  A         145.0      14.5
       B         120.0      12.0

Applying Aggregation Functions

Basic aggregations like .sum() and .mean() only get you so far. The .agg() method is where multi-column groupby becomes powerful.

Different Functions for Different Columns

Use a dictionary to specify which aggregation applies to which column:

result = df.groupby(['region', 'product']).agg({
    'sales': 'sum',
    'quantity': 'mean'
})
print(result)

Output:

                 sales  quantity
region product                  
North  A           280      14.0
       B           150      15.0
South  A           290      14.5
       B           120      12.0

Multiple Aggregations Per Column

Apply multiple functions to the same column:

result = df.groupby(['region', 'product']).agg({
    'sales': ['sum', 'mean', 'count'],
    'quantity': ['min', 'max']
})
print(result)

This creates a MultiIndex for columns, which gets ugly fast. The column names become tuples like ('sales', 'sum').
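If you do end up with tuple columns, one common workaround is to join each tuple into a flat name before moving on (a sketch; the `_`-joined names are just a convention):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['A', 'B', 'A'],
    'sales': [100, 150, 200],
})

result = df.groupby(['region', 'product']).agg({'sales': ['sum', 'mean']})
# Columns are tuples like ('sales', 'sum'); join them into flat strings
result.columns = ['_'.join(col) for col in result.columns]
print(result.columns.tolist())  # ['sales_sum', 'sales_mean']
```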

Named Aggregations (The Better Way)

Named aggregations solve the messy column problem. This syntax is cleaner and produces flat column names:

result = df.groupby(['region', 'product']).agg(
    total_sales=('sales', 'sum'),
    avg_sales=('sales', 'mean'),
    transaction_count=('sales', 'count'),
    min_quantity=('quantity', 'min'),
    max_quantity=('quantity', 'max')
)
print(result)

Output:

                 total_sales  avg_sales  transaction_count  min_quantity  max_quantity
region product                                                                        
North  A                 280      140.0                  2            10            18
       B                 150      150.0                  1            15            15
South  A                 290      145.0                  2             9            20
       B                 120      120.0                  1            12            12

This is the pattern I use in production code. The output columns have meaningful names, and there’s no post-processing needed to flatten anything.

Working with the Resulting MultiIndex

The hierarchical index looks elegant in examples but causes headaches in practice. Most downstream operations—merging, exporting to CSV, feeding into visualization libraries—work better with flat DataFrames.

Flattening with reset_index()

The simplest fix:

result = df.groupby(['region', 'product']).agg(
    total_sales=('sales', 'sum')
).reset_index()

print(result)

Output:

  region product  total_sales
0  North       A          280
1  North       B          150
2  South       A          290
3  South       B          120

Now region and product are regular columns. This is almost always what you want.
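An equivalent option is passing `as_index=False` to `groupby`, which skips creating the MultiIndex in the first place:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['A', 'B', 'A', 'B', 'A', 'A'],
    'sales': [100, 150, 200, 120, 180, 90],
})

# as_index=False keeps the grouping keys as regular columns
result = df.groupby(['region', 'product'], as_index=False).agg(
    total_sales=('sales', 'sum')
)
print(result)
```

The result is the same flat DataFrame as the `reset_index()` version, just declared up front.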

Accessing Data in a MultiIndex

If you keep the MultiIndex, here’s how to access specific groups:

# Keep the MultiIndex
result = df.groupby(['region', 'product'])['sales'].sum()

# Access a specific group with tuple
print(result.loc[('North', 'A')])  # Output: 280

# Access all products in a region using xs()
print(result.xs('North', level='region'))

Output of xs():

product
A    280
B    150
Name: sales, dtype: int64

The .xs() method (cross-section) is useful when you need to slice along one level of the index while keeping the structure intact.

Filtering Groups with filter() and transform()

Sometimes you need to filter entire groups based on aggregate conditions, or add aggregated values back to each row.

Filtering Groups

The .filter() method keeps or drops entire groups based on a condition:

# Keep only region-product combinations with more than 1 transaction
filtered = df.groupby(['region', 'product']).filter(
    lambda x: len(x) > 1
)
print(filtered)

Output:

  region product  sales  quantity
0  North       A    100        10
2  South       A    200        20
4  North       A    180        18
5  South       A     90         9

Groups with only one row (North-B and South-B) are excluded entirely.
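A boolean-mask alternative achieves the same filtering without calling a Python lambda per group, which is often faster on large frames (a sketch using `transform('size')`):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['A', 'B', 'A', 'B', 'A', 'A'],
    'sales': [100, 150, 200, 120, 180, 90],
    'quantity': [10, 15, 20, 12, 18, 9],
})

# Broadcast each group's row count back to every row, then mask
group_sizes = df.groupby(['region', 'product'])['sales'].transform('size')
filtered = df[group_sizes > 1]
print(filtered)
```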

Transform for Row-Level Results

The .transform() method returns results aligned with the original DataFrame:

# Add group mean as a new column
df['group_avg_sales'] = df.groupby(['region', 'product'])['sales'].transform('mean')
print(df)

Output:

  region product  sales  quantity  group_avg_sales
0  North       A    100        10            140.0
1  North       B    150        15            150.0
2  South       A    200        20            145.0
3  South       B    120        12            120.0
4  North       A    180        18            140.0
5  South       A     90         9            145.0

Each row gets the mean of its group. This is invaluable for calculating deviations from group averages or creating normalized metrics.
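For instance, a deviation-from-group-average column takes one more line (continuing with the same sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['A', 'B', 'A', 'B', 'A', 'A'],
    'sales': [100, 150, 200, 120, 180, 90],
})

# Broadcast the group mean to each row, then subtract
group_avg = df.groupby(['region', 'product'])['sales'].transform('mean')
df['sales_vs_group_avg'] = df['sales'] - group_avg
print(df)
# Row 0 (North-A, sales=100, group mean 140.0) gets -40.0
```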

Practical Use Case: Sales Data Analysis

Let’s work through a realistic example. You have sales data and need to analyze performance by region and product category:

# Create realistic sales data
np.random.seed(42)
n_rows = 1000

sales_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=n_rows, freq='h'),
    'region': np.random.choice(['Northeast', 'Southeast', 'Midwest', 'West'], n_rows),
    'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Sports'], n_rows),
    'revenue': np.random.uniform(50, 500, n_rows).round(2),
    'units_sold': np.random.randint(1, 20, n_rows),
    'customer_id': np.random.randint(1000, 9999, n_rows)
})

# Comprehensive analysis by region and category
summary = sales_data.groupby(['region', 'category']).agg(
    total_revenue=('revenue', 'sum'),
    avg_order_value=('revenue', 'mean'),
    total_units=('units_sold', 'sum'),
    num_transactions=('revenue', 'count'),
    unique_customers=('customer_id', 'nunique')
).reset_index()

# Calculate revenue per customer
summary['revenue_per_customer'] = (
    summary['total_revenue'] / summary['unique_customers']
).round(2)

# Sort by total revenue
summary = summary.sort_values('total_revenue', ascending=False)
print(summary.head(10))

This gives you a complete breakdown of performance across all region-category combinations with meaningful metrics.
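A natural follow-up question is which category leads each region. One way to answer it is `idxmax` on the aggregated frame (a sketch that rebuilds a reduced version of the sample data, omitting the date and customer columns for brevity):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n_rows = 1000
sales_data = pd.DataFrame({
    'region': np.random.choice(['Northeast', 'Southeast', 'Midwest', 'West'], n_rows),
    'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Sports'], n_rows),
    'revenue': np.random.uniform(50, 500, n_rows).round(2),
})

# Total revenue per region-category combination
totals = sales_data.groupby(['region', 'category'], as_index=False)['revenue'].sum()

# idxmax gives the row position of the top category within each region
top = totals.loc[totals.groupby('region')['revenue'].idxmax()]
print(top)
```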

Performance Tips

Multi-column groupby can get slow with large DataFrames. Here’s how to keep things fast:

Use sort=False When Order Doesn’t Matter

# Faster if you don't need sorted output
result = df.groupby(['region', 'product'], sort=False).sum()

Pandas sorts groups by default. If you’re just aggregating and don’t care about order, skip it.

Convert to Categorical

# Convert string columns to categorical
df['region'] = df['region'].astype('category')
df['product'] = df['product'].astype('category')

# Now groupby is faster; observed=True skips unused category combinations
result = df.groupby(['region', 'product'], observed=True).sum()

Categorical dtypes use integer codes internally, making groupby operations significantly faster on string columns.

Avoid Chained Operations

# Slower: multiple passes through data
result = df.groupby(['region', 'product'])['sales'].sum()
result2 = df.groupby(['region', 'product'])['quantity'].mean()

# Faster: single pass
result = df.groupby(['region', 'product']).agg(
    total_sales=('sales', 'sum'),
    avg_quantity=('quantity', 'mean')
)

Each groupby operation iterates through your data. Combine aggregations into a single .agg() call whenever possible.

Multi-column groupby is a fundamental skill for data analysis in Pandas. Master the patterns in this article—especially named aggregations and proper index handling—and you’ll handle most real-world aggregation tasks efficiently.
