Pandas - GroupBy and Mean/Average

Key Insights

  • GroupBy operations in Pandas enable efficient aggregation of data by categorical variables, with mean calculations being one of the most common statistical operations for summarizing grouped data
  • The groupby() method creates a GroupBy object that lazily evaluates operations, allowing you to chain multiple aggregation functions and access groups individually for complex analysis
  • Performance optimization techniques like categorical data types, selecting only the columns you need, and disabling group-key sorting with sort=False can dramatically reduce computation time on large datasets
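The lazy-evaluation point above is easy to see in code. A minimal sketch (using a toy DataFrame, not the sales data from the examples below):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'sales': [150, 200, 175, 220],
})

# groupby() alone computes nothing -- it returns a GroupBy object
gb = df.groupby('region')
print(type(gb).__name__)  # DataFrameGroupBy

# Individual groups can be inspected before any aggregation runs
print(gb.get_group('North'))

# The aggregation is only evaluated when you ask for it
print(gb['sales'].mean())
```

Because the GroupBy object is reusable, you can call several aggregations on the same `gb` without re-splitting the data.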

Basic GroupBy with Mean

The groupby() method splits data into groups based on one or more columns, then applies an aggregation function. Here's the fundamental syntax for calculating means:

import pandas as pd
import numpy as np

# Sample sales data
data = {
    'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'A', 'B'],
    'sales': [150, 200, 175, 220, 190, 210, 165],
    'units': [10, 15, 12, 18, 14, 16, 11]
}

df = pd.DataFrame(data)

# Calculate mean sales by region
mean_by_region = df.groupby('region')['sales'].mean()
print(mean_by_region)

Output:

region
East     215.0
North    163.333333
South    195.0
Name: sales, dtype: float64

The result is a Series with the grouping column as the index. To reset the index and return a DataFrame:

mean_df = df.groupby('region')['sales'].mean().reset_index()
mean_df.columns = ['region', 'avg_sales']
print(mean_df)

Multiple Column Grouping

Group by multiple columns to create hierarchical aggregations:

# Mean sales by region and product
multi_group = df.groupby(['region', 'product'])['sales'].mean()
print(multi_group)

Output:

region  product
East    A          210.0
        B          220.0
North   A          162.5
        B          165.0
South   A          190.0
        B          200.0
Name: sales, dtype: float64

Convert the MultiIndex to columns for easier manipulation:

result = df.groupby(['region', 'product'])['sales'].mean().reset_index()
result.columns = ['region', 'product', 'avg_sales']
print(result)
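An alternative to reset_index() is unstack(), which pivots the inner index level into columns, giving a region-by-product table of means. A sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'A', 'B'],
    'sales': [150, 200, 175, 220, 190, 210, 165],
})

# Pivot the 'product' level into columns: one row per region
pivot = df.groupby(['region', 'product'])['sales'].mean().unstack()
print(pivot)
```

This wide layout is often more convenient for heatmaps or side-by-side comparisons than the long reset_index() form.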

Aggregating Multiple Columns

Calculate means across multiple numeric columns simultaneously:

# Mean of all numeric columns by region
multi_col_mean = df.groupby('region')[['sales', 'units']].mean()
print(multi_col_mean)

Output:

             sales  units
region                   
East    215.000000   17.0
North   163.333333   11.0
South   195.000000   14.5

Use agg() for more control over which functions apply to which columns:

# Different aggregations for different columns
result = df.groupby('region').agg({
    'sales': 'mean',
    'units': ['mean', 'sum']
})
print(result)
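Passing a list of functions (as with 'units' above) produces MultiIndex columns. A common follow-up, sketched here, is flattening them into single-level names:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North'],
    'sales': [150, 200, 175, 220, 190, 210, 165],
    'units': [10, 15, 12, 18, 14, 16, 11],
})

result = df.groupby('region').agg({
    'sales': 'mean',
    'units': ['mean', 'sum']
})

# Join the two column levels into names like 'units_mean', 'units_sum'
result.columns = ['_'.join(col) for col in result.columns]
print(result.columns.tolist())  # ['sales_mean', 'units_mean', 'units_sum']
```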

Named Aggregations

Pandas 0.25+ supports named aggregations for cleaner column names:

result = df.groupby('region').agg(
    avg_sales=('sales', 'mean'),
    avg_units=('units', 'mean'),
    total_units=('units', 'sum'),
    max_sales=('sales', 'max')
)
print(result)

Output:

         avg_sales  avg_units  total_units  max_sales
region                                               
East     215.000000       17.0           34        220
North    163.333333       11.0           33        175
South    195.000000       14.5           29        200

This approach eliminates the need for column renaming after aggregation.

Custom Aggregation Functions

Apply custom functions using lambda expressions or defined functions:

# Calculate weighted average
def weighted_mean(group):
    return np.average(group['sales'], weights=group['units'])

weighted_avg = df.groupby('region').apply(weighted_mean)
print(weighted_avg)
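The same weighted average can also be computed without apply(), which is usually faster on large data. A sketch of the vectorized equivalent:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North'],
    'sales': [150, 200, 175, 220, 190, 210, 165],
    'units': [10, 15, 12, 18, 14, 16, 11],
})

# sum(sales * units) / sum(units), computed per region
weighted = (
    (df['sales'] * df['units']).groupby(df['region']).sum()
    / df.groupby('region')['units'].sum()
)
print(weighted)
```

Both forms compute sum(sales * units) / sum(units) per group; the vectorized version avoids the per-group Python function calls that apply() incurs.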

For multiple custom calculations:

def custom_stats(group):
    return pd.Series({
        'mean_sales': group['sales'].mean(),
        'median_sales': group['sales'].median(),
        'std_sales': group['sales'].std(),
        'coefficient_variation': group['sales'].std() / group['sales'].mean()
    })

stats = df.groupby('region').apply(custom_stats)
print(stats)

Handling Missing Values

By default, GroupBy aggregations skip NaN values within each group rather than propagating them:

# Data with missing values
data_missing = {
    'category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'value': [10, 20, np.nan, 25, 15, np.nan]
}

df_missing = pd.DataFrame(data_missing)

# By default, mean() skips NaN values
print(df_missing.groupby('category')['value'].mean())

Output:

category
A    12.5
B    22.5
Name: value, dtype: float64

To broadcast each group's mean back onto the original rows, preserving the DataFrame's index, use transform():

# Using transform to preserve original index
df_missing['group_mean'] = df_missing.groupby('category')['value'].transform('mean')
print(df_missing)
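transform() also enables group-wise imputation. A common pattern, sketched here under the assumption that each missing value should take its own group's mean:

```python
import pandas as pd
import numpy as np

df_missing = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'value': [10, 20, np.nan, 25, 15, np.nan],
})

# Fill each NaN with the mean of its own group
group_mean = df_missing.groupby('category')['value'].transform('mean')
df_missing['value_filled'] = df_missing['value'].fillna(group_mean)
print(df_missing)
```

Because transform() returns a Series aligned to the original index, fillna() matches each NaN to the correct group mean automatically.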

Performance Optimization

For large datasets, optimize GroupBy operations:

# Convert to categorical for faster grouping
df['region'] = df['region'].astype('category')
df['product'] = df['product'].astype('category')

# Select only necessary columns before grouping
result = df[['region', 'sales']].groupby('region').mean()

# Skip sorting of the group keys when output order doesn't matter
result = df.groupby('region', sort=False)['sales'].mean()
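One caveat with categorical grouping keys: by default, groupby() can emit rows for categories that never appear in the data. Passing observed=True keeps only the observed groups (newer pandas releases warn that this default is changing). A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'region': pd.Categorical(['North', 'South'],
                             categories=['North', 'South', 'East']),
    'sales': [150, 200],
})

# 'East' has no rows; observed=True drops it from the result
result = df.groupby('region', observed=True)['sales'].mean()
print(result.index.tolist())  # ['North', 'South']
```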

Benchmark different approaches:

import time

# Create larger dataset
large_df = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000),
    'value': np.random.randn(1000000)
})

# Method 1: Standard groupby
start = time.time()
result1 = large_df.groupby('category')['value'].mean()
print(f"Standard: {time.time() - start:.4f}s")

# Method 2: With categorical
large_df['category'] = large_df['category'].astype('category')
start = time.time()
result2 = large_df.groupby('category')['value'].mean()
print(f"Categorical: {time.time() - start:.4f}s")

Filtering Groups After Aggregation

Filter groups based on aggregated values:

# Only regions with average sales > 180
high_sales_regions = df.groupby('region')['sales'].mean()
high_sales_regions = high_sales_regions[high_sales_regions > 180]
print(high_sales_regions)

Use filter() to return original rows from qualifying groups:

# Return all rows from regions with mean sales > 180
filtered_df = df.groupby('region').filter(lambda x: x['sales'].mean() > 180)
print(filtered_df)

Combining GroupBy with Other Operations

Integrate GroupBy results back into the original DataFrame:

# Add group means as a new column
df['region_avg_sales'] = df.groupby('region')['sales'].transform('mean')
df['deviation_from_avg'] = df['sales'] - df['region_avg_sales']
print(df)

Calculate percentile ranks within groups:

df['percentile_rank'] = df.groupby('region')['sales'].rank(pct=True)
print(df[['region', 'sales', 'percentile_rank']])

The GroupBy-mean pattern forms the foundation for exploratory data analysis, reporting dashboards, and feature engineering in machine learning pipelines. Master these techniques to efficiently summarize and analyze categorical data at scale.
