Pandas - GroupBy and Mean/Average
Key Insights
- GroupBy operations in Pandas enable efficient aggregation of data by categorical variables, with mean calculations being one of the most common statistical operations for summarizing grouped data
- The `groupby()` method creates a GroupBy object that lazily evaluates operations, allowing you to chain multiple aggregation functions and access groups individually for complex analysis
- Performance optimization techniques like categorical data types, sorting before grouping, and selecting specific columns can dramatically reduce computation time on large datasets
Basic GroupBy with Mean
The `groupby()` method splits data into groups based on one or more columns, then applies an aggregation function. Here's the fundamental syntax for calculating means:
```python
import pandas as pd
import numpy as np

# Sample sales data
data = {
    'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'A', 'B'],
    'sales': [150, 200, 175, 220, 190, 210, 165],
    'units': [10, 15, 12, 18, 14, 16, 11]
}
df = pd.DataFrame(data)

# Calculate mean sales by region
mean_by_region = df.groupby('region')['sales'].mean()
print(mean_by_region)
```
Output:
```
region
East     215.000000
North    163.333333
South    195.000000
Name: sales, dtype: float64
```
The result is a Series with the grouping column as the index. To reset the index and return a DataFrame:
```python
mean_df = df.groupby('region')['sales'].mean().reset_index()
mean_df.columns = ['region', 'avg_sales']
print(mean_df)
```
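An alternative that avoids the `reset_index()` step is the `as_index=False` argument, which keeps the grouping column as a regular column. A minimal sketch, assuming the same sample data as above (trimmed to the two relevant columns):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North'],
    'sales': [150, 200, 175, 220, 190, 210, 165],
})

# as_index=False keeps 'region' as an ordinary column in a DataFrame result
mean_df = df.groupby('region', as_index=False)['sales'].mean()
print(mean_df)
```

The result is already a flat DataFrame, so only a rename (if desired) remains.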
Multiple Column Grouping
Group by multiple columns to create hierarchical aggregations:
```python
# Mean sales by region and product
multi_group = df.groupby(['region', 'product'])['sales'].mean()
print(multi_group)
```
Output:
```
region  product
East    A          210.0
        B          220.0
North   A          162.5
        B          165.0
South   A          190.0
        B          200.0
Name: sales, dtype: float64
```
Convert the MultiIndex to columns for easier manipulation:
```python
result = df.groupby('region', 'product')['sales'].mean().reset_index()
result.columns = ['region', 'product', 'avg_sales']
print(result)
```
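Another option for a two-level grouping is `unstack()`, which pivots the inner index level into columns, turning the region/product MultiIndex Series into a region-by-product table. A sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'A', 'B'],
    'sales': [150, 200, 175, 220, 190, 210, 165],
})

# unstack() moves the 'product' index level into columns
table = df.groupby(['region', 'product'])['sales'].mean().unstack()
print(table)
```

This produces the same numbers as a pivot table, with one row per region and one column per product.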
Aggregating Multiple Columns
Calculate means across multiple numeric columns simultaneously:
```python
# Mean of all numeric columns by region
multi_col_mean = df.groupby('region')[['sales', 'units']].mean()
print(multi_col_mean)
```
Output:
```
             sales  units
region
East    215.000000   17.0
North   163.333333   11.0
South   195.000000   14.5
```
Use `agg()` for more control over which functions apply to which columns:
```python
# Different aggregations for different columns
result = df.groupby('region').agg({
    'sales': 'mean',
    'units': ['mean', 'sum']
})
print(result)
```
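When `agg()` receives a list of functions for a column, the result has MultiIndex columns. One common cleanup, a sketch rather than the only option, is to join the two levels into flat names:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North'],
    'sales': [150, 200, 175, 220, 190, 210, 165],
    'units': [10, 15, 12, 18, 14, 16, 11],
})

result = df.groupby('region').agg({'sales': 'mean', 'units': ['mean', 'sum']})

# Join the two column levels into flat names like 'units_sum'
result.columns = ['_'.join(col) for col in result.columns]
print(result)
```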
Named Aggregations
Pandas 0.25+ supports named aggregations for cleaner column names:
```python
result = df.groupby('region').agg(
    avg_sales=('sales', 'mean'),
    avg_units=('units', 'mean'),
    total_units=('units', 'sum'),
    max_sales=('sales', 'max')
)
print(result)
```
Output:
```
         avg_sales  avg_units  total_units  max_sales
region
East    215.000000       17.0           34        220
North   163.333333       11.0           33        175
South   195.000000       14.5           29        200
```
This approach eliminates the need for column renaming after aggregation.
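Named-aggregation keywords must be valid Python identifiers; if a desired output name contains spaces or other awkward characters, the same syntax still works via dictionary unpacking. A small sketch with a trimmed-down version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'sales': [150, 200, 175, 220],
})

# **{} unpacking allows output column names that aren't valid keywords
result = df.groupby('region').agg(**{'avg sales': ('sales', 'mean')})
print(result)
```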
Custom Aggregation Functions
Apply custom functions using lambda expressions or defined functions:
```python
# Calculate weighted average
def weighted_mean(group):
    return np.average(group['sales'], weights=group['units'])

weighted_avg = df.groupby('region').apply(weighted_mean)
print(weighted_avg)
```
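As a quick sanity check on the weighted average, the North figure can be reproduced by hand from the sample data: (150·10 + 175·12 + 165·11) / (10 + 12 + 11). A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North'],
    'sales': [150, 200, 175, 220, 190, 210, 165],
    'units': [10, 15, 12, 18, 14, 16, 11],
})

def weighted_mean(group):
    return np.average(group['sales'], weights=group['units'])

weighted_avg = df.groupby('region').apply(weighted_mean)

# Manual check for North: (150*10 + 175*12 + 165*11) / 33
manual_north = (150 * 10 + 175 * 12 + 165 * 11) / (10 + 12 + 11)
print(weighted_avg['North'], manual_north)
```

Note that a units-weighted mean differs from the plain mean whenever high-volume rows have unusual sales values.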
For multiple custom calculations:
```python
def custom_stats(group):
    return pd.Series({
        'mean_sales': group['sales'].mean(),
        'median_sales': group['sales'].median(),
        'std_sales': group['sales'].std(),
        'coefficient_variation': group['sales'].std() / group['sales'].mean()
    })

stats = df.groupby('region').apply(custom_stats)
print(stats)
```
Handling Missing Values
GroupBy aggregations skip NaN values within each group by default, just as `Series.mean()` does:
```python
# Data with missing values
data_missing = {
    'category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'value': [10, 20, np.nan, 25, 15, np.nan]
}
df_missing = pd.DataFrame(data_missing)

# By default, mean() skips NaN values
print(df_missing.groupby('category')['value'].mean())
```
Output:
```
category
A    12.5
B    22.5
Name: value, dtype: float64
```
To broadcast each group's mean back onto the original rows, use `transform()`, which preserves the original index:
```python
# Using transform to preserve the original index
df_missing['group_mean'] = df_missing.groupby('category')['value'].transform('mean')
print(df_missing)
```
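Missing values in the grouping column itself are a separate case: such rows are silently dropped from the result by default. Pandas 1.1+ accepts `dropna=False` to keep a NaN group. A sketch with small hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', np.nan, 'A'],
    'value': [10, 20, 30, 14],
})

# dropna=False keeps the NaN key as its own group in the output
with_nan = df.groupby('category', dropna=False)['value'].mean()
print(with_nan)
```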
Performance Optimization
For large datasets, optimize GroupBy operations:
```python
# Convert to categorical for faster grouping
df['region'] = df['region'].astype('category')
df['product'] = df['product'].astype('category')

# Select only necessary columns before grouping
result = df[['region', 'sales']].groupby('region').mean()

# Sort before grouping if data is naturally ordered
df_sorted = df.sort_values('region')
result = df_sorted.groupby('region')['sales'].mean()
```
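One caveat with categorical grouping keys: by default the result can include every declared category, even ones absent from the data, producing NaN rows. Passing `observed=True` restricts output to categories that actually occur. A sketch with toy data (the three-category dtype is hypothetical):

```python
import pandas as pd

# 'East' is a declared category but never appears in the data
regions = pd.Series(['North', 'South', 'North'],
                    dtype=pd.CategoricalDtype(['North', 'South', 'East']))
df = pd.DataFrame({'region': regions, 'sales': [150, 200, 175]})

# observed=True drops the empty 'East' group from the result
result = df.groupby('region', observed=True)['sales'].mean()
print(result)
```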
Benchmark different approaches:
```python
import time

# Create larger dataset
large_df = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000),
    'value': np.random.randn(1000000)
})

# Method 1: Standard groupby
start = time.time()
result1 = large_df.groupby('category')['value'].mean()
print(f"Standard: {time.time() - start:.4f}s")

# Method 2: With categorical dtype
large_df['category'] = large_df['category'].astype('category')
start = time.time()
result2 = large_df.groupby('category')['value'].mean()
print(f"Categorical: {time.time() - start:.4f}s")
```
Filtering Groups After Aggregation
Filter groups based on aggregated values:
```python
# Only regions with average sales > 180
high_sales_regions = df.groupby('region')['sales'].mean()
high_sales_regions = high_sales_regions[high_sales_regions > 180]
print(high_sales_regions)
```
Use `filter()` to return the original rows from qualifying groups:
```python
# Return all rows from regions with mean sales > 180
filtered_df = df.groupby('region').filter(lambda x: x['sales'].mean() > 180)
print(filtered_df)
```
Combining GroupBy with Other Operations
Integrate GroupBy results back into the original DataFrame:
```python
# Add group means as a new column
df['region_avg_sales'] = df.groupby('region')['sales'].transform('mean')
df['deviation_from_avg'] = df['sales'] - df['region_avg_sales']
print(df)
```
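The same `transform()` pattern generalizes to any per-group normalization, for example a within-group z-score. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East', 'North'],
    'sales': [150, 200, 175, 220, 190, 210, 165],
})

# Standardize sales within each region: (x - group mean) / group std
grouped = df.groupby('region')['sales']
df['sales_z'] = (df['sales'] - grouped.transform('mean')) / grouped.transform('std')
print(df)
```

By construction, the z-scores average to zero within every region, which makes them useful as scale-free features.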
Calculate percentile ranks within groups:
```python
df['percentile_rank'] = df.groupby('region')['sales'].rank(pct=True)
print(df[['region', 'sales', 'percentile_rank']])
```
The GroupBy-mean pattern forms the foundation for exploratory data analysis, reporting dashboards, and feature engineering in machine learning pipelines. Master these techniques to efficiently summarize and analyze categorical data at scale.