How to Use Agg with Multiple Functions in Pandas
Key Insights
- The agg() method lets you apply multiple aggregation functions in a single operation, eliminating repetitive code and improving performance over chained single-function calls.
- Using dictionaries with agg() gives you precise control over which functions apply to which columns, while named aggregations produce clean, readable output without post-processing.
- Always flatten multi-level column indices after complex aggregations to maintain compatibility with downstream operations and avoid cryptic column access patterns.
Introduction to the Agg Method
Pandas provides convenient single-function aggregation methods like sum(), mean(), and max(). They work fine when you need one statistic. But real-world data analysis rarely stops at a single number. You typically want the sum and the mean and the standard deviation—all at once.
This is where agg() shines. Short for “aggregate,” this method accepts multiple functions and applies them in a single pass over your data. Instead of writing three separate lines and potentially iterating through your DataFrame multiple times, you declare everything upfront and let Pandas optimize the execution.
The method works on Series, DataFrames, and GroupBy objects, making it one of the most versatile tools in the Pandas arsenal. Once you internalize its patterns, you’ll find yourself reaching for it constantly.
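To see that versatility concretely, here is a minimal sketch of the same call shape applied to all three object types (the toy data and column names are illustrative, not from the article's examples):

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b'],
    'value': [1, 2, 3],
})

# On a Series: returns a Series indexed by function name
s = df['value'].agg(['sum', 'mean'])

# On a DataFrame: returns a DataFrame with one row per function
d = df[['value']].agg(['sum', 'mean'])

# On a GroupBy: returns one aggregated row per group
g = df.groupby('group')['value'].agg(['sum', 'mean'])
```

The same list of functions drives all three calls; only the shape of the result changes.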
Basic Syntax: Applying Multiple Functions to a Single Column
The simplest use case involves applying several aggregation functions to a single column. Pass a list of function names as strings, and Pandas returns a Series with each result labeled by its function name.
import pandas as pd
import numpy as np
# Sample sales data
df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget'],
'sales': [150, 200, 175, 225, 160],
'quantity': [10, 15, 12, 18, 11]
})
# Apply multiple functions to a single column
sales_stats = df['sales'].agg(['sum', 'mean', 'max', 'min', 'std'])
print(sales_stats)
Output:
sum 910.000000
mean 182.000000
max 225.000000
min 150.000000
std 30.536863
Name: sales, dtype: float64
You can also apply these functions to an entire DataFrame. Each column gets each function applied:
# Apply to all numeric columns
full_stats = df[['sales', 'quantity']].agg(['sum', 'mean', 'std'])
print(full_stats)
Output:
sales quantity
sum 910.00 66.00
mean 182.00 13.20
std 30.54 3.27
This produces a DataFrame where rows represent functions and columns represent your original columns. It’s a compact summary that would require multiple method calls otherwise.
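For comparison, here is what assembling the same summary by hand looks like (same toy data as above); a sketch of the verbose version that agg() replaces:

```python
import pandas as pd

df = pd.DataFrame({
    'sales': [150, 200, 175, 225, 160],
    'quantity': [10, 15, 12, 18, 11],
})

# One agg() call...
with_agg = df.agg(['sum', 'mean', 'std'])

# ...versus building the same table from six separate method calls
by_hand = pd.DataFrame({
    'sales': [df['sales'].sum(), df['sales'].mean(), df['sales'].std()],
    'quantity': [df['quantity'].sum(), df['quantity'].mean(), df['quantity'].std()],
}, index=['sum', 'mean', 'std'])
```

Both produce the same numbers; the agg() version states the intent in a single expression.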
Applying Different Functions to Different Columns
Real datasets have columns with different meanings. Summing quantities makes sense; summing product IDs doesn’t. The dictionary syntax lets you specify exactly which functions apply to which columns.
# Different functions for different columns
summary = df.agg({
'sales': 'sum',
'quantity': ['mean', 'max', 'min']
})
print(summary)
Output:
sales quantity
sum 910.0 NaN
mean NaN 13.2
max NaN 18.0
min NaN 11.0
Notice the NaN values: they appear because 'sum' was requested only for 'sales', while 'mean', 'max', and 'min' were requested only for 'quantity'. This sparse output can be awkward, but it accurately reflects your request.
For cleaner results when columns have different function sets, consider separate aggregations or use the named aggregation syntax covered later:
# More practical: aggregate with consistent functions per column
result = df.agg({
'sales': ['sum', 'mean', 'std'],
'quantity': ['sum', 'mean', 'std']
})
print(result)
This produces a complete matrix without NaN values, since both columns receive the same functions.
Using Custom Functions with Agg
Built-in functions cover common cases, but you’ll often need custom calculations. The agg() method accepts lambda functions, user-defined functions, and NumPy functions interchangeably.
# Custom function for range calculation
def value_range(x):
return x.max() - x.min()
# Lambda for 90th percentile
percentile_90 = lambda x: np.percentile(x, 90)
# Mix built-in and custom functions
custom_stats = df['sales'].agg([
'mean',
'std',
value_range,
percentile_90
])
print(custom_stats)
Output:
mean 182.000000
std 30.536863
value_range 75.000000
&lt;lambda&gt; 215.000000
Name: sales, dtype: float64
Pandas uses the function’s __name__ attribute for labeling. Lambda functions get labeled as <lambda>, which becomes problematic with multiple lambdas. Name your lambdas explicitly or use regular functions:
# Named lambda for clarity
percentile_90.__name__ = 'p90'
# Or define as a proper function
def p90(x):
"""90th percentile"""
return np.percentile(x, 90)
def p10(x):
"""10th percentile"""
return np.percentile(x, 10)
# Now the output is readable
percentile_stats = df['sales'].agg([p10, 'median', p90])
print(percentile_stats)
Custom functions receive the entire column (or group) as a Series, giving you full flexibility. Just ensure your function returns a scalar value for aggregation contexts.
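A minimal sketch of that contract (midrange is a made-up helper for illustration): the function is called once per group, receives that group's values as a Series, and must reduce them to a single scalar.

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'revenue': [100, 300, 200],
})

def midrange(s):
    """Midpoint between the min and max of a Series; returns one scalar."""
    return (s.min() + s.max()) / 2

# Called once per region with that region's 'revenue' Series
result = df.groupby('region')['revenue'].agg(midrange)
```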
Combining Agg with GroupBy
The real power of agg() emerges when combined with groupby(). This combination lets you compute multiple statistics for each group in a single, readable operation.
# Expanded dataset
df = pd.DataFrame({
'region': ['North', 'South', 'North', 'South', 'North', 'East', 'East'],
'category': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Electronics', 'Clothing', 'Electronics'],
'revenue': [1200, 1500, 800, 950, 1100, 700, 1300],
'orders': [10, 12, 8, 9, 11, 7, 13]
})
# Group by region and aggregate
regional_summary = df.groupby('region').agg({
'revenue': ['sum', 'mean', 'std'],
'orders': ['sum', 'count']
})
print(regional_summary)
Output:
revenue orders
sum mean std sum count
region
East 2000 1000.000000 424.264069 20 2
North 3100 1033.333333 208.166600 29 3
South 2450 1225.000000 388.908730 21 2
This produces a hierarchical column index (MultiIndex), with the original column name at the top level and function names below. It’s comprehensive but can be unwieldy for downstream processing.
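To make the tuple-based access concrete, here is a small sketch of selecting from such a MultiIndex result (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'revenue': [1200, 800, 950],
})

summary = df.groupby('region').agg({'revenue': ['sum', 'mean']})

# Columns are now tuples of (original column, function name)
sum_col = summary[('revenue', 'sum')]

# Or select every statistic for one original column at once
revenue_stats = summary['revenue']
```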
You can also group by multiple columns:
# Multi-level grouping
detailed = df.groupby(['region', 'category']).agg({
'revenue': ['sum', 'mean'],
'orders': 'sum'
})
print(detailed)
This creates both a row MultiIndex (for groups) and a column MultiIndex (for aggregations), giving you a complete cross-tabulation of your data.
Renaming Aggregated Columns
The MultiIndex columns from grouped aggregations are technically correct but annoying to work with. You access them with tuples like df[('revenue', 'sum')], which is verbose and error-prone.
Named aggregations solve this elegantly. Use keyword arguments where each key becomes the output column name:
# Named aggregations for clean output
clean_summary = df.groupby('region').agg(
total_revenue=('revenue', 'sum'),
avg_revenue=('revenue', 'mean'),
revenue_std=('revenue', 'std'),
total_orders=('orders', 'sum'),
order_count=('orders', 'count')
)
print(clean_summary)
Output:
total_revenue avg_revenue revenue_std total_orders order_count
region
East 2000 1000.000000 424.264069 20 2
North 3100 1033.333333 208.166600 29 3
South 2450 1225.000000 388.908730 21 2
The output has flat, descriptive column names. No tuple indexing required. This syntax uses tuples of (column, function) as values, with your desired output name as the keyword.
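The keyword syntax is shorthand for pandas' pd.NamedAgg, which can read more explicitly when you build aggregations programmatically; both forms below should produce identical results (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'revenue': [1200, 800, 950],
})

# Tuple shorthand
a = df.groupby('region').agg(total_revenue=('revenue', 'sum'))

# Equivalent explicit form using pd.NamedAgg
b = df.groupby('region').agg(
    total_revenue=pd.NamedAgg(column='revenue', aggfunc='sum')
)
```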
For custom functions with named aggregations:
def coefficient_of_variation(x):
return x.std() / x.mean() * 100
cv_summary = df.groupby('region').agg(
total_revenue=('revenue', 'sum'),
revenue_cv=('revenue', coefficient_of_variation)
)
print(cv_summary)
This approach produces production-ready DataFrames that integrate cleanly with visualization libraries, export functions, and downstream analysis.
Practical Tips and Common Pitfalls
Handling NaN Values
By default, aggregation functions skip NaN values. This is usually what you want, but be aware of it:
df_with_nan = df.copy()
df_with_nan.loc[0, 'revenue'] = np.nan
# NaN is excluded from calculations
print(df_with_nan.groupby('region')['revenue'].agg(['sum', 'mean', 'count']))
The count reflects only non-null values. If you need to count all rows regardless of nulls, use 'size' instead of 'count'.
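A quick demonstration of the difference, with one revenue value deliberately missing:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'revenue': [1200, np.nan, 950],
})

# 'count' ignores the NaN; 'size' counts every row in the group
counts = df.groupby('region')['revenue'].agg(['count', 'size'])
```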
Flattening Multi-Level Column Indices
When you can’t use named aggregations (perhaps you’re working with legacy code or dynamic function lists), flatten the MultiIndex manually:
# Create aggregation with MultiIndex columns
result = df.groupby('region').agg({
'revenue': ['sum', 'mean'],
'orders': ['sum', 'count']
})
# Flatten column names
result.columns = ['_'.join(col).strip() for col in result.columns.values]
print(result.columns)
Output:
Index(['revenue_sum', 'revenue_mean', 'orders_sum', 'orders_count'], dtype='object')
Now you can access columns normally: result['revenue_sum'].
Performance Considerations
Using agg() with multiple functions is faster than calling each aggregation separately because Pandas can optimize the iteration. However, custom Python functions are slower than built-in aggregations. When performance matters:
# Prefer string names over function references
df.agg(['sum', 'mean']) # Faster: uses optimized C code
# Avoid when possible
df.agg([np.sum, np.mean]) # Slower: Python function overhead
For large datasets, consider whether you truly need all requested statistics. Each additional function adds computation time, and some (like std) are more expensive than others (like sum).
Resetting the Index
Grouped aggregations produce the grouping columns as an index. For most downstream work, you’ll want them as regular columns:
final_result = df.groupby('region').agg(
total_revenue=('revenue', 'sum')
).reset_index()
print(final_result)
This gives you a clean DataFrame ready for merging, exporting, or further analysis.
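For instance, once the index is reset, the summary merges back onto the original rows like any other DataFrame; a sketch of a share-of-region calculation (the 'share' column is our illustrative addition):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'revenue': [1200, 800, 950],
})

summary = df.groupby('region').agg(
    total_revenue=('revenue', 'sum')
).reset_index()

# Attach each row's regional total, then compute its share of the region
merged = df.merge(summary, on='region')
merged['share'] = merged['revenue'] / merged['total_revenue']
```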
The agg() method transforms verbose, repetitive aggregation code into declarative specifications of what you want. Master its patterns, and you’ll write cleaner Pandas code that’s easier to read, maintain, and extend.