Pandas - GroupBy and Max/Min
Key Insights
- GroupBy operations with max/min are essential for finding extreme values within categorical data, enabling rapid identification of top performers, outliers, and boundary conditions across segments
- Pandas provides multiple approaches to retrieve maximum and minimum values: direct aggregation methods, idxmax()/idxmin() for index positions, and nlargest()/nsmallest() for top-n analysis within groups
- Combining groupby with max/min requires understanding how to handle multiple columns, preserve associated row data, and manage edge cases like ties and missing values
Basic GroupBy Max and Min Operations
The fundamental pattern for finding maximum and minimum values within groups starts with the groupby() method followed by max() or min() aggregation functions.
import pandas as pd
import numpy as np
# Sample sales data
data = {
'region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'revenue': [15000, 23000, 18000, 21000, 19000, 25000, 17000, 22000],
'units_sold': [150, 180, 160, 170, 155, 200, 145, 190]
}
df = pd.DataFrame(data)
# Find maximum revenue by region
max_revenue = df.groupby('region')['revenue'].max()
print(max_revenue)
# Output:
# region
# East 25000
# North 23000
# South 21000
# West 22000
# Find minimum units sold by region
min_units = df.groupby('region')['units_sold'].min()
print(min_units)
# Output:
# region
# East 155
# North 150
# South 160
# West 145
This approach returns a Series with the group keys as the index. For multiple columns, pass a list to the groupby selection:
# Get max and min for multiple columns
result = df.groupby('region')[['revenue', 'units_sold']].agg(['max', 'min'])
print(result)
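Passing a list of functions to agg produces MultiIndex columns such as ('revenue', 'max'), which can be awkward downstream. One common way to flatten them, sketched here on a trimmed copy of the sample data, is to join the two levels with an underscore:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'revenue': [15000, 23000, 18000, 21000],
    'units_sold': [150, 180, 160, 170],
})

result = df.groupby('region')[['revenue', 'units_sold']].agg(['max', 'min'])

# agg with a list of functions yields MultiIndex columns like ('revenue', 'max');
# joining the levels gives flat names such as 'revenue_max'
result.columns = ['_'.join(col) for col in result.columns]
print(result)
```

The flat names make the frame easier to merge, export, or plot than the two-level column index.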
Retrieving Complete Rows with Maximum/Minimum Values
Finding the max or min value is often insufficient—you typically need the entire row containing that extreme value. The idxmax() and idxmin() methods return the index labels of maximum and minimum values.
# Get index of maximum revenue per region
max_idx = df.groupby('region')['revenue'].idxmax()
print(max_idx)
# Output:
# region
# East 5
# North 1
# South 3
# West 7
# Retrieve complete rows with maximum revenue
max_revenue_rows = df.loc[max_idx]
print(max_revenue_rows)
This pattern is critical for maintaining data relationships. Here’s a practical example with employee data:
employees = pd.DataFrame({
'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Marketing', 'Marketing'],
'employee_id': [101, 102, 201, 202, 301, 302],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'salary': [75000, 82000, 95000, 88000, 70000, 73000],
'years_experience': [3, 5, 7, 4, 2, 3]
})
# Find highest paid employee in each department
highest_paid_idx = employees.groupby('department')['salary'].idxmax()
top_earners = employees.loc[highest_paid_idx]
print(top_earners[['department', 'name', 'salary']])
# Output (groupby sorts the group keys, so departments appear alphabetically):
#     department     name  salary
# 2  Engineering  Charlie   95000
# 5    Marketing    Frank   73000
# 1        Sales      Bob   82000
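An equivalent way to pull the full top row per group, without going through idxmax, is to sort and drop duplicate group keys. A minimal sketch on the same employee data (keep='last' pairs with the default ascending sort):

```python
import pandas as pd

employees = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Marketing', 'Marketing'],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'salary': [75000, 82000, 95000, 88000, 70000, 73000],
})

# Sort ascending by salary, then keep the last (highest-paid) row per department
top_earners = employees.sort_values('salary').drop_duplicates('department', keep='last')
print(top_earners)
```

This avoids a groupby entirely and is often convenient when you want the whole row anyway; note that on ties it keeps only one row per group.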
Multiple Grouping Columns and Aggregations
Real-world scenarios often require grouping by multiple columns and computing various aggregations simultaneously.
# Complex sales data
sales_data = pd.DataFrame({
'year': [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
'region': ['North', 'North', 'South', 'South', 'North', 'North', 'South', 'South'],
'revenue': [100000, 120000, 95000, 110000, 125000, 140000, 105000, 115000],
'expenses': [60000, 70000, 55000, 65000, 72000, 80000, 58000, 68000]
})
# Group by multiple columns and get max/min
result = sales_data.groupby(['year', 'region']).agg({
'revenue': ['max', 'min', 'mean'],
'expenses': ['max', 'min']
})
print(result)
For custom aggregation logic:
# Custom aggregation with named outputs
agg_result = sales_data.groupby(['year', 'region']).agg(
max_revenue=('revenue', 'max'),
min_revenue=('revenue', 'min'),
revenue_range=('revenue', lambda x: x.max() - x.min()),
max_expense_ratio=('expenses', lambda x: (x.max() / sales_data.loc[x.index, 'revenue'].max()))
)
print(agg_result)
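Named aggregation returns a DataFrame indexed by the (year, region) MultiIndex. If you want the grouping keys back as ordinary columns, chain reset_index, sketched here on a reduced copy of the sales data:

```python
import pandas as pd

sales_data = pd.DataFrame({
    'year': [2022, 2022, 2023, 2023],
    'region': ['North', 'South', 'North', 'South'],
    'revenue': [120000, 110000, 140000, 115000],
})

# reset_index turns the year/region MultiIndex back into regular columns,
# which is handy before merging or plotting
flat = sales_data.groupby(['year', 'region']).agg(
    max_revenue=('revenue', 'max'),
).reset_index()
print(flat)
```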
Top N Values Per Group
Beyond single maximum or minimum values, nlargest() and nsmallest() extract multiple extreme values within each group.
# Student test scores
scores = pd.DataFrame({
'subject': ['Math', 'Math', 'Math', 'Math', 'Science', 'Science', 'Science', 'Science'],
'student': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Bob', 'Charlie', 'David'],
'score': [95, 87, 92, 78, 88, 94, 85, 91]
})
# Top 2 scores per subject
top_2_per_subject = scores.groupby('subject').apply(
lambda x: x.nlargest(2, 'score')
).reset_index(drop=True)
print(top_2_per_subject)
# Output:
# subject student score
# 0 Math Alice 95
# 1 Math Charlie 92
# 2 Science Bob 94
# 3 Science David 91
# Alternative using sort and head
top_2_alternative = scores.sort_values('score', ascending=False).groupby('subject').head(2)
print(top_2_alternative)
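Both approaches above return exactly two rows per group, silently dropping any score tied at the cutoff. A rank-based filter keeps every tied row; a sketch with a deliberate tie in the Math scores:

```python
import pandas as pd

scores = pd.DataFrame({
    'subject': ['Math', 'Math', 'Math', 'Science', 'Science', 'Science'],
    'student': ['Alice', 'Bob', 'Charlie', 'Dana', 'Evan', 'Fay'],
    'score': [95, 92, 92, 94, 91, 85],
})

# rank(method='min', ascending=False) gives tied scores the same (best) rank,
# so a tie at rank 2 keeps both rows instead of arbitrarily dropping one
ranks = scores.groupby('subject')['score'].rank(method='min', ascending=False)
top_2_with_ties = scores[ranks <= 2]
print(top_2_with_ties)
```

Here Math returns three rows (95 plus the two tied 92s) while Science returns two.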
Handling Edge Cases and Missing Data
Missing values and ties require careful handling to prevent unexpected results.
# Data with missing values and ties
data_with_na = pd.DataFrame({
'category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
'value': [10, np.nan, 15, 20, 20, 18, np.nan, np.nan],
'metric': [100, 200, 150, 300, 300, 250, 400, 450]
})
# Max ignores NaN by default
max_values = data_with_na.groupby('category')['value'].max()
print(max_values)
# Output:
# category
# A 15.0
# B 20.0
# C NaN
# max/min skip NaN by default, so an all-NaN group yields NaN.
# Fill or drop missing values beforehand to guarantee a numeric result
max_filled = data_with_na.fillna({'value': 0}).groupby('category')['value'].max()
# For ties, idxmax returns first occurrence
ties_idx = data_with_na.groupby('category')['metric'].idxmax()
print(data_with_na.loc[ties_idx])
# Get all rows with maximum value (including ties)
def get_all_max_rows(group):
max_val = group['metric'].max()
return group[group['metric'] == max_val]
all_max_rows = data_with_na.groupby('category').apply(get_all_max_rows).reset_index(drop=True)
print(all_max_rows)
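The same all-ties result can be had without groupby.apply: transform('max') broadcasts each group's maximum back onto the original rows, so a plain boolean mask does the filtering. A sketch on a reduced version of the data:

```python
import pandas as pd

data = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B'],
    'metric': [100, 150, 300, 300, 250],
})

# transform('max') returns a Series aligned with the original index,
# so the mask keeps every row equal to its group's max, ties included
group_max = data.groupby('category')['metric'].transform('max')
all_max_rows = data[data['metric'] == group_max]
print(all_max_rows)
```

The transform route also preserves the original index and tends to be faster than apply on large frames.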
Performance Optimization for Large Datasets
When working with large datasets, optimization strategies become critical.
# Create large dataset
np.random.seed(42)
large_df = pd.DataFrame({
'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], 1000000),
'subcategory': np.random.choice(range(100), 1000000),
'value': np.random.randn(1000000),
'amount': np.random.randint(1, 1000, 1000000)
})
# Method 1: Direct groupby (faster for simple aggregations)
%timeit large_df.groupby('category')['value'].max()
# Method 2: Using transform for filtering
max_values = large_df.groupby('category')['value'].transform('max')
result = large_df[large_df['value'] == max_values]
# Method 3: Sort and drop duplicates (efficient for getting complete rows)
%timeit large_df.sort_values('value', ascending=False).drop_duplicates('category')
# For multiple groups, consider categorical dtype
large_df['category'] = large_df['category'].astype('category')
%timeit large_df.groupby('category')['value'].max()
Conditional Max/Min with Filtering
Combining groupby operations with conditional logic enables sophisticated analysis.
# Sales data with conditions
conditional_data = pd.DataFrame({
'store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'product_type': ['Electronics', 'Clothing', 'Electronics',
'Electronics', 'Clothing', 'Food',
'Electronics', 'Clothing', 'Food'],
'revenue': [5000, 3000, 5500, 4800, 3200, 2000, 5200, 3100, 2100],
'date': pd.date_range('2023-01-01', periods=9, freq='D')
})
# Max revenue for Electronics only, by store
electronics_max = conditional_data[
conditional_data['product_type'] == 'Electronics'
].groupby('store')['revenue'].max()
print(electronics_max)
# Max revenue per store where revenue > 3000
filtered_max = conditional_data[
conditional_data['revenue'] > 3000
].groupby('store')['revenue'].agg(['max', 'count'])
print(filtered_max)
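Filtering before the groupby drops any store with no qualifying rows from the result entirely. If every store should still appear, one option is to mask the values with where instead of dropping rows; stores with no match then show NaN. A sketch with made-up numbers:

```python
import pandas as pd

conditional_data = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B', 'C'],
    'product_type': ['Electronics', 'Clothing', 'Clothing', 'Food', 'Electronics'],
    'revenue': [5000, 3000, 3200, 2000, 5200],
})

# where() replaces non-Electronics revenue with NaN instead of removing rows,
# so every store survives into the grouped result
masked = conditional_data['revenue'].where(
    conditional_data['product_type'] == 'Electronics'
)
electronics_max = masked.groupby(conditional_data['store']).max()
print(electronics_max)
```

Store B has no Electronics sales, so it appears with NaN rather than vanishing from the output.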
These patterns form the foundation for extracting meaningful insights from grouped data, enabling you to identify extremes, outliers, and key data points across categorical segments efficiently.