Pandas - GroupBy and Max/Min
Key Insights
- GroupBy operations with max/min are essential for finding extreme values within categorical data, enabling rapid identification of top performers, outliers, and boundary conditions across segments
- Pandas provides multiple approaches to retrieve maximum and minimum values: direct aggregation methods, idxmax()/idxmin() for index positions, and nlargest()/nsmallest() for top-n analysis within groups
- Combining groupby with max/min requires understanding how to handle multiple columns, preserve associated row data, and manage edge cases like ties and missing values
Basic GroupBy Max and Min Operations
The fundamental pattern for finding maximum and minimum values within groups starts with the groupby() method followed by max() or min() aggregation functions.
import pandas as pd
import numpy as np
# Sample sales data
data = {
'region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'revenue': [15000, 23000, 18000, 21000, 19000, 25000, 17000, 22000],
'units_sold': [150, 180, 160, 170, 155, 200, 145, 190]
}
df = pd.DataFrame(data)
# Find maximum revenue by region
max_revenue = df.groupby('region')['revenue'].max()
print(max_revenue)
# Output:
# region
# East 25000
# North 23000
# South 21000
# West 22000
# Find minimum units sold by region
min_units = df.groupby('region')['units_sold'].min()
print(min_units)
# Output:
# region
# East 155
# North 150
# South 160
# West 145
This approach returns a Series with the group keys as the index. For multiple columns, pass a list to the groupby selection:
# Get max and min for multiple columns
result = df.groupby('region')[['revenue', 'units_sold']].agg(['max', 'min'])
print(result)
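Passing a list of functions to agg produces MultiIndex columns such as ('revenue', 'max'), which can be awkward downstream. One common way to flatten them, sketched here on a trimmed copy of the sample data, is to join the two levels with an underscore:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'revenue': [15000, 23000, 18000, 21000],
    'units_sold': [150, 180, 160, 170],
})

result = df.groupby('region')[['revenue', 'units_sold']].agg(['max', 'min'])

# agg with a list of functions yields MultiIndex columns like ('revenue', 'max');
# joining the levels gives flat names such as 'revenue_max'
result.columns = ['_'.join(col) for col in result.columns]
print(result)
```

The flat names make the frame easier to merge, export, or plot than the two-level column index.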
Retrieving Complete Rows with Maximum/Minimum Values
Finding the max or min value is often insufficient—you typically need the entire row containing that extreme value. The idxmax() and idxmin() methods return the index labels of maximum and minimum values.
# Get index of maximum revenue per region
max_idx = df.groupby('region')['revenue'].idxmax()
print(max_idx)
# Output:
# region
# East 5
# North 1
# South 3
# West 7
# Retrieve complete rows with maximum revenue
max_revenue_rows = df.loc[max_idx]
print(max_revenue_rows)
This pattern is critical for maintaining data relationships. Here’s a practical example with employee data:
employees = pd.DataFrame({
'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Marketing', 'Marketing'],
'employee_id': [101, 102, 201, 202, 301, 302],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'salary': [75000, 82000, 95000, 88000, 70000, 73000],
'years_experience': [3, 5, 7, 4, 2, 3]
})
# Find highest paid employee in each department
highest_paid_idx = employees.groupby('department')['salary'].idxmax()
top_earners = employees.loc[highest_paid_idx]
print(top_earners[['department', 'name', 'salary']])
# Output (groupby sorts the group keys, so departments appear alphabetically):
#     department     name  salary
# 2  Engineering  Charlie   95000
# 5    Marketing    Frank   73000
# 1        Sales      Bob   82000
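An equivalent way to pull the full top row per group, without going through idxmax, is to sort and drop duplicate group keys. A minimal sketch on the same employee data (keep='last' pairs with the default ascending sort):

```python
import pandas as pd

employees = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Marketing', 'Marketing'],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'salary': [75000, 82000, 95000, 88000, 70000, 73000],
})

# Sort ascending by salary, then keep the last (highest-paid) row per department
top_earners = employees.sort_values('salary').drop_duplicates('department', keep='last')
print(top_earners)
```

This avoids a groupby entirely and is often convenient when you want the whole row anyway; note that on ties it keeps only one row per group.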
Multiple Grouping Columns and Aggregations
Real-world scenarios often require grouping by multiple columns and computing various aggregations simultaneously.
# Complex sales data
sales_data = pd.DataFrame({
'year': [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
'region': ['North', 'North', 'South', 'South', 'North', 'North', 'South', 'South'],
'revenue': [100000, 120000, 95000, 110000, 125000, 140000, 105000, 115000],
'expenses': [60000, 70000, 55000, 65000, 72000, 80000, 58000, 68000]
})
# Group by multiple columns and get max/min
result = sales_data.groupby(['year', 'region']).agg({
'revenue': ['max', 'min', 'mean'],
'expenses': ['max', 'min']
})
print(result)
For custom aggregation logic:
# Custom aggregation with named outputs
agg_result = sales_data.groupby(['year', 'region']).agg(
max_revenue=('revenue', 'max'),
min_revenue=('revenue', 'min'),
revenue_range=('revenue', lambda x: x.max() - x.min()),
max_expense_ratio=('expenses', lambda x: (x.max() / sales_data.loc[x.index, 'revenue'].max()))
)
print(agg_result)
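Named aggregation returns a DataFrame indexed by the (year, region) MultiIndex. If you want the grouping keys back as ordinary columns, chain reset_index, sketched here on a reduced copy of the sales data:

```python
import pandas as pd

sales_data = pd.DataFrame({
    'year': [2022, 2022, 2023, 2023],
    'region': ['North', 'South', 'North', 'South'],
    'revenue': [120000, 110000, 140000, 115000],
})

# reset_index turns the year/region MultiIndex back into regular columns,
# which is handy before merging or plotting
flat = sales_data.groupby(['year', 'region']).agg(
    max_revenue=('revenue', 'max'),
).reset_index()
print(flat)
```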
Top N Values Per Group
Beyond single maximum or minimum values, nlargest() and nsmallest() extract multiple extreme values within each group.
# Student test scores
scores = pd.DataFrame({
'subject': ['Math', 'Math', 'Math', 'Math', 'Science', 'Science', 'Science', 'Science'],
'student': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Bob', 'Charlie', 'David'],
'score': [95, 87, 92, 78, 88, 94, 85, 91]
})
# Top 2 scores per subject
top_2_per_subject = scores.groupby('subject').apply(
lambda x: x.nlargest(2, 'score')
).reset_index(drop=True)
print(top_2_per_subject)
# Output:
# subject student score
# 0 Math Alice 95
# 1 Math Charlie 92
# 2 Science Bob 94
# 3 Science David 91
# Alternative using sort and head
top_2_alternative = scores.sort_values('score', ascending=False).groupby('subject').head(2)
print(top_2_alternative)
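Both approaches above return exactly two rows per group, silently dropping any score tied at the cutoff. A rank-based filter keeps every tied row; a sketch with a deliberate tie in the Math scores:

```python
import pandas as pd

scores = pd.DataFrame({
    'subject': ['Math', 'Math', 'Math', 'Science', 'Science', 'Science'],
    'student': ['Alice', 'Bob', 'Charlie', 'Dana', 'Evan', 'Fay'],
    'score': [95, 92, 92, 94, 91, 85],
})

# rank(method='min', ascending=False) gives tied scores the same (best) rank,
# so a tie at rank 2 keeps both rows instead of arbitrarily dropping one
ranks = scores.groupby('subject')['score'].rank(method='min', ascending=False)
top_2_with_ties = scores[ranks <= 2]
print(top_2_with_ties)
```

Here Math returns three rows (95 plus the two tied 92s) while Science returns two.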
Handling Edge Cases and Missing Data
Missing values and ties require careful handling to prevent unexpected results.
# Data with missing values and ties
data_with_na = pd.DataFrame({
'category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
'value': [10, np.nan, 15, 20, 20, 18, np.nan, np.nan],
'metric': [100, 200, 150, 300, 300, 250, 400, 450]
})
# Max ignores NaN by default
max_values = data_with_na.groupby('category')['value'].max()
print(max_values)
# Output:
# category
# A 15.0
# B 20.0
# C NaN
# max/min skip NaN by default, so an all-NaN group yields NaN.
# Fill or drop missing values beforehand to guarantee a numeric result
max_filled = data_with_na.fillna({'value': 0}).groupby('category')['value'].max()
# For ties, idxmax returns first occurrence
ties_idx = data_with_na.groupby('category')['metric'].idxmax()
print(data_with_na.loc[ties_idx])
# Get all rows with maximum value (including ties)
def get_all_max_rows(group):
max_val = group['metric'].max()
return group[group['metric'] == max_val]
all_max_rows = data_with_na.groupby('category').apply(get_all_max_rows).reset_index(drop=True)
print(all_max_rows)
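The same all-ties result can be had without groupby.apply: transform('max') broadcasts each group's maximum back onto the original rows, so a plain boolean mask does the filtering. A sketch on a reduced version of the data:

```python
import pandas as pd

data = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B'],
    'metric': [100, 150, 300, 300, 250],
})

# transform('max') returns a Series aligned with the original index,
# so the mask keeps every row equal to its group's max, ties included
group_max = data.groupby('category')['metric'].transform('max')
all_max_rows = data[data['metric'] == group_max]
print(all_max_rows)
```

The transform route also preserves the original index and tends to be faster than apply on large frames.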
Performance Optimization for Large Datasets
When working with large datasets, optimization strategies become critical.
# Create large dataset
np.random.seed(42)
large_df = pd.DataFrame({
'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], 1000000),
'subcategory': np.random.choice(range(100), 1000000),
'value': np.random.randn(1000000),
'amount': np.random.randint(1, 1000, 1000000)
})
# Method 1: Direct groupby (faster for simple aggregations)
%timeit large_df.groupby('category')['value'].max()
# Method 2: Using transform for filtering
max_values = large_df.groupby('category')['value'].transform('max')
result = large_df[large_df['value'] == max_values]
# Method 3: Sort and drop duplicates (efficient for getting complete rows)
%timeit large_df.sort_values('value', ascending=False).drop_duplicates('category')
# For multiple groups, consider categorical dtype
large_df['category'] = large_df['category'].astype('category')
%timeit large_df.groupby('category')['value'].max()
Conditional Max/Min with Filtering
Combining groupby operations with conditional logic enables sophisticated analysis.
# Sales data with conditions
conditional_data = pd.DataFrame({
'store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'product_type': ['Electronics', 'Clothing', 'Electronics',
'Electronics', 'Clothing', 'Food',
'Electronics', 'Clothing', 'Food'],
'revenue': [5000, 3000, 5500, 4800, 3200, 2000, 5200, 3100, 2100],
'date': pd.date_range('2023-01-01', periods=9, freq='D')
})
# Max revenue for Electronics only, by store
electronics_max = conditional_data[
conditional_data['product_type'] == 'Electronics'
].groupby('store')['revenue'].max()
print(electronics_max)
# Max revenue per store where revenue > 3000
filtered_max = conditional_data[
conditional_data['revenue'] > 3000
].groupby('store')['revenue'].agg(['max', 'count'])
print(filtered_max)
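Filtering before the groupby drops any store with no qualifying rows from the result entirely. If every store should still appear, one option is to mask the values with where instead of dropping rows; stores with no match then show NaN. A sketch with made-up numbers:

```python
import pandas as pd

conditional_data = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B', 'C'],
    'product_type': ['Electronics', 'Clothing', 'Clothing', 'Food', 'Electronics'],
    'revenue': [5000, 3000, 3200, 2000, 5200],
})

# where() replaces non-Electronics revenue with NaN instead of removing rows,
# so every store survives into the grouped result
masked = conditional_data['revenue'].where(
    conditional_data['product_type'] == 'Electronics'
)
electronics_max = masked.groupby(conditional_data['store']).max()
print(electronics_max)
```

Store B has no Electronics sales, so it appears with NaN rather than vanishing from the output.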
These patterns form the foundation for extracting meaningful insights from grouped data, enabling you to identify extremes, outliers, and key data points across categorical segments efficiently.