Pandas - GroupBy and Count
Key Insights
• GroupBy operations in Pandas enable efficient data aggregation by splitting data into groups based on categorical variables, applying functions, and combining results into a structured output
• The size(), count(), and value_counts() methods each serve distinct purposes when counting grouped data—understanding their differences prevents common analytical errors
• Combining GroupBy with multi-level indexing, named aggregations, and transformation functions unlocks advanced analytical patterns for real-world data processing tasks
Understanding GroupBy Mechanics
The GroupBy operation implements the split-apply-combine pattern. Pandas splits your DataFrame into groups based on one or more keys, applies a function to each group independently, and combines the results into a new data structure.
import pandas as pd
import numpy as np
# Sample sales data
data = {
'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
'sales': [100, 150, 200, np.nan, 175, 225, 130],
'units': [10, 15, 20, 25, 17, 22, 13]
}
df = pd.DataFrame(data)
print(df)
When you call groupby(), Pandas doesn’t immediately compute anything. It returns a GroupBy object that holds the grouping logic:
grouped = df.groupby('region')
print(type(grouped)) # <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
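You can peek inside that lazy object before any aggregation runs. A minimal sketch reusing the sample frame above: `.groups` maps each key to the row labels it contains, and `get_group()` materializes a single group as a regular DataFrame.

```python
import pandas as pd
import numpy as np

# Rebuild the sample frame from above
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, np.nan, 175, 225, 130],
    'units': [10, 15, 20, 25, 17, 22, 13]
})

grouped = df.groupby('region')

# .groups maps each group key to the row labels it contains
print(grouped.groups['North'])

# get_group() materializes one group as a regular DataFrame
north = grouped.get_group('North')
print(north)
```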
size() vs count() vs value_counts()
These three methods appear similar but handle missing values and return structures differently.
size() counts all rows in each group, including NaN values:
# Count total rows per region (includes NaN)
result = df.groupby('region').size()
print(result)
# region
# East 2
# North 3
# South 2
# dtype: int64
count() counts non-null values for each column in each group:
# Count non-null values per column per region
result = df.groupby('region').count()
print(result)
# product sales units
# region
# East 2 2 2
# North 3 3 3
# South 2 1 2
Notice that South region shows only 1 for sales because one value is NaN.
value_counts() on a GroupBy object counts unique combinations:
# Count occurrences of each product within each region
result = df.groupby('region')['product'].value_counts()
print(result)
# region product
# East A 1
# B 1
# North A 2
# B 1
# South A 1
# B 1
# Name: product, dtype: int64
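`value_counts()` also accepts `normalize=True`, which converts raw counts into within-group proportions. A sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
})

# normalize=True yields each product's share within its region
props = df.groupby('region')['product'].value_counts(normalize=True)
print(props)
```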
Grouping by Multiple Columns
Real-world analysis often requires grouping by multiple dimensions:
# Count combinations of region and product
result = df.groupby(['region', 'product']).size()
print(result)
# region product
# East A 1
# B 1
# North A 2
# B 1
# South A 1
# B 1
# dtype: int64
Convert multi-index results to a DataFrame for easier manipulation:
result = df.groupby(['region', 'product']).size().reset_index(name='count')
print(result)
# region product count
# 0 East A 1
# 1 East B 1
# 2 North A 2
# 3 North B 1
# 4 South A 1
# 5 South B 1
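Alternatively, `unstack()` pivots the inner index level into columns, producing a region-by-product count table; `fill_value=0` fills combinations that never occur (in this sample, every combination occurs at least once):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
})

# Pivot the 'product' index level into columns; absent combinations become 0
counts = df.groupby(['region', 'product']).size().unstack(fill_value=0)
print(counts)
```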
Named Aggregations with agg()
The agg() method provides fine-grained control over aggregations with named outputs:
result = df.groupby('region').agg(
total_rows=('product', 'size'),
non_null_sales=('sales', 'count'),
total_sales=('sales', 'sum'),
avg_units=('units', 'mean')
)
print(result)
# total_rows non_null_sales total_sales avg_units
# region
# East 2 2 400.0 19.50
# North 3 3 380.0 12.67
# South 2 1 200.0 22.50
Apply multiple aggregations to the same column:
result = df.groupby('product').agg(
count=('sales', 'count'),
total=('sales', 'sum'),
mean=('sales', 'mean'),
std=('sales', 'std')
)
print(result)
# count total mean std
# product
# A 4 605.0 151.25 44.79
# B 2 375.0 187.50 53.03
Conditional Counting with Custom Functions
Use custom functions within GroupBy for complex counting logic:
def count_high_value(series, threshold=150):
"""Count values above threshold"""
return (series > threshold).sum()
result = df.groupby('region')['sales'].agg([
'count',
('high_value_count', lambda x: count_high_value(x, 150))
])
print(result)
# count high_value_count
# region
# East 2 2
# North 3 0
# South 1 1
Filtering Groups Based on Count
Filter groups that meet specific count criteria:
# Keep only regions with more than 2 transactions
result = df.groupby('region').filter(lambda x: len(x) > 2)
print(result)
# region product sales units
# 0 North A 100.0 10
# 1 North B 150.0 15
# 6 North A 130.0 13
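An equivalent boolean-mask approach uses `transform('size')`, which broadcasts each group's row count back onto every row of the original frame; it can be faster than `filter()` on large frames because it avoids calling a Python lambda once per group. A sketch on the same data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, np.nan, 175, 225, 130],
    'units': [10, 15, 20, 25, 17, 22, 13]
})

# transform('size') aligns each group's row count with the original index
mask = df.groupby('region')['region'].transform('size') > 2
result = df[mask]
print(result)
```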
Combine filtering with aggregation:
# Get counts only for regions with more than 1 product type
result = (df.groupby('region')
.filter(lambda x: x['product'].nunique() > 1)
.groupby('region')
.size())
print(result)
# region
# East 2
# North 3
# South 2
# dtype: int64
Handling Missing Data in Counts
Control how GroupBy handles NaN values:
# Default: dropna=True (excludes NaN from grouping keys)
df_with_nan = df.copy()
df_with_nan.loc[2, 'region'] = np.nan
result = df_with_nan.groupby('region').size()
print(result)
# region
# East 2
# North 3
# South 1
# dtype: int64
# Include NaN as a group
result = df_with_nan.groupby('region', dropna=False).size()
print(result)
# region
# East 2
# North 3
# South 1
# NaN 1
# dtype: int64
Performance Optimization for Large Datasets
When working with large datasets, optimize GroupBy operations:
# Use categorical data types for grouping columns
df['region'] = df['region'].astype('category')
df['product'] = df['product'].astype('category')
# This reduces memory usage and speeds up grouping
# In an IPython/Jupyter session, benchmark with: %timeit df.groupby('region').size()
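The memory savings are easy to verify on a larger frame. A sketch using a hypothetical 100,000-row column of repeated labels:

```python
import pandas as pd

# Hypothetical column: 100,000 rows drawn from four region labels
big = pd.Series(['North', 'South', 'East', 'West'] * 25_000, name='region')

# object dtype stores one Python string per row;
# category stores small integer codes plus the four unique labels
obj_bytes = big.memory_usage(deep=True)
cat_bytes = big.astype('category').memory_usage(deep=True)
print(obj_bytes, cat_bytes)
```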
To count total and distinct values per group in one pass:
# One agg call computes both count and nunique
result = df.groupby('region')['product'].agg(['count', 'nunique'])
print(result)
# count nunique
# region
# East 2 2
# North 3 2
# South 2 2
Pivot Tables as GroupBy Alternative
Pivot tables provide an alternative syntax for counting grouped data:
# Count occurrences using pivot_table
result = pd.pivot_table(
df,
values='sales',
index='region',
columns='product',
aggfunc='count',
fill_value=0
)
print(result)
# product A B
# region
# East 1 1
# North 2 1
# South 1 0
South shows 0 under product B because aggfunc='count' tallies non-null sales values, and the single South/B row has a missing sale.
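For pure row counts, `pd.crosstab` is even more direct. Unlike a pivot_table that aggregates `values='sales'` with `aggfunc='count'`, crosstab counts rows rather than non-null values, so a row whose sales figure is NaN still contributes to its cell:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, np.nan, 175, 225, 130],
})

# crosstab counts rows, regardless of NaN in other columns
counts = pd.crosstab(df['region'], df['product'])
print(counts)
```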
Cumulative Counts Within Groups
Calculate running counts within each group:
df_sorted = df.sort_values(['region', 'product'])
df_sorted['cumulative_count'] = df_sorted.groupby('region').cumcount() + 1
print(df_sorted[['region', 'product', 'cumulative_count']])
# region product cumulative_count
# 4 East A 1
# 5 East B 2
# 0 North A 1
# 6 North A 2
# 1 North B 3
# 2 South A 1
# 3 South B 2
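The same idea extends to finer groupings: numbering rows within each (region, product) pair flags repeated combinations, which is handy for de-duplication. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
})

# Number each row within its (region, product) combination
df['occurrence'] = df.groupby(['region', 'product']).cumcount() + 1
print(df)
```

Here only the second North/A row receives an occurrence of 2; every other combination appears once.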
GroupBy counting operations form the foundation of data analysis workflows. Master these patterns to efficiently aggregate, filter, and transform grouped data in production environments.