How to GroupBy and Count in Pandas
Key Insights
- Use size() when you need total row counts including NaN values; use count() when you want to exclude missing data from your tallies
- value_counts() is the fastest path to frequency counts for a single column, but groupby() offers more flexibility for multi-column analysis and combining with other aggregations
- Always call reset_index() on grouped results when you need a clean DataFrame for further analysis or export; grouped objects return a Series with a MultiIndex by default
Introduction
Counting things is the foundation of data analysis. Before you build models or create visualizations, you need to understand what’s in your data: How many orders per customer? How many defects per production line? How many transactions per day?
Pandas provides several methods for grouping and counting data, each with specific strengths. The problem is that newcomers often pick the wrong tool, leading to unexpected results—especially when NaN values are involved. This article cuts through the confusion and gives you a clear framework for choosing the right counting method.
We’ll cover the core approaches: size(), count(), value_counts(), and aggregation patterns. By the end, you’ll know exactly which method to reach for in any situation.
Basic GroupBy with size() vs count()
The most common mistake in Pandas counting is conflating size() and count(). They look similar but behave differently with missing data.
size() counts all rows, including those with NaN values. It returns the total number of rows in each group.
count() counts non-null values in each column. It excludes NaN entries from the tally.
Here’s the difference in action:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'South'],
    'sales_rep': ['Alice', 'Bob', np.nan, 'Charlie', 'Diana'],
    'revenue': [1000, 1500, 800, np.nan, 1200]
})
print("Original DataFrame:")
print(df)
print()
# size() counts ALL rows per group
print("Using size():")
print(df.groupby('region').size())
print()
# count() counts non-null values per column
print("Using count():")
print(df.groupby('region').count())
Output:
Original DataFrame:
  region sales_rep  revenue
0  North     Alice   1000.0
1  North       Bob   1500.0
2  South       NaN    800.0
3  South   Charlie      NaN
4  South     Diana   1200.0

Using size():
region
North    2
South    3
dtype: int64

Using count():
        sales_rep  revenue
region
North           2        2
South           2        2
Notice that size() returns a Series with the total row count (South has 3 rows), while count() returns a DataFrame showing non-null counts per column (South has only 2 non-null values in both sales_rep and revenue).
When to use each:

- Use size() when you want the total number of records regardless of data quality
- Use count() when you need to know how many complete values exist in specific columns
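One practical consequence of this difference: subtracting the count() result from the size() result gives the number of missing values per column in each group. A minimal sketch, rebuilding the DataFrame from above so it runs standalone:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'South'],
    'sales_rep': ['Alice', 'Bob', np.nan, 'Charlie', 'Diana'],
    'revenue': [1000, 1500, 800, np.nan, 1200]
})

rows_per_group = df.groupby('region').size()       # total rows per group
nonnull_per_group = df.groupby('region').count()   # non-null values per column

# rsub computes rows_per_group - nonnull_per_group, aligned on the group index
missing_per_group = nonnull_per_group.rsub(rows_per_group, axis=0)
print(missing_per_group)
```

This makes the South group's one missing sales_rep and one missing revenue value visible at a glance, which is a quick data-quality check before any aggregation.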
Counting with value_counts()
For single-column frequency counts, value_counts() is the most direct approach. It’s essentially a shortcut for groupby().size() with automatic sorting.
df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Widget', 'Gizmo', 'Widget', 'Gadget'],
    'status': ['shipped', 'pending', 'shipped', 'shipped', 'returned', 'shipped']
})
# Basic frequency count
print("Product counts:")
print(df['product'].value_counts())
print()
# Get proportions instead of counts
print("Product proportions:")
print(df['product'].value_counts(normalize=True))
print()
# Sort by index instead of count
print("Sorted alphabetically:")
print(df['product'].value_counts().sort_index())
print()
# Include NaN values in the count
df_with_nan = df.copy()
df_with_nan.loc[0, 'product'] = np.nan
print("With dropna=False:")
print(df_with_nan['product'].value_counts(dropna=False))
Output:
Product counts:
product
Widget    3
Gadget    2
Gizmo     1
Name: count, dtype: int64

Product proportions:
product
Widget    0.500000
Gadget    0.333333
Gizmo     0.166667
Name: proportion, dtype: float64

Sorted alphabetically:
product
Gadget    2
Gizmo     1
Widget    3
Name: count, dtype: int64

With dropna=False:
product
Widget    2
Gadget    2
NaN       1
Gizmo     1
Name: count, dtype: int64
The normalize=True parameter is particularly useful for understanding distributions. The dropna=False parameter ensures you don’t accidentally hide missing data patterns.
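The shortcut relationship mentioned earlier can be checked directly: value_counts() produces the same numbers as groupby().size() followed by a descending sort. A small sketch, assuming the df from this section:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Widget', 'Gizmo', 'Widget', 'Gadget'],
    'status': ['shipped', 'pending', 'shipped', 'shipped', 'returned', 'shipped']
})

# value_counts() is essentially groupby().size() plus a descending sort
vc = df['product'].value_counts()
gb = df.groupby('product').size().sort_values(ascending=False)

print(vc.to_dict() == gb.to_dict())
```

Knowing this equivalence helps when you outgrow value_counts(): the same logic scales to multiple grouping columns by switching to the groupby() form.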
Grouping by Multiple Columns
Real analysis often requires grouping by multiple dimensions. Pass a list of column names to groupby() to create hierarchical groups.
df = pd.DataFrame({
    'year': [2023, 2023, 2023, 2024, 2024, 2024],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q1', 'Q1', 'Q2'],
    'region': ['North', 'South', 'North', 'North', 'South', 'South'],
    'orders': [150, 200, 175, 180, 220, 195]
})
# Group by multiple columns
grouped = df.groupby(['year', 'quarter']).size()
print("Grouped counts (Series with MultiIndex):")
print(grouped)
print()
# Convert to a clean DataFrame
grouped_df = df.groupby(['year', 'quarter']).size().reset_index(name='count')
print("As a DataFrame:")
print(grouped_df)
print()
# Three-level grouping
detailed = df.groupby(['year', 'quarter', 'region']).size().reset_index(name='count')
print("Three-level grouping:")
print(detailed)
Output:
Grouped counts (Series with MultiIndex):
year  quarter
2023  Q1         2
      Q2         1
2024  Q1         2
      Q2         1
dtype: int64

As a DataFrame:
   year quarter  count
0  2023      Q1      2
1  2023      Q2      1
2  2024      Q1      2
3  2024      Q2      1

Three-level grouping:
   year quarter region  count
0  2023      Q1  North      1
1  2023      Q1  South      1
2  2023      Q2  North      1
3  2024      Q1  North      1
4  2024      Q1  South      1
5  2024      Q2  South      1
The reset_index(name='count') pattern is essential. Without it, you get a Series with a MultiIndex, which is harder to work with for most downstream tasks. The name parameter gives your count column a meaningful label.
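A related shortcut worth knowing: in recent pandas versions, passing as_index=False to groupby() makes size() return a flat DataFrame directly, with the count in a column named size, so the reset_index() step can be skipped. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2023, 2023, 2023, 2024, 2024, 2024],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q1', 'Q1', 'Q2'],
    'region': ['North', 'South', 'North', 'North', 'South', 'South'],
    'orders': [150, 200, 175, 180, 220, 195]
})

# as_index=False yields a DataFrame with a 'size' column, no MultiIndex
flat = df.groupby(['year', 'quarter'], as_index=False).size()
print(flat)
```

The trade-off is that the count column is always named size; reset_index(name='count') remains the way to pick your own label.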
Conditional Counting with Aggregation
When you need counts alongside other metrics, or when you need conditional counts, use agg().
df = pd.DataFrame({
    'category': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Electronics'],
    'product_id': [101, 102, 201, 202, 103],
    'revenue': [500, 750, 120, 85, 620],
    'returned': [False, True, False, False, True]
})
# Count alongside other aggregations
summary = df.groupby('category').agg(
    product_count=('product_id', 'count'),
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean')
)
print("Multiple aggregations:")
print(summary)
print()
# Conditional counting: count returns per category
returns_count = df.groupby('category').agg(
    total_products=('product_id', 'count'),
    returned_products=('returned', 'sum')  # True=1, False=0
)
returns_count['return_rate'] = returns_count['returned_products'] / returns_count['total_products']
print("With conditional count:")
print(returns_count)
Output:
Multiple aggregations:
             product_count  total_revenue  avg_revenue
category
Clothing                 2            205   102.500000
Electronics              3           1870   623.333333

With conditional count:
             total_products  returned_products  return_rate
category
Clothing                  2                  0     0.000000
Electronics               3                  2     0.666667
The named aggregation syntax column_name=('source_column', 'function') keeps your output clean and self-documenting. For conditional counts, leverage boolean columns—summing them gives you the count of True values.
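The boolean-sum trick extends to any condition: build the boolean Series inside a lambda and sum it. A sketch on the same data, using a hypothetical revenue threshold of 400 (the threshold and the over_400 column name are illustrative, not from the original example):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Electronics'],
    'product_id': [101, 102, 201, 202, 103],
    'revenue': [500, 750, 120, 85, 620],
    'returned': [False, True, False, False, True]
})

# Count rows meeting an arbitrary condition: sum a boolean Series in the lambda
high_rev = df.groupby('category').agg(
    total=('product_id', 'count'),
    over_400=('revenue', lambda s: (s > 400).sum())
)
print(high_rev)
```

This keeps the conditional count in the same agg() call as the plain count, so the whole summary stays in one grouped pass.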
Reshaping Count Results
Grouped counts often need reshaping for reports or visualization. Pandas offers several tools for this.
df = pd.DataFrame({
    'year': [2023, 2023, 2023, 2024, 2024, 2024, 2023, 2024],
    'product': ['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A'],
    'region': ['North', 'North', 'South', 'North', 'South', 'North', 'South', 'South']
})
# Using unstack() to pivot grouped results
pivot_unstacked = df.groupby(['year', 'product']).size().unstack(fill_value=0)
print("Using unstack():")
print(pivot_unstacked)
print()
# Using pd.crosstab() for direct cross-tabulation
cross = pd.crosstab(df['year'], df['product'])
print("Using crosstab():")
print(cross)
print()
# crosstab with margins (totals)
cross_margins = pd.crosstab(df['year'], df['product'], margins=True, margins_name='Total')
print("With margins:")
print(cross_margins)
print()
# Three-way crosstab using pivot_table
pivot = df.pivot_table(
    index='year',
    columns='product',
    values='region',
    aggfunc='count',
    fill_value=0
)
print("Using pivot_table():")
print(pivot)
Output:
Using unstack():
product  A  B
year
2023     2  2
2024     2  2

Using crosstab():
product  A  B
year
2023     2  2
2024     2  2

With margins:
product  A  B  Total
year
2023     2  2      4
2024     2  2      4
Total    4  4      8

Using pivot_table():
product  A  B
year
2023     2  2
2024     2  2
Choose your reshaping method based on context:

- unstack(): When you already have a grouped Series and want to pivot one level
- crosstab(): When you need a quick frequency table with optional margins
- pivot_table(): When you need more control or are combining counts with other aggregations
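One more crosstab() option worth knowing: the normalize parameter converts counts into proportions ('index' for row-wise shares, 'columns' for column-wise, 'all' for shares of the grand total). A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2023, 2023, 2023, 2024, 2024, 2024, 2023, 2024],
    'product': ['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A'],
    'region': ['North', 'North', 'South', 'North', 'South', 'North', 'South', 'South']
})

# normalize='index' makes each row sum to 1.0, giving within-year shares
shares = pd.crosstab(df['year'], df['product'], normalize='index')
print(shares)
```

This is the cross-tabulation analogue of value_counts(normalize=True) and saves a manual division by row totals.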
Conclusion
Here’s your quick reference for choosing the right counting method:
| Scenario | Method |
|---|---|
| Total rows per group (including NaN) | groupby().size() |
| Non-null values per column per group | groupby().count() |
| Single column frequencies | value_counts() |
| Multi-column grouping | groupby(['col1', 'col2']).size() |
| Counts with other metrics | groupby().agg() |
| Cross-tabulation | pd.crosstab() or unstack() |
Remember these principles: use size() for total counts and count() for non-null counts. Always call reset_index() when you need a clean DataFrame. And when in doubt, start with value_counts() for quick exploration before moving to more complex groupby operations.