How to GroupBy and Count in Pandas

Key Insights

  • Use size() when you need total row counts including NaN values; use count() when you want to exclude missing data from your tallies
  • value_counts() is the fastest path to frequency counts for a single column, but groupby() offers more flexibility for multi-column analysis and combining with other aggregations
  • Always call reset_index() on grouped results when you need a clean DataFrame for further analysis or export: grouping by multiple columns otherwise leaves you with a Series indexed by a MultiIndex

Introduction

Counting things is the foundation of data analysis. Before you build models or create visualizations, you need to understand what’s in your data: How many orders per customer? How many defects per production line? How many transactions per day?

Pandas provides several methods for grouping and counting data, each with specific strengths. The problem is that newcomers often pick the wrong tool, leading to unexpected results—especially when NaN values are involved. This article cuts through the confusion and gives you a clear framework for choosing the right counting method.

We’ll cover the core approaches: size(), count(), value_counts(), and aggregation patterns. By the end, you’ll know exactly which method to reach for in any situation.

Basic GroupBy with size() vs count()

The most common mistake in Pandas counting is conflating size() and count(). They look similar but behave differently with missing data.

size() counts all rows, including those with NaN values. It returns the total number of rows in each group.

count() counts non-null values in each column. It excludes NaN entries from the tally.

Here’s the difference in action:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'South'],
    'sales_rep': ['Alice', 'Bob', np.nan, 'Charlie', 'Diana'],
    'revenue': [1000, 1500, 800, np.nan, 1200]
})

print("Original DataFrame:")
print(df)
print()

# size() counts ALL rows per group
print("Using size():")
print(df.groupby('region').size())
print()

# count() counts non-null values per column
print("Using count():")
print(df.groupby('region').count())

Output:

Original DataFrame:
  region sales_rep  revenue
0  North     Alice   1000.0
1  North       Bob   1500.0
2  South       NaN    800.0
3  South   Charlie      NaN
4  South     Diana   1200.0

Using size():
region
North    2
South    3
dtype: int64

Using count():
        sales_rep  revenue
region                    
North           2        2
South           2        2

Notice that size() returns a Series with the total row count (South has 3 rows), while count() returns a DataFrame showing non-null counts per column (South has only 2 non-null values in both sales_rep and revenue).

When to use each:

  • Use size() when you want the total number of records regardless of data quality
  • Use count() when you need to know how many complete values exist in specific columns
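There is a third NaN pitfall worth knowing: when the grouping key itself contains missing values, groupby() silently drops those rows by default. A minimal sketch with hypothetical data:

```python
import pandas as pd
import numpy as np

# Hypothetical data: one row is missing its group key
df = pd.DataFrame({
    'region': ['North', 'South', np.nan, 'South'],
    'revenue': [1000, 800, 500, 1200]
})

# By default, rows whose key is NaN are excluded from every group
default_counts = df.groupby('region').size()
print(default_counts)   # the NaN row vanishes entirely

# dropna=False keeps the missing key as its own group
full_counts = df.groupby('region', dropna=False).size()
print(full_counts)      # includes a NaN group with 1 row
```

If your counts don't add up to the number of rows in the DataFrame, a NaN group key is a likely culprit.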

Counting with value_counts()

For single-column frequency counts, value_counts() is the most direct approach. It’s essentially a shortcut for groupby().size() with results sorted in descending order and NaN values dropped by default.

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Widget', 'Gizmo', 'Widget', 'Gadget'],
    'status': ['shipped', 'pending', 'shipped', 'shipped', 'returned', 'shipped']
})

# Basic frequency count
print("Product counts:")
print(df['product'].value_counts())
print()

# Get proportions instead of counts
print("Product proportions:")
print(df['product'].value_counts(normalize=True))
print()

# Sort by index instead of count
print("Sorted alphabetically:")
print(df['product'].value_counts().sort_index())
print()

# Include NaN values in the count
df_with_nan = df.copy()
df_with_nan.loc[0, 'product'] = np.nan

print("With dropna=False:")
print(df_with_nan['product'].value_counts(dropna=False))

Output:

Product counts:
product
Widget    3
Gadget    2
Gizmo     1
Name: count, dtype: int64

Product proportions:
product
Widget    0.500000
Gadget    0.333333
Gizmo     0.166667
Name: proportion, dtype: float64

Sorted alphabetically:
product
Gadget    2
Gizmo     1
Widget    3
Name: count, dtype: int64

With dropna=False:
product
Widget    2
Gadget    2
NaN       1
Gizmo     1
Name: count, dtype: int64

The normalize=True parameter is particularly useful for understanding distributions. The dropna=False parameter ensures you don’t accidentally hide missing data patterns.
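value_counts() is not limited to a single Series. In pandas 1.1 and later, DataFrame.value_counts() counts unique row combinations across several columns, which can replace a groupby().size() for quick frequency checks. A short sketch reusing the product/status data:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Widget', 'Gizmo', 'Widget', 'Gadget'],
    'status': ['shipped', 'pending', 'shipped', 'shipped', 'returned', 'shipped']
})

# Count unique (product, status) combinations, sorted descending
pair_counts = df.value_counts(subset=['product', 'status'])
print(pair_counts)
```

The result is a Series with a MultiIndex, so the same reset_index() trick applies if you need a flat DataFrame.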

Grouping by Multiple Columns

Real analysis often requires grouping by multiple dimensions. Pass a list of column names to groupby() to create hierarchical groups.

df = pd.DataFrame({
    'year': [2023, 2023, 2023, 2024, 2024, 2024],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q1', 'Q1', 'Q2'],
    'region': ['North', 'South', 'North', 'North', 'South', 'South'],
    'orders': [150, 200, 175, 180, 220, 195]
})

# Group by multiple columns
grouped = df.groupby(['year', 'quarter']).size()
print("Grouped counts (Series with MultiIndex):")
print(grouped)
print()

# Convert to a clean DataFrame
grouped_df = df.groupby(['year', 'quarter']).size().reset_index(name='count')
print("As a DataFrame:")
print(grouped_df)
print()

# Three-level grouping
detailed = df.groupby(['year', 'quarter', 'region']).size().reset_index(name='count')
print("Three-level grouping:")
print(detailed)

Output:

Grouped counts (Series with MultiIndex):
year  quarter
2023  Q1         2
      Q2         1
2024  Q1         2
      Q2         1
dtype: int64

As a DataFrame:
   year quarter  count
0  2023      Q1      2
1  2023      Q2      1
2  2024      Q1      2
3  2024      Q2      1

Three-level grouping:
   year quarter region  count
0  2023      Q1  North      1
1  2023      Q1  South      1
2  2023      Q2  North      1
3  2024      Q1  North      1
4  2024      Q1  South      1
5  2024      Q2  South      1

The reset_index(name='count') pattern is essential. Without it, you get a Series with a MultiIndex, which is harder to work with for most downstream tasks. The name parameter gives your count column a meaningful label.
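An alternative worth knowing: in recent pandas versions, passing as_index=False to groupby() makes size() return a flat DataFrame directly, with the count in a column named size. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2023, 2023, 2023, 2024, 2024, 2024],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q1', 'Q1', 'Q2'],
    'orders': [150, 200, 175, 180, 220, 195]
})

# as_index=False keeps the group keys as regular columns,
# so no reset_index() is needed afterward
flat = df.groupby(['year', 'quarter'], as_index=False).size()
print(flat)
```

The trade-off is that the count column is always called size; reset_index(name='count') remains the way to go when you want a custom label.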

Conditional Counting with Aggregation

When you need counts alongside other metrics, or when you need conditional counts, use agg().

df = pd.DataFrame({
    'category': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Electronics'],
    'product_id': [101, 102, 201, 202, 103],
    'revenue': [500, 750, 120, 85, 620],
    'returned': [False, True, False, False, True]
})

# Count alongside other aggregations
summary = df.groupby('category').agg(
    product_count=('product_id', 'count'),
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean')
)
print("Multiple aggregations:")
print(summary)
print()

# Conditional counting: count returns per category
returns_count = df.groupby('category').agg(
    total_products=('product_id', 'count'),
    returned_products=('returned', 'sum')  # True=1, False=0
)
returns_count['return_rate'] = returns_count['returned_products'] / returns_count['total_products']
print("With conditional count:")
print(returns_count)

Output:

Multiple aggregations:
             product_count  total_revenue  avg_revenue
category                                              
Clothing                 2            205   102.500000
Electronics              3           1870   623.333333

With conditional count:
             total_products  returned_products  return_rate
category                                                   
Clothing                  2                  0     0.000000
Electronics               3                  2     0.666667

The named aggregation syntax column_name=('source_column', 'function') keeps your output clean and self-documenting. For conditional counts, leverage boolean columns—summing them gives you the count of True values.
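The same boolean-sum trick works for arbitrary conditions, not just existing boolean columns: a lambda inside agg() can build the mask on the fly. A sketch with hypothetical data, counting products that earned more than 500 per category:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Electronics'],
    'revenue': [500, 750, 120, 85, 620],
})

# Build the boolean mask inside the aggregation and sum it:
# each True contributes 1 to the count
high_earners = df.groupby('category').agg(
    over_500=('revenue', lambda s: (s > 500).sum())
)
print(high_earners)
```

Lambdas are slower than built-in aggregations on large data, so prefer a precomputed boolean column when the condition is reused.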

Reshaping Count Results

Grouped counts often need reshaping for reports or visualization. Pandas offers several tools for this.

df = pd.DataFrame({
    'year': [2023, 2023, 2023, 2024, 2024, 2024, 2023, 2024],
    'product': ['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A'],
    'region': ['North', 'North', 'South', 'North', 'South', 'North', 'South', 'South']
})

# Using unstack() to pivot grouped results
pivot_unstacked = df.groupby(['year', 'product']).size().unstack(fill_value=0)
print("Using unstack():")
print(pivot_unstacked)
print()

# Using pd.crosstab() for direct cross-tabulation
cross = pd.crosstab(df['year'], df['product'])
print("Using crosstab():")
print(cross)
print()

# crosstab with margins (totals)
cross_margins = pd.crosstab(df['year'], df['product'], margins=True, margins_name='Total')
print("With margins:")
print(cross_margins)
print()

# Three-way crosstab using pivot_table
pivot = df.pivot_table(
    index='year', 
    columns='product', 
    values='region', 
    aggfunc='count',
    fill_value=0
)
print("Using pivot_table():")
print(pivot)

Output:

Using unstack():
product  A  B
year         
2023     2  2
2024     2  2

Using crosstab():
product  A  B
year         
2023     2  2
2024     2  2

With margins:
product  A  B  Total
year                
2023     2  2      4
2024     2  2      4
Total    4  4      8

Using pivot_table():
product  A  B
year         
2023     2  2
2024     2  2

Choose your reshaping method based on context:

  • unstack(): When you already have a grouped Series and want to pivot one level
  • crosstab(): When you need a quick frequency table with optional margins
  • pivot_table(): When you need more control or are combining counts with other aggregations
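crosstab() also supports a normalize parameter, which converts counts into proportions in one step; normalize='index' scales each row to sum to 1. A short sketch reusing the year/product data:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2023, 2023, 2023, 2024, 2024, 2024, 2023, 2024],
    'product': ['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A'],
})

# Each row of the crosstab becomes a distribution over products
shares = pd.crosstab(df['year'], df['product'], normalize='index')
print(shares)
```

normalize='columns' scales down each column instead, and normalize='all' scales the whole table.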

Conclusion

Here’s your quick reference for choosing the right counting method:

  • Total rows per group (including NaN): groupby().size()
  • Non-null values per column per group: groupby().count()
  • Single-column frequencies: value_counts()
  • Multi-column grouping: groupby(['col1', 'col2']).size()
  • Counts with other metrics: groupby().agg()
  • Cross-tabulation: pd.crosstab() or unstack()

Remember these principles: use size() for total counts and count() for non-null counts. Always call reset_index() when you need a clean DataFrame. And when in doubt, start with value_counts() for quick exploration before moving to more complex groupby operations.
