Pandas - GroupBy and Count


Key Insights

• GroupBy operations in Pandas enable efficient data aggregation by splitting data into groups based on categorical variables, applying functions, and combining results into a structured output
• The size(), count(), and value_counts() methods each serve distinct purposes when counting grouped data; understanding their differences prevents common analytical errors
• Combining GroupBy with multi-level indexing, named aggregations, and transformation functions unlocks advanced analytical patterns for real-world data processing tasks

Understanding GroupBy Mechanics

The GroupBy operation implements the split-apply-combine pattern. Pandas splits your DataFrame into groups based on one or more keys, applies a function to each group independently, and combines the results into a new data structure.

import pandas as pd
import numpy as np

# Sample sales data
data = {
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, np.nan, 175, 225, 130],
    'units': [10, 15, 20, 25, 17, 22, 13]
}

df = pd.DataFrame(data)
print(df)

When you call groupby(), Pandas doesn’t immediately compute anything. It returns a GroupBy object that holds the grouping logic:

grouped = df.groupby('region')
print(type(grouped))  # <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
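Because the object is lazy, you can inspect it before any aggregation runs. A short sketch (the sample DataFrame is rebuilt here so the snippet is self-contained):

```python
import pandas as pd
import numpy as np

# Same sample data as above
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, np.nan, 175, 225, 130],
    'units': [10, 15, 20, 25, 17, 22, 13],
})

grouped = df.groupby('region')

# Mapping of group label -> row index labels
print(grouped.groups)

# Pull out a single group as a regular DataFrame
print(grouped.get_group('North'))
```

Iterating over the GroupBy object (`for name, group in grouped:`) yields the same per-group DataFrames one at a time.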

size() vs count() vs value_counts()

These three methods appear similar but handle missing values and return structures differently.

size() counts all rows in each group, including NaN values:

# Count total rows per region (includes NaN)
result = df.groupby('region').size()
print(result)
# region
# East     2
# North    3
# South    2
# dtype: int64

count() counts non-null values for each column in each group:

# Count non-null values per column per region
result = df.groupby('region').count()
print(result)
#        product  sales  units
# region                      
# East         2      2      2
# North        3      3      3
# South        2      1      2

Notice that South region shows only 1 for sales because one value is NaN.

value_counts() on a GroupBy object counts unique combinations:

# Count occurrences of each product within each region
result = df.groupby('region')['product'].value_counts()
print(result)
# region  product
# East    A          1
#         B          1
# North   A          2
#         B          1
# South   A          1
#         B          1
# Name: product, dtype: int64
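value_counts() also accepts normalize=True, which returns within-group proportions instead of raw counts; a sketch on the same data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, np.nan, 175, 225, 130],
    'units': [10, 15, 20, 25, 17, 22, 13],
})

# Share of each product within each region (sums to 1 per region)
result = df.groupby('region')['product'].value_counts(normalize=True)
print(result)
```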

Grouping by Multiple Columns

Real-world analysis often requires grouping by multiple dimensions:

# Count combinations of region and product
result = df.groupby(['region', 'product']).size()
print(result)
# region  product
# East    A          1
#         B          1
# North   A          2
#         B          1
# South   A          1
#         B          1
# dtype: int64

Convert multi-index results to a DataFrame for easier manipulation:

result = df.groupby(['region', 'product']).size().reset_index(name='count')
print(result)
#   region product  count
# 0   East       A      1
# 1   East       B      1
# 2  North       A      2
# 3  North       B      1
# 4  South       A      1
# 5  South       B      1
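An alternative to reset_index is as_index=False, which keeps the grouping keys as regular columns from the start; recent pandas versions place the count in a column named size:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, np.nan, 175, 225, 130],
    'units': [10, 15, 20, 25, 17, 22, 13],
})

# as_index=False keeps 'region' and 'product' as columns;
# the row count lands in a column named 'size'
result = df.groupby(['region', 'product'], as_index=False).size()
print(result)
```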

Named Aggregations with agg()

The agg() method provides fine-grained control over aggregations with named outputs:

result = df.groupby('region').agg(
    total_rows=('product', 'size'),
    non_null_sales=('sales', 'count'),
    total_sales=('sales', 'sum'),
    avg_units=('units', 'mean')
)
print(result)
#         total_rows  non_null_sales  total_sales  avg_units
# region                                                    
# East             2               2        400.0  19.500000
# North            3               3        380.0  12.666667
# South            2               1        200.0  22.500000

Apply multiple aggregations to the same column:

result = df.groupby('product').agg(
    count=('sales', 'count'),
    total=('sales', 'sum'),
    mean=('sales', 'mean'),
    std=('sales', 'std')
)
print(result)
#          count  total    mean        std
# product                                 
# A            4  605.0  151.25  44.791183
# B            2  375.0  187.50  53.033009

Conditional Counting with Custom Functions

Use custom functions within GroupBy for complex counting logic:

def count_high_value(series, threshold=150):
    """Count values above threshold"""
    return (series > threshold).sum()

result = df.groupby('region')['sales'].agg([
    'count',
    ('high_value_count', lambda x: count_high_value(x, 150))
])
print(result)
#        count  high_value_count
# region                        
# East       2                 2
# North      3                 0
# South      1                 1
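For simple threshold counts, a vectorized boolean flag summed per group avoids the lambda entirely; a sketch of the same computation:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, np.nan, 175, 225, 130],
    'units': [10, 15, 20, 25, 17, 22, 13],
})

# NaN > 150 evaluates to False, so missing sales are excluded automatically
high = df['sales'] > 150
result = high.groupby(df['region']).sum()
print(result)
```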

Filtering Groups Based on Count

Filter groups that meet specific count criteria:

# Keep only regions with more than 2 transactions
result = df.groupby('region').filter(lambda x: len(x) > 2)
print(result)
#   region product  sales  units
# 0  North       A  100.0     10
# 1  North       B  150.0     15
# 6  North       A  130.0     13

Combine filtering with aggregation:

# Get counts only for regions with more than 1 product type
result = (df.groupby('region')
          .filter(lambda x: x['product'].nunique() > 1)
          .groupby('region')
          .size())
print(result)
# region
# East     2
# North    3
# South    2
# dtype: int64

Handling Missing Data in Counts

Control how GroupBy handles NaN values:

# Default: dropna=True (excludes NaN from grouping keys)
df_with_nan = df.copy()
df_with_nan.loc[2, 'region'] = np.nan

result = df_with_nan.groupby('region').size()
print(result)
# region
# East     2
# North    3
# South    1
# dtype: int64

# Include NaN as a group
result = df_with_nan.groupby('region', dropna=False).size()
print(result)
# region
# East     2
# North    3
# South    1
# NaN      1
# dtype: int64
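If you would rather keep missing keys under a readable label than a NaN group, replace them before grouping. A small sketch, using the hypothetical label 'Unknown':

```python
import pandas as pd
import numpy as np

df_with_nan = pd.DataFrame({
    'region': ['North', 'North', np.nan, 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, np.nan, 175, 225, 130],
})

# Label missing keys explicitly instead of relying on dropna=False
result = (df_with_nan
          .assign(region=df_with_nan['region'].fillna('Unknown'))
          .groupby('region')
          .size())
print(result)
```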

Performance Optimization for Large Datasets

When working with large datasets, optimize GroupBy operations:

# Use categorical data types for grouping columns
df['region'] = df['region'].astype('category')
df['product'] = df['product'].astype('category')

# Categorical keys reduce memory usage and speed up grouping
%timeit df.groupby('region').size()  # %timeit is an IPython/Jupyter magic
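One caveat with categorical keys: by default, groupby can emit a row for every declared category, including ones absent from the data. The observed parameter controls this; a sketch that declares an unused 'West' category:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'units': [10, 15, 20, 25, 17, 22, 13],
})

# Declare a 'West' category that never occurs in the data
df['region'] = df['region'].astype(
    pd.CategoricalDtype(['East', 'North', 'South', 'West'])
)

# observed=False lists every category, so West appears with count 0
print(df.groupby('region', observed=False).size())

# observed=True keeps only categories actually present in the data
print(df.groupby('region', observed=True).size())
```

Recent pandas versions are moving the default toward observed=True, so it is safest to pass it explicitly.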

To get total and unique counts per group in a single pass:

# Compute count and nunique together in one agg call
result = df.groupby('region')['product'].agg(['count', 'nunique'])
print(result)
#        count  nunique
# region               
# East       2        2
# North      3        2
# South      2        2

Pivot Tables as GroupBy Alternative

Pivot tables provide an alternative syntax for counting grouped data:

# Count occurrences using pivot_table
result = pd.pivot_table(
    df,
    values='sales',
    index='region',
    columns='product',
    aggfunc='count',
    fill_value=0
)
print(result)
# product  A  B
# region       
# East     1  1
# North    2  1
# South    1  0

South shows 0 for product B because aggfunc='count' counts non-null values of the sales column, and that row's only sale is NaN. To count rows regardless of missing values, use a column with no NaNs (such as units) as values, or pass aggfunc=len.
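When you only need row counts per combination, pd.crosstab is a more direct route: it counts rows rather than non-null values, so NaN in other columns does not affect the result:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 200, np.nan, 175, 225, 130],
})

# crosstab counts rows per (region, product) combination,
# regardless of missing values in other columns
result = pd.crosstab(df['region'], df['product'])
print(result)
```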

Cumulative Counts Within Groups

Calculate running counts within each group:

df_sorted = df.sort_values(['region', 'product'])
df_sorted['cumulative_count'] = df_sorted.groupby('region').cumcount() + 1
print(df_sorted[['region', 'product', 'cumulative_count']])
#   region product  cumulative_count
# 4   East       A                 1
# 5   East       B                 2
# 0  North       A                 1
# 6  North       A                 2
# 1  North       B                 3
# 2  South       A                 1
# 3  South       B                 2
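A related pattern broadcasts each group's total count back onto the original rows with transform, which is useful for "row i is 1 of N in its group" logic; a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East', 'East', 'North'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
})

# transform('size') returns one value per original row,
# so the group count aligns with the source DataFrame
df['region_count'] = df.groupby('region')['region'].transform('size')
print(df)
```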

GroupBy counting operations form the foundation of data analysis workflows. Master these patterns to efficiently aggregate, filter, and transform grouped data in production environments.
