Pandas - Value Counts with Examples

The `value_counts()` method is a fundamental Pandas operation that returns the frequency of unique values in a Series. By default, it returns counts in descending order and excludes NaN values.

Key Insights

  • The value_counts() method returns frequency distributions of unique values in a Series, with options to normalize results, control sorting, and include missing values
  • Combining value_counts() with groupby operations enables multi-dimensional frequency analysis across DataFrame columns
  • Performance optimization techniques like categorical data types and binning continuous variables can significantly speed up value counting operations on large datasets

Basic Value Counts Usage

Call value_counts() with no arguments to get the default behavior: frequencies of each unique value, sorted in descending order, with NaN values excluded.

import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({
    'product': ['laptop', 'mouse', 'keyboard', 'mouse', 'laptop', 
                'monitor', 'mouse', 'keyboard', 'laptop', 'mouse'],
    'status': ['active', 'inactive', 'active', 'active', None, 
               'active', 'active', 'inactive', 'active', 'active']
})

# Basic value counts
product_counts = df['product'].value_counts()
print(product_counts)

Output:

product
mouse       4
laptop      3
keyboard    2
monitor     1
Name: count, dtype: int64

The method returns a Series with the unique values as the index and their frequencies as values. The result is automatically sorted by count in descending order.
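
Because the result is an ordinary Series, you can index it by value, pull out the most frequent item, or test membership directly. A quick sketch, rebuilding the product column from the example above:

```python
import pandas as pd

# Rebuild the sample product column from the example above
products = pd.Series(['laptop', 'mouse', 'keyboard', 'mouse', 'laptop',
                      'monitor', 'mouse', 'keyboard', 'laptop', 'mouse'])

counts = products.value_counts()

# Index by value to get a single frequency
print(counts['mouse'])           # 4

# idxmax() returns the most frequent value (the label of the max count)
print(counts.idxmax())           # mouse

# Membership checks work like any Series index
print('monitor' in counts.index) # True
```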

Handling Missing Values

By default, value_counts() excludes NaN values. Set dropna=False to include them in the count.

# Exclude NaN (default)
status_counts = df['status'].value_counts()
print("Excluding NaN:")
print(status_counts)

# Include NaN
status_counts_with_nan = df['status'].value_counts(dropna=False)
print("\nIncluding NaN:")
print(status_counts_with_nan)

Output:

Excluding NaN:
status
active      7
inactive    2
Name: count, dtype: int64

Including NaN:
status
active      7
inactive    2
NaN         1
Name: count, dtype: int64
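
A useful sanity check: with dropna=False the counts should sum to the full length of the Series, and isna().sum() should match the NaN row. A small sketch using the same status data:

```python
import pandas as pd

# Same status column as the example above
status = pd.Series(['active', 'inactive', 'active', 'active', None,
                    'active', 'active', 'inactive', 'active', 'active'])

# With dropna=False, the counts account for every row
with_nan = status.value_counts(dropna=False)
print(with_nan.sum() == len(status))  # True

# isna().sum() gives the missing count directly
print(status.isna().sum())            # 1
```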

Normalized Frequencies

Convert counts to proportions using the normalize parameter. This is particularly useful for understanding relative distributions.

# Get proportions instead of counts
product_proportions = df['product'].value_counts(normalize=True)
print(product_proportions)

# Format as percentages
product_percentages = df['product'].value_counts(normalize=True) * 100
print("\nAs percentages:")
print(product_percentages.round(2))

Output:

product
mouse       0.4
laptop      0.3
keyboard    0.2
monitor     0.1
Name: proportion, dtype: float64

As percentages:
product
mouse       40.0
laptop      30.0
keyboard    20.0
monitor     10.0
Name: proportion, dtype: float64

Sorting and Limiting Results

Control the sort order with the sort and ascending parameters, and limit the number of results with head(). Note that sort=False does not sort alphabetically; it keeps values in their order of first appearance.

# Keep values in order of first appearance (sort=False)
appearance_order = df['product'].value_counts(sort=False)
print("Order of appearance:")
print(appearance_order)

# Ascending order by count
ascending_counts = df['product'].value_counts(ascending=True)
print("\nAscending by count:")
print(ascending_counts)

# Top N results
top_2 = df['product'].value_counts().head(2)
print("\nTop 2 products:")
print(top_2)

Binning Continuous Variables

For numerical data, combine value_counts() with cut() or qcut() to create frequency distributions of binned values.

# Create sample numerical data
np.random.seed(42)
df_numeric = pd.DataFrame({
    'price': np.random.uniform(10, 1000, 100),
    'quantity': np.random.randint(1, 50, 100)
})

# Bin prices into categories
price_bins = pd.cut(df_numeric['price'], 
                    bins=[0, 100, 500, 1000], 
                    labels=['Low', 'Medium', 'High'])
price_distribution = price_bins.value_counts()
print("Price distribution:")
print(price_distribution)

# Equal-frequency binning
quantity_quartiles = pd.qcut(df_numeric['quantity'], 
                              q=4, 
                              labels=['Q1', 'Q2', 'Q3', 'Q4'])
quartile_distribution = quantity_quartiles.value_counts()
print("\nQuantity quartiles:")
print(quartile_distribution)
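
value_counts() can also bin numeric data on its own: passing an integer to its bins parameter splits the range into that many equal-width intervals before counting, without a separate cut() call. A sketch under the same random data:

```python
import pandas as pd
import numpy as np

np.random.seed(42)
prices = pd.Series(np.random.uniform(10, 1000, 100))

# bins=4 splits the price range into 4 equal-width intervals and
# counts values per interval; sort=False keeps the bins in order
binned = prices.value_counts(bins=4, sort=False)
print(binned)
```

This is a convenient shortcut when you do not need custom bin edges or labels; for those, stick with pd.cut() as shown above.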

Multi-Column Value Counts

Analyze combinations of values across multiple columns by stacking them or using groupby operations.

# Create multi-column dataset
df_sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'North', 'East', 'South'],
    'product': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B'],
    'status': ['Won', 'Lost', 'Won', 'Won', 'Won', 'Lost', 'Lost', 'Won']
})

# Count combinations of region and product
region_product_counts = df_sales.groupby(['region', 'product']).size()
print("Region-Product combinations:")
print(region_product_counts)

# Alternative: using value_counts on multiple columns (Pandas 1.1+)
combination_counts = df_sales[['region', 'product']].value_counts()
print("\nUsing value_counts on multiple columns:")
print(combination_counts)
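
When you want the pair counts as a table rather than a MultiIndex Series, pd.crosstab() pivots one column against the other and fills missing combinations with zero. A sketch on the same sales data:

```python
import pandas as pd

df_sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'North', 'East', 'South'],
    'product': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B']
})

# Rows are regions, columns are products, cells are pair counts
table = pd.crosstab(df_sales['region'], df_sales['product'])
print(table)
```

The tabular layout makes absent combinations explicit (they show up as 0), which the stacked groupby output omits.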

Performance Optimization with Categorical Data

For columns with repeated string values, converting to categorical dtype significantly improves value_counts() performance.

import time

# Create large dataset
large_df = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], 1000000)
})

# Measure performance with object dtype
start = time.time()
counts_object = large_df['category'].value_counts()
time_object = time.time() - start

# Convert to categorical and measure
large_df['category'] = large_df['category'].astype('category')
start = time.time()
counts_categorical = large_df['category'].value_counts()
time_categorical = time.time() - start

print(f"Object dtype: {time_object:.4f} seconds")
print(f"Categorical dtype: {time_categorical:.4f} seconds")
print(f"Speedup: {time_object/time_categorical:.2f}x")
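
The categorical dtype also cuts memory, since each row stores a small integer code instead of a Python string. A sketch comparing the two representations with memory_usage(deep=True) (the exact byte counts will vary by platform and pandas version):

```python
import pandas as pd
import numpy as np

# A column of repeated short strings
col = pd.Series(np.random.choice(['A', 'B', 'C', 'D', 'E'], 100_000))

# deep=True counts the actual string objects, not just pointers
obj_bytes = col.memory_usage(deep=True)
cat_bytes = col.astype('category').memory_usage(deep=True)

print(f"object:      {obj_bytes:,} bytes")
print(f"categorical: {cat_bytes:,} bytes")
```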

Conditional Value Counts

Filter data before counting or use boolean indexing to count values meeting specific criteria.

df_sales = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'revenue': [100, 200, 150, 300, 250, 120, 280, 220, 180, 310],
    'region': ['North', 'South', 'North', 'East', 'South', 
               'North', 'East', 'South', 'North', 'East']
})

# Count products with revenue > 200
high_revenue_products = df_sales[df_sales['revenue'] > 200]['product'].value_counts()
print("Products with revenue > 200:")
print(high_revenue_products)

# Count by region for specific products
north_products = df_sales[df_sales['region'] == 'North']['product'].value_counts()
print("\nProducts sold in North region:")
print(north_products)
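
When you need the breakdown for every region at once, chaining value_counts() after groupby() avoids writing one filter per group. A sketch on the same df_sales data:

```python
import pandas as pd

df_sales = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'region': ['North', 'South', 'North', 'East', 'South',
               'North', 'East', 'South', 'North', 'East']
})

# One MultiIndex Series with per-region product counts
per_region = df_sales.groupby('region')['product'].value_counts()
print(per_region)

# Look up a single (region, product) combination
print(per_region[('North', 'A')])  # 4
```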

Converting Results to DataFrame

Transform value_counts() output into a DataFrame for further analysis or export.

# Basic conversion (in pandas 2.x reset_index already names the
# columns 'product' and 'count'; renaming keeps it explicit and
# works on older versions too)
counts_df = df['product'].value_counts().reset_index()
counts_df.columns = ['product', 'count']
print(counts_df)

# With percentage
counts_df['percentage'] = (counts_df['count'] / counts_df['count'].sum() * 100).round(2)
print("\nWith percentages:")
print(counts_df)

# Convert to dictionary
counts_dict = df['product'].value_counts().to_dict()
print("\nAs dictionary:")
print(counts_dict)

Practical Application: Data Quality Checks

Use value_counts() to identify data quality issues like unexpected values or imbalanced distributions.

# Sample dataset with potential issues
df_quality = pd.DataFrame({
    'status': ['active', 'Active', 'ACTIVE', 'inactive', 'active', 
               'pending', 'active', 'Inactive', 'active', 'unknown'],
    'priority': [1, 2, 3, 1, 2, 99, 1, 2, 1, 2]
})

# Check for case inconsistencies
print("Status value distribution:")
print(df_quality['status'].value_counts())

# Identify outliers in priority
print("\nPriority distribution:")
print(df_quality['priority'].value_counts().sort_index())

# Clean and recount
df_quality['status_clean'] = df_quality['status'].str.lower()
print("\nCleaned status distribution:")
print(df_quality['status_clean'].value_counts())
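
Beyond spotting stray values, normalize=True makes class imbalance obvious: the share of the dominant value is often the first number to check before modeling. A sketch with a deliberately skewed column:

```python
import pandas as pd

# 7 of 10 rows belong to one class
status = pd.Series(['active', 'active', 'active', 'active', 'active',
                    'active', 'active', 'inactive', 'inactive', 'pending'])

# The first entry of the normalized counts is the dominant class share
shares = status.value_counts(normalize=True)
print(shares.iloc[0])  # 0.7
```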

The value_counts() method is essential for exploratory data analysis, data validation, and understanding distributions in your datasets. Mastering its parameters and combining it with other Pandas operations enables sophisticated frequency analysis workflows.
