Pandas - Value Counts with Examples
The `value_counts()` method is a fundamental Pandas operation that returns the frequency of unique values in a Series. By default, it returns counts in descending order and excludes NaN values.
Key Insights
- The value_counts() method returns frequency distributions of unique values in a Series, with options to normalize, sort, and handle missing values differently
- Combining value_counts() with groupby operations enables multi-dimensional frequency analysis across DataFrame columns
- Performance optimization techniques such as categorical data types and binning continuous variables can significantly speed up value counting operations on large datasets
Basic Value Counts Usage
To see the method in action, start with a small DataFrame containing repeated product names and one missing status value.
import pandas as pd
import numpy as np
# Create sample data
df = pd.DataFrame({
    'product': ['laptop', 'mouse', 'keyboard', 'mouse', 'laptop',
                'monitor', 'mouse', 'keyboard', 'laptop', 'mouse'],
    'status': ['active', 'inactive', 'active', 'active', None,
               'active', 'active', 'inactive', 'active', 'active']
})
# Basic value counts
product_counts = df['product'].value_counts()
print(product_counts)
Output:
product
mouse 4
laptop 3
keyboard 2
monitor 1
Name: count, dtype: int64
The method returns a Series with the unique values as the index and their frequencies as values. The result is automatically sorted by count in descending order.
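Because the result is a regular Series, all the usual Series operations apply to it. A quick sketch using the product_counts result from above:
# Look up the frequency of a single value
print(product_counts['mouse'])  # 4
# Most common value and its count
print(product_counts.idxmax(), product_counts.max())  # mouse 4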
Handling Missing Values
By default, value_counts() excludes NaN values. Use the dropna parameter to include them in the count.
# Exclude NaN (default)
status_counts = df['status'].value_counts()
print("Excluding NaN:")
print(status_counts)
# Include NaN
status_counts_with_nan = df['status'].value_counts(dropna=False)
print("\nIncluding NaN:")
print(status_counts_with_nan)
Output:
Excluding NaN:
status
active 7
inactive 2
Name: count, dtype: int64
Including NaN:
status
active 7
inactive 2
NaN 1
Name: count, dtype: int64
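If you only need to know how many values are missing, isna().sum() is a natural companion check:
# Count just the missing entries
print(df['status'].isna().sum())  # 1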
Normalized Frequencies
Convert counts to proportions using the normalize parameter. This is particularly useful for understanding relative distributions.
# Get proportions instead of counts
product_proportions = df['product'].value_counts(normalize=True)
print(product_proportions)
# Format as percentages
product_percentages = df['product'].value_counts(normalize=True) * 100
print("\nAs percentages:")
print(product_percentages.round(2))
Output:
product
mouse 0.4
laptop 0.3
keyboard 0.2
monitor 0.1
Name: proportion, dtype: float64
As percentages:
product
mouse 40.0
laptop 30.0
keyboard 20.0
monitor 10.0
Name: proportion, dtype: float64
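Counts and proportions are often reported together. One way to combine them, reusing the results from above, is pd.concat along the columns:
# Side-by-side counts and proportions
# (in pandas 2.x the columns are named 'count' and 'proportion')
summary = pd.concat([product_counts, product_proportions.round(2)], axis=1)
print(summary)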
Sorting and Limiting Results
Control the sort order with the ascending parameter, sort the result by its index with sort_index(), and limit the number of results with head().
# Sort by index (alphabetically); note that value_counts(sort=False)
# preserves order of appearance rather than sorting alphabetically
alphabetical = df['product'].value_counts().sort_index()
print("Alphabetical order:")
print(alphabetical)
# Ascending order by count
ascending_counts = df['product'].value_counts(ascending=True)
print("\nAscending by count:")
print(ascending_counts)
# Top N results
top_2 = df['product'].value_counts().head(2)
print("\nTop 2 products:")
print(top_2)
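Output:
Alphabetical order:
product
keyboard    2
laptop      3
monitor     1
mouse       4
Name: count, dtype: int64
Ascending by count:
product
monitor     1
keyboard    2
laptop      3
mouse       4
Name: count, dtype: int64
Top 2 products:
product
mouse     4
laptop    3
Name: count, dtype: int64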
Binning Continuous Variables
For numerical data, combine value_counts() with cut() or qcut() to create frequency distributions of binned values.
# Create sample numerical data
np.random.seed(42)
df_numeric = pd.DataFrame({
    'price': np.random.uniform(10, 1000, 100),
    'quantity': np.random.randint(1, 50, 100)
})
# Bin prices into categories
price_bins = pd.cut(df_numeric['price'],
                    bins=[0, 100, 500, 1000],
                    labels=['Low', 'Medium', 'High'])
price_distribution = price_bins.value_counts()
print("Price distribution:")
print(price_distribution)
# Equal-frequency binning
quantity_quartiles = pd.qcut(df_numeric['quantity'],
                             q=4,
                             labels=['Q1', 'Q2', 'Q3', 'Q4'])
quartile_distribution = quantity_quartiles.value_counts()
print("\nQuantity quartiles:")
print(quartile_distribution)
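For quick exploration, value_counts() can also bin numeric data itself via its bins parameter, which performs equal-width binning in one step (unlike qcut's equal-frequency bins):
# One-step equal-width binning
quick_bins = df_numeric['price'].value_counts(bins=4, sort=False)
print(quick_bins)  # sort=False keeps the intervals in bin order, not count order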
Multi-Column Value Counts
Analyze combinations of values across multiple columns by stacking them or using groupby operations.
# Create multi-column dataset
df_sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'North', 'East', 'South'],
    'product': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B'],
    'status': ['Won', 'Lost', 'Won', 'Won', 'Won', 'Lost', 'Lost', 'Won']
})
# Count combinations of region and product
region_product_counts = df_sales.groupby(['region', 'product']).size()
print("Region-Product combinations:")
print(region_product_counts)
# Alternative: using value_counts on multiple columns (Pandas 1.1+)
combination_counts = df_sales[['region', 'product']].value_counts()
print("\nUsing value_counts on multiple columns:")
print(combination_counts)
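The MultiIndex result can be hard to scan. For readability, pivot it into a two-dimensional table with unstack(), or build the table directly with pd.crosstab:
# Region-by-product count table
print(region_product_counts.unstack(fill_value=0))
# Equivalent cross-tabulation
print(pd.crosstab(df_sales['region'], df_sales['product']))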
Performance Optimization with Categorical Data
For columns with repeated string values, converting to categorical dtype significantly improves value_counts() performance.
import time
# Create large dataset
large_df = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], 1000000)
})
# Measure performance with object dtype
start = time.time()
counts_object = large_df['category'].value_counts()
time_object = time.time() - start
# Convert to categorical and measure
large_df['category'] = large_df['category'].astype('category')
start = time.time()
counts_categorical = large_df['category'].value_counts()
time_categorical = time.time() - start
print(f"Object dtype: {time_object:.4f} seconds")
print(f"Categorical dtype: {time_categorical:.4f} seconds")
print(f"Speedup: {time_object/time_categorical:.2f}x")
Conditional Value Counts
Filter data before counting or use boolean indexing to count values meeting specific criteria.
df_sales = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'revenue': [100, 200, 150, 300, 250, 120, 280, 220, 180, 310],
    'region': ['North', 'South', 'North', 'East', 'South',
               'North', 'East', 'South', 'North', 'East']
})
# Count products with revenue > 200
high_revenue_products = df_sales[df_sales['revenue'] > 200]['product'].value_counts()
print("Products with revenue > 200:")
print(high_revenue_products)
# Count by region for specific products
north_products = df_sales[df_sales['region'] == 'North']['product'].value_counts()
print("\nProducts sold in North region:")
print(north_products)
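To count products within every region at once, rather than filtering one region at a time, groupby() composes naturally with value_counts():
# Per-region product frequencies in a single call
print(df_sales.groupby('region')['product'].value_counts())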
Converting Results to DataFrame
Transform value_counts() output into a DataFrame for further analysis or export.
# Basic conversion (in pandas 2.x, reset_index() already yields 'product' and 'count' columns)
counts_df = df['product'].value_counts().reset_index()
counts_df.columns = ['product', 'count']  # explicit rename, needed on older pandas versions
print(counts_df)
# With percentage
counts_df['percentage'] = (counts_df['count'] / counts_df['count'].sum() * 100).round(2)
print("\nWith percentages:")
print(counts_df)
# Convert to dictionary
counts_dict = df['product'].value_counts().to_dict()
print("\nAs dictionary:")
print(counts_dict)
Practical Application: Data Quality Checks
Use value_counts() to identify data quality issues like unexpected values or imbalanced distributions.
# Sample dataset with potential issues
df_quality = pd.DataFrame({
    'status': ['active', 'Active', 'ACTIVE', 'inactive', 'active',
               'pending', 'active', 'Inactive', 'active', 'unknown'],
    'priority': [1, 2, 3, 1, 2, 99, 1, 2, 1, 2]
})
# Check for case inconsistencies
print("Status value distribution:")
print(df_quality['status'].value_counts())
# Identify outliers in priority
print("\nPriority distribution:")
print(df_quality['priority'].value_counts().sort_index())
# Clean and recount
df_quality['status_clean'] = df_quality['status'].str.lower()
print("\nCleaned status distribution:")
print(df_quality['status_clean'].value_counts())
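Low-frequency categories often turn out to be typos or unexpected codes. One way to surface them is to compare normalized counts against a threshold; the 15% cutoff below is purely illustrative:
# Flag rare categories (threshold chosen for illustration)
status_freq = df_quality['status_clean'].value_counts(normalize=True)
print(status_freq[status_freq < 0.15])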
The value_counts() method is essential for exploratory data analysis, data validation, and understanding distributions in your datasets. Mastering its parameters and combining it with other Pandas operations enables sophisticated frequency analysis workflows.