How to Handle Categorical Data in Pandas
Key Insights
- Converting string columns to categorical dtype can reduce memory usage by 90%+ and significantly speed up groupby operations, but only when cardinality is low relative to row count.
- Ordered categories enable meaningful comparisons and sorting (e.g., “Small” < “Medium” < “Large”), which is impossible with plain strings.
- Always define explicit category sets when working with multiple DataFrames to prevent silent data loss during merges and concatenations.
Categorical data appears everywhere in real-world datasets: customer segments, product categories, geographic regions, survey responses. Yet most pandas users treat these columns as plain strings, missing out on substantial memory savings and performance gains. Let’s fix that.
Understanding Categorical Data
Categorical data represents values that belong to a fixed, finite set of possibilities. Think status codes (“pending”, “approved”, “rejected”), t-shirt sizes, or country names. This differs from numerical data (continuous measurements) and free-form text.
Pandas provides a dedicated Categorical dtype that stores these values efficiently. Instead of repeating the full string “United States” a million times, pandas stores integer codes that map to a lookup table of unique categories.
Here’s the memory impact:
```python
import pandas as pd
import numpy as np

# Create a DataFrame with 1 million rows
n_rows = 1_000_000
regions = ['North', 'South', 'East', 'West', 'Central']

df = pd.DataFrame({
    'region_string': np.random.choice(regions, n_rows),
})

# Convert to categorical
df['region_categorical'] = df['region_string'].astype('category')

# Compare memory usage
string_memory = df['region_string'].memory_usage(deep=True)
categorical_memory = df['region_categorical'].memory_usage(deep=True)

print(f"String column: {string_memory / 1024**2:.2f} MB")
print(f"Categorical column: {categorical_memory / 1024**2:.2f} MB")
print(f"Memory reduction: {(1 - categorical_memory/string_memory) * 100:.1f}%")
```
Output:
```
String column: 59.00 MB
Categorical column: 0.95 MB
Memory reduction: 98.4%
```
The rule of thumb: use categorical when the number of unique values is small relative to total rows. A column with 5 unique values across a million rows? Perfect candidate. A column where every value is unique? Stick with strings.
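To make that rule of thumb concrete, here is a minimal sketch of a cardinality check before converting; `worth_converting` and its 50% threshold are illustrative choices, not a pandas built-in or an official cutoff:

```python
import pandas as pd
import numpy as np

def worth_converting(series: pd.Series, max_ratio: float = 0.5) -> bool:
    """Heuristic: convert only when unique values are a small fraction of rows."""
    return series.nunique() / len(series) <= max_ratio

low_card = pd.Series(np.random.choice(['a', 'b', 'c'], 10_000))
high_card = pd.Series([f'id_{i}' for i in range(10_000)])

print(worth_converting(low_card))   # True: 3 unique values across 10,000 rows
print(worth_converting(high_card))  # False: every value is unique
```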
Converting Columns to Categorical Type
You have two primary methods for creating categorical columns. Use astype('category') for quick conversions on existing columns:
```python
df = pd.DataFrame({
    'status': ['pending', 'approved', 'rejected', 'pending', 'approved'],
    'priority': ['high', 'low', 'medium', 'high', 'low']
})

# Convert single column
df['status'] = df['status'].astype('category')

# Convert multiple columns at once
categorical_columns = ['status', 'priority']
df[categorical_columns] = df[categorical_columns].astype('category')

print(df.dtypes)
```
For more control, use pd.Categorical() directly:
```python
# Create with explicit categories (including ones not in the data)
status_values = ['pending', 'approved', 'rejected', 'cancelled']

df['status'] = pd.Categorical(
    df['status'],
    categories=status_values,
    ordered=False
)

print(df['status'].cat.categories)
# Index(['pending', 'approved', 'rejected', 'cancelled'], dtype='object')
```
Specifying categories explicitly matters when you know the full set of valid values upfront. This prevents issues later when new data contains values your original dataset didn’t have.
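The flip side is worth seeing: any value outside the declared category set is coerced to NaN rather than raising an error. A small sketch with a hypothetical new batch of data:

```python
import pandas as pd

status_values = ['pending', 'approved', 'rejected', 'cancelled']

# 'on_hold' is not in the declared category set
new_batch = pd.Series(['approved', 'on_hold', 'pending'])
coerced = pd.Categorical(new_batch, categories=status_values)

print(coerced)                     # 'on_hold' silently becomes NaN
print(pd.isna(coerced).sum())      # 1
```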
Ordered vs. Unordered Categories
By default, categories are unordered—comparing “North” and “South” makes no logical sense. But some categorical data has inherent ordering: shirt sizes, education levels, satisfaction ratings.
```python
# Create ordered categories for t-shirt sizes
sizes = pd.Categorical(
    ['M', 'S', 'XL', 'L', 'M', 'S'],
    categories=['XS', 'S', 'M', 'L', 'XL'],
    ordered=True
)
df = pd.DataFrame({'size': sizes, 'quantity': [10, 25, 5, 15, 20, 30]})

# Now comparisons work
print(df[df['size'] > 'M'])  # Returns the L and XL rows

# Sorting respects the logical order
print(df.sort_values('size'))
```
You can also convert existing unordered categories to ordered:
```python
# Start with an unordered categorical
df = pd.DataFrame({
    'rating': pd.Categorical(['good', 'excellent', 'poor', 'good', 'fair'])
})

# Convert it to ordered, supplying the logical order explicitly
df['rating'] = df['rating'].cat.reorder_categories(
    ['poor', 'fair', 'good', 'excellent'], ordered=True
)

# Find all ratings above 'fair'
above_fair = df[df['rating'] > 'fair']
print(above_fair)
```
This ordering persists through operations. When you group by an ordered categorical and aggregate, the result maintains the logical order—no manual sorting required.
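A quick sketch of that behavior, reusing the t-shirt data (`observed=False` is passed explicitly so empty categories like 'XS' still appear and to avoid the deprecation warning in recent pandas versions):

```python
import pandas as pd

sizes = pd.Categorical(['M', 'S', 'XL', 'L', 'M', 'S'],
                       categories=['XS', 'S', 'M', 'L', 'XL'],
                       ordered=True)
df = pd.DataFrame({'size': sizes, 'quantity': [10, 25, 5, 15, 20, 30]})

# The result index comes back in category order (XS, S, M, L, XL),
# not alphabetical order, with no manual sorting
totals = df.groupby('size', observed=False)['quantity'].sum()
print(totals)
```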
Working with Category Properties
The .cat accessor provides methods for manipulating categories without touching the underlying data:
```python
df = pd.DataFrame({
    'department': pd.Categorical(['sales', 'engineering', 'marketing', 'sales'])
})

# View current categories
print(df['department'].cat.categories)
# Index(['engineering', 'marketing', 'sales'], dtype='object')

# Rename categories
df['department'] = df['department'].cat.rename_categories({
    'sales': 'Sales Team',
    'engineering': 'Engineering Team',
    'marketing': 'Marketing Team'
})

# Add new categories (doesn't add data, just expands the set of valid values)
df['department'] = df['department'].cat.add_categories(['HR Team', 'Finance Team'])

# Remove unused categories
df['department'] = df['department'].cat.remove_unused_categories()

# Reorder categories
df['department'] = df['department'].cat.reorder_categories([
    'Engineering Team', 'Sales Team', 'Marketing Team'
])
```
A critical method is set_categories(), which replaces the entire category set:
```python
# Reset to specific categories (values not in the new set become NaN)
df['department'] = df['department'].cat.set_categories([
    'Sales Team', 'Engineering Team', 'Product Team'
])
```
Values that don’t match the new categories silently become NaN. This behavior bites people regularly—always check for nulls after set_categories().
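A minimal self-contained sketch of that check, with a post-call null audit:

```python
import pandas as pd

dept = pd.Series(pd.Categorical(['sales', 'engineering', 'marketing']))
dept = dept.cat.set_categories(['sales', 'engineering', 'product'])

# 'marketing' was dropped from the category set, so it silently became NaN
n_lost = dept.isna().sum()
if n_lost:
    print(f"Warning: {n_lost} value(s) coerced to NaN by set_categories()")
```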
Encoding Categorical Data for Analysis
Machine learning models need numbers, not strings. Pandas provides get_dummies() for one-hot encoding:
```python
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': pd.Categorical(['S', 'M', 'L', 'M', 'S'],
                           categories=['S', 'M', 'L'], ordered=True),
    'price': [10, 20, 30, 15, 25]
})

# One-hot encode the categorical columns
encoded = pd.get_dummies(df, columns=['color', 'size'], dtype=int)
print(encoded)
```
Output:
```
   price  color_blue  color_green  color_red  size_S  size_M  size_L
0     10           0            0          1       1       0       0
1     20           1            0          0       0       1       0
2     30           0            1          0       0       0       1
3     15           0            0          1       0       1       0
4     25           1            0          0       1       0       0
```
For label encoding (converting categories to integers), use the underlying codes:
```python
# Label encoding via category codes
df['size_encoded'] = df['size'].cat.codes
print(df[['size', 'size_encoded']])
```
Output:
```
  size  size_encoded
0    S             0
1    M             1
2    L             2
3    M             1
4    S             0
```
For ordered categories, the codes respect the order—useful for ordinal features in tree-based models.
Performance Benefits and Memory Optimization
Beyond memory savings, categorical columns accelerate certain operations. Groupby operations benefit significantly because pandas can work with integer codes instead of comparing strings:
```python
import time

n_rows = 5_000_000
categories = [f'category_{i}' for i in range(100)]

df = pd.DataFrame({
    'group_string': np.random.choice(categories, n_rows),
    'value': np.random.randn(n_rows)
})
df['group_categorical'] = df['group_string'].astype('category')

# Benchmark the string groupby
start = time.time()
result1 = df.groupby('group_string')['value'].mean()
string_time = time.time() - start

# Benchmark the categorical groupby
start = time.time()
result2 = df.groupby('group_categorical', observed=False)['value'].mean()
categorical_time = time.time() - start

print(f"String groupby: {string_time:.3f}s")
print(f"Categorical groupby: {categorical_time:.3f}s")
print(f"Speedup: {string_time/categorical_time:.1f}x")
```
Typical results show 1.5-3x speedup for groupby operations. Merge operations also benefit when joining on categorical columns with matching categories.
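The merge case is easy to demonstrate: when both key columns share a single `CategoricalDtype`, the join works on integer codes and the merged key stays categorical. A short sketch (the manager names are made up for illustration):

```python
import pandas as pd

# One shared dtype for both DataFrames' key columns
region_dtype = pd.CategoricalDtype(['North', 'South', 'East', 'West'])

left = pd.DataFrame({
    'region': pd.Series(['North', 'South'], dtype=region_dtype),
    'sales': [100, 200]
})
right = pd.DataFrame({
    'region': pd.Series(['North', 'South'], dtype=region_dtype),
    'manager': ['Ana', 'Bo']
})

merged = left.merge(right, on='region')
print(merged['region'].dtype)  # category, preserved because the dtypes match
```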
Common Pitfalls and Best Practices
The biggest gotcha: combining DataFrames with mismatched categories.
```python
df1 = pd.DataFrame({
    'region': pd.Categorical(['North', 'South']),
    'sales': [100, 200]
})
df2 = pd.DataFrame({
    'region': pd.Categorical(['East', 'West']),
    'sales': [150, 250]
})

# The category sets differ, so concat silently falls back to object dtype
combined = pd.concat([df1, df2])
print(combined['region'].dtype)
# object
```
The concatenation succeeds, but because the two columns disagree on their category sets, pandas quietly converts the result to object dtype. You lose the memory and performance benefits without any warning, and later code that expects a `.cat` accessor will fail. The fix: define a shared category set explicitly before combining:
```python
all_regions = ['North', 'South', 'East', 'West', 'Central']
df1['region'] = df1['region'].cat.set_categories(all_regions)
df2['region'] = df2['region'].cat.set_categories(all_regions)

combined = pd.concat([df1, df2])
# Now the categories are consistent and the result stays categorical
```
Other best practices:
- Convert early: Transform to categorical right after loading data, before any operations.
- Watch for NaN handling: Categorical columns handle NaN differently than strings. NaN itself cannot be a category, so use `cat.add_categories()` plus `fillna()` if you need missing values represented as an explicit category.
- Avoid high-cardinality categoricals: If unique values exceed ~50% of rows, categorical provides no benefit and may hurt performance.
- Be explicit about ordering: If logical ordering matters, set `ordered=True` immediately; retrofitting it later causes confusion.
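The "convert early" advice can go one step further: declare the dtype at load time so the plain strings are never materialized at all. A sketch using an in-memory CSV in place of a real file:

```python
import io
import pandas as pd

csv = io.StringIO("region,sales\nNorth,100\nSouth,200\n")

# read_csv accepts per-column dtypes, including 'category'
df = pd.read_csv(csv, dtype={'region': 'category'})
print(df['region'].dtype)  # category
```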
Categorical data handling isn’t glamorous, but it’s foundational. Get it right, and your DataFrames become smaller, your operations faster, and your code more expressive. The .cat accessor is your friend—use it.