Pandas - Convert Column to Categorical
Key Insights
- Converting columns to categorical dtype reduces memory usage by up to 90% for columns with repeated string values, while enabling specialized categorical operations and ordered comparisons
- Pandas offers multiple conversion methods, including astype('category'), pd.Categorical(), and pd.cut() for binning, each suited to different use cases and performance requirements
- Categorical columns with ordered categories unlock comparison operations and efficient sorting, essential for ordinal data like ratings, education levels, or priority classifications
Understanding Categorical Data in Pandas
Categorical data represents a fixed set of possible values, typically strings or integers representing discrete groups. In Pandas, the categorical dtype stores data internally as integer codes mapped to category labels, dramatically reducing memory footprint compared to object dtype strings.
import pandas as pd
import numpy as np
# Create sample data with repetitive values
df = pd.DataFrame({
'product': ['laptop', 'mouse', 'keyboard', 'laptop', 'mouse'] * 10000,
'priority': ['high', 'low', 'medium', 'high', 'low'] * 10000
})
# Check memory usage before conversion
print(f"Object dtype memory: {df['product'].memory_usage(deep=True) / 1024:.2f} KB")
# Convert to categorical
df['product'] = df['product'].astype('category')
print(f"Categorical dtype memory: {df['product'].memory_usage(deep=True) / 1024:.2f} KB")
This conversion typically shows 80-90% memory reduction for columns with low cardinality (few unique values relative to total rows).
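The savings come from the representation itself: each row stores a small integer code, and the labels are stored once in a categories index. A minimal sketch using the standard .cat accessor:

```python
import pandas as pd

s = pd.Series(['laptop', 'mouse', 'laptop', 'keyboard']).astype('category')

# The labels are stored once, sorted lexically by default
print(s.cat.categories)
# Index(['keyboard', 'laptop', 'mouse'], dtype='object')

# Each row holds only an integer code pointing into that index
print(s.cat.codes.tolist())  # [1, 2, 1, 0]
```

Because the codes are small integers (int8 for up to 127 categories), repeated long strings collapse to one stored copy each.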
Basic Conversion Methods
Using astype()
The most straightforward approach converts existing columns directly:
# Single column conversion
df['product'] = df['product'].astype('category')
# Multiple columns at once
df[['product', 'priority']] = df[['product', 'priority']].astype('category')
# During DataFrame creation
df = pd.DataFrame({
'product': pd.Categorical(['laptop', 'mouse', 'keyboard']),
'count': [10, 5, 8]
})
Using pd.Categorical() Constructor
This method provides explicit control over categories and ordering:
# Define specific categories (including unused ones)
categories = ['laptop', 'mouse', 'keyboard', 'monitor', 'headset']
df['product'] = pd.Categorical(df['product'], categories=categories)
# Check all possible categories
print(df['product'].cat.categories)
# Output: Index(['laptop', 'mouse', 'keyboard', 'monitor', 'headset'], dtype='object')
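One side effect worth knowing when declaring categories explicitly: values that are not in the declared set are not kept as-is, they silently become missing. A small sketch:

```python
import pandas as pd

# 'tablet' is not among the declared categories
allowed = ['laptop', 'mouse', 'keyboard']
s = pd.Categorical(['laptop', 'tablet', 'mouse'], categories=allowed)

# The out-of-set value is replaced with NaN rather than added as a category
print(pd.isna(s))  # [False  True False]
```

Check isna() after this kind of conversion if dropped values would be a data-quality problem.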
Using convert_dtypes()
Pandas can infer better dtypes automatically, but note that convert_dtypes() targets the nullable extension types (string, Int64, boolean), not category; converting to categorical still requires an explicit astype('category'):
df = pd.DataFrame({
'product': ['laptop', 'mouse', 'keyboard'] * 100,
'price': [1200, 25, 80] * 100
})
# Infers nullable dtypes: 'product' becomes string, 'price' becomes Int64
df_converted = df.convert_dtypes()
print(df_converted.dtypes)
Ordered Categories
Ordered categories enable comparison operations and proper sorting for ordinal data:
# Create ordered categorical
priority_order = ['low', 'medium', 'high', 'critical']
df['priority'] = pd.Categorical(
df['priority'],
categories=priority_order,
ordered=True
)
# Now comparisons work logically
high_priority = df[df['priority'] >= 'high']
# Sorting respects the defined order
df_sorted = df.sort_values('priority')
# Check if ordered
print(df['priority'].cat.ordered) # True
For education levels, size categories, or any ordinal data:
education_data = pd.DataFrame({
'degree': ['Bachelor', 'PhD', 'Master', 'High School', 'Bachelor']
})
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
education_data['degree'] = pd.Categorical(
education_data['degree'],
categories=education_order,
ordered=True
)
# Filter for advanced degrees
advanced = education_data[education_data['degree'] > 'Bachelor']
Converting Continuous Data to Categories
Using pd.cut() for Binning
Transform continuous numerical data into categorical bins:
# Age ranges
df = pd.DataFrame({'age': [25, 45, 67, 23, 89, 34, 56, 12]})
df['age_group'] = pd.cut(
df['age'],
bins=[0, 18, 35, 60, 100],
labels=['Child', 'Young Adult', 'Adult', 'Senior']
)
print(df)
#    age    age_group
# 0   25  Young Adult
# 1   45        Adult
# 2   67       Senior
# ...
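If you omit labels, pd.cut reports the interval each value landed in, which also makes the default right-inclusive bin edges visible. A quick sketch:

```python
import pandas as pd

ages = pd.Series([25, 45, 67])

# Without labels, each value is tagged with its interval;
# bins are right-inclusive by default: (0, 18], (18, 35], (35, 60], (60, 100]
groups = pd.cut(ages, bins=[0, 18, 35, 60, 100])
print(groups.tolist())  # [(18, 35], (35, 60], (60, 100]]

# Pass right=False for left-inclusive bins: [0, 18), [18, 35), ...
```

This matters at the boundaries: with the default, an age of exactly 18 falls into (0, 18], not the next bin.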
Using pd.qcut() for Quantile-Based Binning
Create equal-sized bins based on distribution:
# Divide into quartiles
df = pd.DataFrame({'revenue': np.random.randint(1000, 50000, 1000)})
df['revenue_quartile'] = pd.qcut(
df['revenue'],
q=4,
labels=['Q1', 'Q2', 'Q3', 'Q4']
)
# Check distribution
print(df['revenue_quartile'].value_counts().sort_index())
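With heavily skewed data, quantile edges can collide, and pd.qcut raises "Bin edges must be unique" by default. The duplicates='drop' option merges the duplicate edges, at the cost of fewer bins than requested. A sketch:

```python
import pandas as pd

# Five identical values push several quartile edges onto the same number
skewed = pd.Series([1, 1, 1, 1, 1, 2, 3, 4])

# q=4 is requested, but duplicate edges are dropped, leaving fewer bins
binned = pd.qcut(skewed, q=4, duplicates='drop')
print(binned.value_counts().sort_index())
```

When bins get merged this way, a fixed labels list of length q no longer matches the bin count, so it is safer to omit labels here.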
Managing Category Operations
Adding and Removing Categories
df = pd.DataFrame({'size': ['S', 'M', 'L', 'M', 'S']})
df['size'] = df['size'].astype('category')
# Add new category
df['size'] = df['size'].cat.add_categories(['XL', 'XXL'])
# Remove a specific category (any values in it would become NaN)
df['size'] = df['size'].cat.remove_categories(['XXL'])
# Remove unused categories automatically
df['size'] = df['size'].cat.remove_unused_categories()
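When you need to add, drop, and reorder in one step, set_categories combines all three. A short sketch:

```python
import pandas as pd

s = pd.Series(['S', 'M', 'L', 'M', 'S']).astype('category')

# One call: 'XL' is added, any category not listed would be dropped,
# and the categories take on the given order
s = s.cat.set_categories(['S', 'M', 'L', 'XL'], ordered=True)

print(s.cat.categories.tolist())  # ['S', 'M', 'L', 'XL']
print(s.min())                    # 'S' -- ordered comparisons now work
```

This is usually tidier than chaining add_categories, remove_categories, and reorder_categories.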
Renaming Categories
# Rename specific categories
df['size'] = df['size'].cat.rename_categories({
'S': 'Small',
'M': 'Medium',
'L': 'Large',
'XL': 'Extra Large'
})
# Or pass a full-length list, which maps positionally onto
# df['size'].cat.categories (here sorted lexically: 'L', 'M', 'S', 'XL')
new_names = ['Large', 'Medium', 'Small', 'Extra Large']
df['size'] = df['size'].cat.rename_categories(new_names)
Reordering Categories
df['size'] = df['size'].cat.reorder_categories(
['Small', 'Medium', 'Large', 'Extra Large'],
ordered=True
)
Performance Considerations
Categorical dtype excels with low cardinality data but adds overhead for high cardinality:
import time
# Low cardinality scenario
low_card = pd.Series(['A', 'B', 'C'] * 100000)
start = time.time()
low_card_cat = low_card.astype('category')
low_card_time = time.time() - start
# High cardinality scenario (many unique values)
high_card = pd.Series([f'value_{i}' for i in range(300000)])
start = time.time()
high_card_cat = high_card.astype('category')
high_card_time = time.time() - start
print(f"Low cardinality: {low_card_time:.4f}s, high cardinality: {high_card_time:.4f}s")
# Groupby operations are faster with categorical
df = pd.DataFrame({
'category': pd.Categorical(['A', 'B', 'C'] * 100000),
'value': np.random.randn(300000)
})
# Categorical groupby is optimized
result = df.groupby('category')['value'].mean()
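One groupby subtlety with categorical keys: by default, every declared category appears in the result, including unused ones; passing observed=True restricts the output to categories actually present. A sketch of the difference, using a hypothetical 'grp' column:

```python
import pandas as pd

df = pd.DataFrame({
    'grp': pd.Categorical(['A', 'B', 'A'], categories=['A', 'B', 'C']),
    'val': [1.0, 2.0, 3.0],
})

# observed=False: unused category 'C' shows up with a count of 0
print(df.groupby('grp', observed=False)['val'].count())

# observed=True: only categories that actually occur in the data
print(df.groupby('grp', observed=True)['val'].count())
```

With many categorical keys, the observed=False cross-product of all declared categories can balloon the result, so observed=True is often the right call.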
Handling Missing Values
Categorical columns handle missing values differently from object columns: NaN is tracked alongside the codes and is never itself a category:
df = pd.DataFrame({
'status': ['active', 'inactive', None, 'active', 'pending']
})
# NaN is not a category by default
df['status'] = df['status'].astype('category')
print(df['status'].cat.categories) # ['active', 'inactive', 'pending']
# Check for missing values
print(df['status'].isna().sum())
# Fill missing with a category
df['status'] = df['status'].cat.add_categories(['unknown'])
df['status'] = df['status'].fillna('unknown')
Practical Example: Customer Segmentation
# Real-world customer data
customers = pd.DataFrame({
'customer_id': range(1000),
'region': np.random.choice(['North', 'South', 'East', 'West'], 1000),
'tier': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], 1000),
'spend': np.random.randint(100, 10000, 1000)
})
# Convert to categorical
customers['region'] = customers['region'].astype('category')
# Ordered categorical for tier
tier_order = ['Bronze', 'Silver', 'Gold', 'Platinum']
customers['tier'] = pd.Categorical(
customers['tier'],
categories=tier_order,
ordered=True
)
# Create spend categories
customers['spend_level'] = pd.qcut(
customers['spend'],
q=3,
labels=['Low', 'Medium', 'High']
)
# Efficient groupby operations
summary = customers.groupby(['region', 'tier']).agg({
'spend': ['mean', 'count']
}).round(2)
print(f"Region column memory (categorical): {customers['region'].memory_usage(deep=True) / 1024:.2f} KB")
Categorical conversion is essential for efficient data analysis in Pandas, particularly with large datasets containing repetitive string values. Choose the conversion method based on whether you need ordered categories, custom category sets, or automatic binning of continuous data.