Pandas - Convert Column to Categorical
Key Insights
- Converting columns to categorical dtype reduces memory usage by up to 90% for columns with repeated string values, while enabling specialized categorical operations and ordered comparisons
- Pandas offers multiple conversion methods, including astype('category'), pd.Categorical(), and pd.cut() for binning, each suited to different use cases and performance requirements
- Categorical columns with ordered categories unlock comparison operations and efficient sorting, essential for ordinal data like ratings, education levels, or priority classifications
Understanding Categorical Data in Pandas
Categorical data represents a fixed set of possible values, typically strings or integers representing discrete groups. In Pandas, the categorical dtype stores data internally as integer codes mapped to category labels, dramatically reducing memory footprint compared to object dtype strings.
import pandas as pd
import numpy as np
# Create sample data with repetitive values
df = pd.DataFrame({
'product': ['laptop', 'mouse', 'keyboard', 'laptop', 'mouse'] * 10000,
'priority': ['high', 'low', 'medium', 'high', 'low'] * 10000
})
# Check memory usage before conversion
print(f"Object dtype memory: {df['product'].memory_usage(deep=True) / 1024:.2f} KB")
# Convert to categorical
df['product'] = df['product'].astype('category')
print(f"Categorical dtype memory: {df['product'].memory_usage(deep=True) / 1024:.2f} KB")
This conversion typically shows 80-90% memory reduction for columns with low cardinality (few unique values relative to total rows).
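The savings come from the representation itself: each row stores a small integer code, and the labels are stored once in a categories index. A minimal sketch using the standard .cat accessor:

```python
import pandas as pd

s = pd.Series(['laptop', 'mouse', 'laptop', 'keyboard']).astype('category')

# The labels are stored once, sorted lexically by default
print(s.cat.categories)
# Index(['keyboard', 'laptop', 'mouse'], dtype='object')

# Each row holds only an integer code pointing into that index
print(s.cat.codes.tolist())  # [1, 2, 1, 0]
```

Because the codes are small integers (int8 for up to 127 categories), repeated long strings collapse to one stored copy each.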
Basic Conversion Methods
Using astype()
The most straightforward approach converts existing columns directly:
# Single column conversion
df['product'] = df['product'].astype('category')
# Multiple columns at once
df[['product', 'priority']] = df[['product', 'priority']].astype('category')
# During DataFrame creation
df = pd.DataFrame({
'product': pd.Categorical(['laptop', 'mouse', 'keyboard']),
'count': [10, 5, 8]
})
Using pd.Categorical() Constructor
This method provides explicit control over categories and ordering:
# Define specific categories (including unused ones)
categories = ['laptop', 'mouse', 'keyboard', 'monitor', 'headset']
df['product'] = pd.Categorical(df['product'], categories=categories)
# Check all possible categories
print(df['product'].cat.categories)
# Output: Index(['laptop', 'mouse', 'keyboard', 'monitor', 'headset'], dtype='object')
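One side effect worth knowing when declaring categories explicitly: values that are not in the declared set are not kept as-is, they silently become missing. A small sketch:

```python
import pandas as pd

# 'tablet' is not among the declared categories
allowed = ['laptop', 'mouse', 'keyboard']
s = pd.Categorical(['laptop', 'tablet', 'mouse'], categories=allowed)

# The out-of-set value is replaced with NaN rather than added as a category
print(pd.isna(s))  # [False  True False]
```

Check isna() after this kind of conversion if dropped values would be a data-quality problem.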
Using convert_dtypes()
Pandas can infer better dtypes automatically, but note that convert_dtypes() targets the nullable extension types (string, Int64, boolean), not category; converting to categorical still requires an explicit astype('category'):
df = pd.DataFrame({
'product': ['laptop', 'mouse', 'keyboard'] * 100,
'price': [1200, 25, 80] * 100
})
# Infers nullable dtypes: 'product' becomes string, 'price' becomes Int64
df_converted = df.convert_dtypes()
print(df_converted.dtypes)
Ordered Categories
Ordered categories enable comparison operations and proper sorting for ordinal data:
# Create ordered categorical
priority_order = ['low', 'medium', 'high', 'critical']
df['priority'] = pd.Categorical(
df['priority'],
categories=priority_order,
ordered=True
)
# Now comparisons work logically
high_priority = df[df['priority'] >= 'high']
# Sorting respects the defined order
df_sorted = df.sort_values('priority')
# Check if ordered
print(df['priority'].cat.ordered) # True
For education levels, size categories, or any ordinal data:
education_data = pd.DataFrame({
'degree': ['Bachelor', 'PhD', 'Master', 'High School', 'Bachelor']
})
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
education_data['degree'] = pd.Categorical(
education_data['degree'],
categories=education_order,
ordered=True
)
# Filter for advanced degrees
advanced = education_data[education_data['degree'] > 'Bachelor']
Converting Continuous Data to Categories
Using pd.cut() for Binning
Transform continuous numerical data into categorical bins:
# Age ranges
df = pd.DataFrame({'age': [25, 45, 67, 23, 89, 34, 56, 12]})
df['age_group'] = pd.cut(
df['age'],
bins=[0, 18, 35, 60, 100],
labels=['Child', 'Young Adult', 'Adult', 'Senior']
)
print(df)
#    age    age_group
# 0   25  Young Adult
# 1   45        Adult
# 2   67       Senior
# ...
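If you omit labels, pd.cut reports the interval each value landed in, which also makes the default right-inclusive bin edges visible. A quick sketch:

```python
import pandas as pd

ages = pd.Series([25, 45, 67])

# Without labels, each value is tagged with its interval;
# bins are right-inclusive by default: (0, 18], (18, 35], (35, 60], (60, 100]
groups = pd.cut(ages, bins=[0, 18, 35, 60, 100])
print(groups.tolist())  # [(18, 35], (35, 60], (60, 100]]

# Pass right=False for left-inclusive bins: [0, 18), [18, 35), ...
```

This matters at the boundaries: with the default, an age of exactly 18 falls into (0, 18], not the next bin.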
Using pd.qcut() for Quantile-Based Binning
Create equal-sized bins based on distribution:
# Divide into quartiles
df = pd.DataFrame({'revenue': np.random.randint(1000, 50000, 1000)})
df['revenue_quartile'] = pd.qcut(
df['revenue'],
q=4,
labels=['Q1', 'Q2', 'Q3', 'Q4']
)
# Check distribution
print(df['revenue_quartile'].value_counts().sort_index())
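With heavily skewed data, quantile edges can collide, and pd.qcut raises "Bin edges must be unique" by default. The duplicates='drop' option merges the duplicate edges, at the cost of fewer bins than requested. A sketch:

```python
import pandas as pd

# Five identical values push several quartile edges onto the same number
skewed = pd.Series([1, 1, 1, 1, 1, 2, 3, 4])

# q=4 is requested, but duplicate edges are dropped, leaving fewer bins
binned = pd.qcut(skewed, q=4, duplicates='drop')
print(binned.value_counts().sort_index())
```

When bins get merged this way, a fixed labels list of length q no longer matches the bin count, so it is safer to omit labels here.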
Managing Category Operations
Adding and Removing Categories
df = pd.DataFrame({'size': ['S', 'M', 'L', 'M', 'S']})
df['size'] = df['size'].astype('category')
# Add new category
df['size'] = df['size'].cat.add_categories(['XL', 'XXL'])
# Remove a specific category (any values in it would become NaN)
df['size'] = df['size'].cat.remove_categories(['XXL'])
# Remove unused categories automatically
df['size'] = df['size'].cat.remove_unused_categories()
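When you need to add, drop, and reorder in one step, set_categories combines all three. A short sketch:

```python
import pandas as pd

s = pd.Series(['S', 'M', 'L', 'M', 'S']).astype('category')

# One call: 'XL' is added, any category not listed would be dropped,
# and the categories take on the given order
s = s.cat.set_categories(['S', 'M', 'L', 'XL'], ordered=True)

print(s.cat.categories.tolist())  # ['S', 'M', 'L', 'XL']
print(s.min())                    # 'S' -- ordered comparisons now work
```

This is usually tidier than chaining add_categories, remove_categories, and reorder_categories.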
Renaming Categories
# Rename specific categories
df['size'] = df['size'].cat.rename_categories({
'S': 'Small',
'M': 'Medium',
'L': 'Large',
'XL': 'Extra Large'
})
# Or pass a full-length list, which maps positionally onto
# df['size'].cat.categories (here sorted lexically: 'L', 'M', 'S', 'XL')
new_names = ['Large', 'Medium', 'Small', 'Extra Large']
df['size'] = df['size'].cat.rename_categories(new_names)
Reordering Categories
df['size'] = df['size'].cat.reorder_categories(
['Small', 'Medium', 'Large', 'Extra Large'],
ordered=True
)
Performance Considerations
Categorical dtype excels with low cardinality data but adds overhead for high cardinality:
import time
# Low cardinality scenario
low_card = pd.Series(['A', 'B', 'C'] * 100000)
start = time.time()
low_card_cat = low_card.astype('category')
low_card_time = time.time() - start
# High cardinality scenario (many unique values)
high_card = pd.Series([f'value_{i}' for i in range(300000)])
start = time.time()
high_card_cat = high_card.astype('category')
high_card_time = time.time() - start
print(f"Low cardinality: {low_card_time:.4f}s, high cardinality: {high_card_time:.4f}s")
# Groupby operations are faster with categorical
df = pd.DataFrame({
'category': pd.Categorical(['A', 'B', 'C'] * 100000),
'value': np.random.randn(300000)
})
# Categorical groupby is optimized
result = df.groupby('category')['value'].mean()
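One groupby subtlety with categorical keys: by default, every declared category appears in the result, including unused ones; passing observed=True restricts the output to categories actually present. A sketch of the difference, using a hypothetical 'grp' column:

```python
import pandas as pd

df = pd.DataFrame({
    'grp': pd.Categorical(['A', 'B', 'A'], categories=['A', 'B', 'C']),
    'val': [1.0, 2.0, 3.0],
})

# observed=False: unused category 'C' shows up with a count of 0
print(df.groupby('grp', observed=False)['val'].count())

# observed=True: only categories that actually occur in the data
print(df.groupby('grp', observed=True)['val'].count())
```

With many categorical keys, the observed=False cross-product of all declared categories can balloon the result, so observed=True is often the right call.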
Handling Missing Values
Categorical columns handle missing values differently from object columns: NaN is tracked alongside the codes and is never itself a category:
df = pd.DataFrame({
'status': ['active', 'inactive', None, 'active', 'pending']
})
# NaN is not a category by default
df['status'] = df['status'].astype('category')
print(df['status'].cat.categories) # ['active', 'inactive', 'pending']
# Check for missing values
print(df['status'].isna().sum())
# Fill missing with a category
df['status'] = df['status'].cat.add_categories(['unknown'])
df['status'] = df['status'].fillna('unknown')
Practical Example: Customer Segmentation
# Real-world customer data
customers = pd.DataFrame({
'customer_id': range(1000),
'region': np.random.choice(['North', 'South', 'East', 'West'], 1000),
'tier': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], 1000),
'spend': np.random.randint(100, 10000, 1000)
})
# Convert to categorical
customers['region'] = customers['region'].astype('category')
# Ordered categorical for tier
tier_order = ['Bronze', 'Silver', 'Gold', 'Platinum']
customers['tier'] = pd.Categorical(
customers['tier'],
categories=tier_order,
ordered=True
)
# Create spend categories
customers['spend_level'] = pd.qcut(
customers['spend'],
q=3,
labels=['Low', 'Medium', 'High']
)
# Efficient groupby operations
summary = customers.groupby(['region', 'tier']).agg({
'spend': ['mean', 'count']
}).round(2)
print(f"Region column memory (categorical): {customers['region'].memory_usage(deep=True) / 1024:.2f} KB")
Categorical conversion is essential for efficient data analysis in Pandas, particularly with large datasets containing repetitive string values. Choose the conversion method based on whether you need ordered categories, custom category sets, or automatic binning of continuous data.