Pandas - Dummy Variables (get_dummies)

Key Insights

  • pd.get_dummies() converts categorical variables into binary indicator columns, essential for machine learning models that require numerical input
  • Avoid the dummy variable trap with drop_first=True to prevent multicollinearity, and use prefix parameters to keep column names clear in production code
  • For large datasets or production pipelines, combine get_dummies() with pd.concat() and consider memory-efficient alternatives like pd.factorize() or sklearn’s OneHotEncoder for consistent train/test encoding

Understanding Dummy Variables

Dummy variables transform categorical data into a binary format where each unique category becomes a separate column with 1/0 values. This encoding is critical because most machine learning algorithms operate on numerical data exclusively.

import pandas as pd
import numpy as np

# Sample dataset
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'region': ['North', 'South', 'East', 'North', 'West'],
    'subscription': ['Premium', 'Basic', 'Premium', 'Basic', 'Standard']
})

# Basic dummy variable creation
dummies = pd.get_dummies(df['region'])
print(dummies)

Output:

   East  North  South  West
0     0      1      0     0
1     0      0      1     0
2     1      0      0     0
3     0      1      0     0
4     0      0      0     1

Handling the Dummy Variable Trap

The dummy variable trap occurs when independent variables are highly correlated, causing multicollinearity in regression models. If you have n categories, you only need n-1 dummy variables because the last category is implicitly defined when all others are zero.

# Avoid multicollinearity with drop_first
df_encoded = pd.get_dummies(df['region'], drop_first=True)
print(df_encoded)

# Output shows one less column
#    North  South  West
# 0      1      0     0
# 1      0      1     0
# 2      0      0     0  # East is the reference category
# 3      1      0     0
# 4      0      0     1

For multiple categorical columns:

# Encode entire DataFrame
df_full = pd.get_dummies(df, columns=['region', 'subscription'], drop_first=True)
print(df_full)

Prefix Management for Column Clarity

When encoding multiple categorical variables, prefixes prevent column name collisions and maintain traceability to original features.

# Without prefix - ambiguous column names
# (encoding each Series separately drops the column-name prefix)
bad_practice = pd.concat(
    [pd.get_dummies(df['region']), pd.get_dummies(df['subscription'])],
    axis=1
)
print(bad_practice.columns.tolist())
# ['East', 'North', 'South', 'West', 'Basic', 'Premium', 'Standard']

# With automatic prefix from column names
good_practice = pd.get_dummies(df, columns=['region', 'subscription'])
print(good_practice.columns.tolist())
# ['customer_id', 'region_East', 'region_North', 'region_South', 
#  'region_West', 'subscription_Basic', 'subscription_Premium', 
#  'subscription_Standard']

# Custom prefix for clarity
custom_prefix = pd.get_dummies(df['region'], prefix='geo')
print(custom_prefix.columns.tolist())
# ['geo_East', 'geo_North', 'geo_South', 'geo_West']

Handling Missing Values

By default, get_dummies() ignores NaN values: rows with missing data get zeros in every indicator column. Use dummy_na=True to create a dedicated indicator column for missing data.

df_missing = pd.DataFrame({
    'status': ['active', 'inactive', np.nan, 'active', 'pending', np.nan]
})

# Default behavior - NaN rows get all zeros
default_handling = pd.get_dummies(df_missing['status'])
print(default_handling)

# Explicit NaN handling
with_nan = pd.get_dummies(df_missing['status'], dummy_na=True)
print(with_nan)
#    active  inactive  pending  NaN
# 0       1         0        0    0
# 1       0         1        0    0
# 2       0         0        0    1  # NaN gets its own column
# 3       1         0        0    0
# 4       0         0        1    0
# 5       0         0        0    1
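A common alternative, sketched here, is to fill NaN with an explicit sentinel label before encoding, so the indicator column gets a readable name instead of NaN:

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'status': ['active', 'inactive', np.nan, 'active', 'pending', np.nan]
})

# Replace NaN with a sentinel label before encoding; 'missing' is an
# arbitrary choice and must not collide with a real category
labeled = pd.get_dummies(df_missing['status'].fillna('missing'))
print(labeled.columns.tolist())
# ['active', 'inactive', 'missing', 'pending']
```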

Production Pattern: Consistent Train-Test Encoding

A common pitfall is encoding train and test sets separately, resulting in different column structures. Here’s the production-ready approach:

from sklearn.model_selection import train_test_split

# Sample dataset
df_sales = pd.DataFrame({
    'product': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'D'],
    'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'sales': [100, 150, 200, 120, 180, 210, 110, 190]
})

train, test = train_test_split(df_sales, test_size=0.25, random_state=42)

# WRONG: Encoding separately causes column mismatch
train_wrong = pd.get_dummies(train, columns=['product', 'region'])
test_wrong = pd.get_dummies(test, columns=['product', 'region'])
# Test set might be missing 'product_D' if it wasn't in the sample

# CORRECT: Encode the training set, then align the test set to its columns
train_encoded = pd.get_dummies(train, columns=['product', 'region'])

# Reindex test set to match training columns
test_encoded = pd.get_dummies(test, columns=['product', 'region'])
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(f"Train columns: {len(train_encoded.columns)}")
print(f"Test columns: {len(test_encoded.columns)}")
print(f"Columns match: {train_encoded.columns.equals(test_encoded.columns)}")
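An alternative to reindexing, sketched below, is to fix the category set up front with pd.Categorical; get_dummies then emits a column for every declared category, observed or not, so any split encodes to the same shape (in practice the category list would come from the training data or a schema):

```python
import pandas as pd

df_sales = pd.DataFrame({
    'product': ['A', 'B', 'C', 'A'],
    'sales': [100, 150, 200, 120]
})

# Declare the full category set explicitly; 'D' never occurs in this
# sample but still gets its own indicator column
all_products = ['A', 'B', 'C', 'D']
df_sales['product'] = pd.Categorical(df_sales['product'],
                                     categories=all_products)

encoded = pd.get_dummies(df_sales, columns=['product'])
print(encoded.columns.tolist())
# ['sales', 'product_A', 'product_B', 'product_C', 'product_D']
```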

Memory-Efficient Alternatives

For datasets with high-cardinality categorical variables, get_dummies() produces a wide, mostly-zero DataFrame that is stored densely by default and can consume significant memory.

# High cardinality example
df_large = pd.DataFrame({
    'user_id': range(10000),
    'city': np.random.choice([f'City_{i}' for i in range(1000)], 10000)
})

# Standard approach - memory intensive
dense_dummies = pd.get_dummies(df_large, columns=['city'])
print(f"Dense shape: {dense_dummies.shape}")  # (10000, 1001)
print(f"Memory: {dense_dummies.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Sparse matrix approach
sparse_dummies = pd.get_dummies(df_large, columns=['city'], sparse=True)
print(f"Sparse memory: {sparse_dummies.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Alternative: factorize for ordinal encoding
df_large['city_code'] = pd.factorize(df_large['city'])[0]
print(f"Factorized memory: {df_large.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
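If the sparse dummies feed into scikit-learn, they can be handed over as a SciPy matrix without ever densifying. A sketch using pandas' DataFrame.sparse.to_coo() accessor, which requires every selected column to be sparse:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'user_id': range(1000),
    'city': rng.choice([f'City_{i}' for i in range(100)], 1000)
})

# uint8 keeps the sparse values numeric for the SciPy conversion
sparse_dummies = pd.get_dummies(df, columns=['city'],
                                sparse=True, dtype=np.uint8)

# Select only the sparse indicator columns; to_coo() would raise
# on the dense user_id column
city_cols = [c for c in sparse_dummies.columns if c.startswith('city_')]
X = sparse_dummies[city_cols].sparse.to_coo().tocsr()
print(X.shape)  # (1000, number of distinct cities sampled)
```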

Combining with Original DataFrame

Integrate dummy variables back into your dataset while preserving other columns:

df_original = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'category': ['A', 'B', 'A', 'C'],
    'value': [10, 20, 30, 40],
    'date': pd.date_range('2024-01-01', periods=4)
})

# Method 1: Specify columns parameter
result1 = pd.get_dummies(df_original, columns=['category'], drop_first=True)

# Method 2: Manual concatenation for more control
dummies = pd.get_dummies(df_original['category'], prefix='cat', drop_first=True)
result2 = pd.concat([df_original.drop('category', axis=1), dummies], axis=1)

print(result2)
#    id  value       date  cat_B  cat_C
# 0   1     10 2024-01-01      0      0
# 1   2     20 2024-01-02      1      0
# 2   3     30 2024-01-03      0      0
# 3   4     40 2024-01-04      0      1

Data Type Optimization

Control the data type of dummy variables to optimize memory usage:

df_types = pd.DataFrame({
    'category': ['Low', 'Medium', 'High', 'Low', 'High']
})

# Default dtype depends on the pandas version: bool since 2.0, uint8 before
default = pd.get_dummies(df_types['category'])
print(f"Default dtype: {default.dtypes.unique()}")  # [bool] on pandas >= 2.0

# Force different dtype
as_int = pd.get_dummies(df_types['category'], dtype=int)
print(f"Int dtype: {as_int.dtypes.unique()}")  # int64

# Boolean for logical operations
as_bool = pd.get_dummies(df_types['category'], dtype=bool)
print(f"Bool dtype: {as_bool.dtypes.unique()}")  # bool
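Since pandas 1.5, the encoding can also be reversed with pd.from_dummies, which is handy for inspecting encoded data. A sketch; note the sep argument must match the separator used when the prefix was attached:

```python
import pandas as pd

df_types = pd.DataFrame({
    'category': ['Low', 'Medium', 'High', 'Low', 'High']
})

# Round-trip: encode with a prefix, then recover the original labels
encoded = pd.get_dummies(df_types['category'], prefix='category')
recovered = pd.from_dummies(encoded, sep='_')
print(recovered['category'].tolist())
# ['Low', 'Medium', 'High', 'Low', 'High']
```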

Real-World Pipeline Example

def prepare_features(df, categorical_cols, drop_first=True):
    """
    Production function for consistent dummy variable encoding
    """
    
    # Create dummies
    df_encoded = pd.get_dummies(
        df,
        columns=categorical_cols,
        drop_first=drop_first,
        dtype=np.uint8
    )
    
    # Return encoded DataFrame and column names for future use
    feature_columns = df_encoded.columns.tolist()
    
    return df_encoded, feature_columns

# Usage
df_train = pd.DataFrame({
    'age': [25, 30, 35, 40],
    'city': ['NYC', 'LA', 'NYC', 'SF'],
    'income': [50000, 60000, 70000, 80000]
})

df_encoded, columns = prepare_features(df_train, categorical_cols=['city'])
print(df_encoded.head())

Use get_dummies() for rapid prototyping and datasets that fit in memory. For production systems with evolving categories or large-scale data, consider scikit-learn's OneHotEncoder, which handles unseen categories and integrates more seamlessly with ML pipelines.
