Pandas - One-Hot Encoding with get_dummies

Key Insights

  • pd.get_dummies() converts categorical variables into binary columns, producing the purely numeric representation that most machine learning algorithms require
  • The function handles multiple columns simultaneously, supports custom prefixes and separators, and can preserve or drop original columns based on your needs
  • Understanding parameters like drop_first, dtype, and dummy_na prevents common pitfalls like multicollinearity and memory bloat in production systems

Understanding One-Hot Encoding Fundamentals

One-hot encoding transforms categorical data into a numerical format by creating binary columns for each unique category. If you have a “color” column with values [“red”, “blue”, “green”], pandas creates three new columns: color_red, color_blue, and color_green, where each row gets a 1 in the appropriate column and 0s elsewhere.

import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'A'],
    'color': ['red', 'blue', 'red', 'green'],
    'price': [10, 20, 15, 12]
})

encoded = pd.get_dummies(df, columns=['color'])
print(encoded)
  product  price  color_blue  color_green  color_red
0       A     10       False        False       True
1       B     20        True        False      False
2       C     15       False        False       True
3       A     12       False         True      False

By default, get_dummies() creates boolean columns (since pandas 2.0; earlier versions returned uint8). This works, but you'll typically want integer types for machine learning pipelines.

Controlling Data Types and Column Names

The dtype parameter controls the output column type, while prefix and prefix_sep customize column naming conventions. These parameters matter when integrating with downstream systems that expect specific formats.

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'SF'],
    'category': ['A', 'B', 'A', 'C']
})

# Integer encoding with custom prefixes
encoded = pd.get_dummies(
    df,
    columns=['city', 'category'],
    prefix=['location', 'cat'],
    prefix_sep='_',
    dtype=int
)

print(encoded)
   location_LA  location_NYC  location_SF  cat_A  cat_B  cat_C
0            0             1            0      1      0      0
1            1             0            0      0      1      0
2            0             1            0      1      0      0
3            0             0            1      0      0      1

When working with multiple categorical columns, you can pass lists to prefix to assign different prefixes to each column. If you pass a string instead, pandas uses it for all columns.
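As a minimal sketch of the string form (using a hypothetical prefix 'feat'), a single string is applied to every encoded column:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA'],
    'category': ['A', 'B']
})

# One string prefix -> reused for every column in `columns`
encoded = pd.get_dummies(df, columns=['city', 'category'],
                         prefix='feat', dtype=int)
print(encoded.columns.tolist())
# ['feat_LA', 'feat_NYC', 'feat_A', 'feat_B']
```

Note that a shared prefix can produce name collisions if two columns contain the same category value, so the list form is usually safer.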

Handling the Dummy Variable Trap

The dummy variable trap occurs when encoded columns are perfectly collinear—if you know the values of n-1 dummy variables, you can deduce the nth. This causes problems in linear regression models. The drop_first parameter removes one category per column to prevent this.

df = pd.DataFrame({
    'size': ['S', 'M', 'L', 'S', 'M'],
    'quality': ['low', 'high', 'low', 'high', 'low']
})

# Without drop_first - creates dummy variable trap
full_encoding = pd.get_dummies(df, columns=['size'])
print(f"Full encoding shape: {full_encoding.shape}")
print(full_encoding.columns.tolist())

# With drop_first - removes one category per column
reduced_encoding = pd.get_dummies(df, columns=['size'], drop_first=True)
print(f"\nReduced encoding shape: {reduced_encoding.shape}")
print(reduced_encoding.columns.tolist())
Full encoding shape: (5, 4)
['quality', 'size_L', 'size_M', 'size_S']

Reduced encoding shape: (5, 3)
['quality', 'size_M', 'size_S']

For linear models, always use drop_first=True. For tree-based models like Random Forest or XGBoost, keeping all categories often performs better since these algorithms handle collinearity differently.

Processing Missing Values

The dummy_na parameter determines whether pandas creates a separate column for NaN values. This is critical when missing data carries information—for example, when “not answered” differs from “answered with option X.”

df = pd.DataFrame({
    'status': ['active', 'inactive', None, 'active', None],
    'value': [100, 200, 150, 175, 225]
})

# Without dummy_na - NaN rows get all zeros
without_na = pd.get_dummies(df, columns=['status'], dtype=int)
print("Without dummy_na:")
print(without_na)

# With dummy_na - creates explicit NaN column
with_na = pd.get_dummies(df, columns=['status'], dummy_na=True, dtype=int)
print("\nWith dummy_na:")
print(with_na)
Without dummy_na:
   value  status_active  status_inactive
0    100              1                0
1    200              0                1
2    150              0                0
3    175              1                0
4    225              0                0

With dummy_na:
   value  status_active  status_inactive  status_nan
0    100              1                0           0
1    200              0                1           0
2    150              0                0           1
3    175              1                0           0
4    225              0                0           1

Selective Column Encoding

You don’t need to encode all categorical columns. The columns parameter lets you specify exactly which columns to transform, leaving others untouched.

df = pd.DataFrame({
    'user_id': ['U1', 'U2', 'U3', 'U4'],
    'country': ['US', 'UK', 'US', 'CA'],
    'device': ['mobile', 'desktop', 'mobile', 'tablet'],
    'score': [85, 92, 78, 88]
})

# Encode only device, keep country as-is
encoded = pd.get_dummies(df, columns=['device'], dtype=int)
print(encoded)
  user_id country  score  device_desktop  device_mobile  device_tablet
0      U1      US     85               0              1              0
1      U2      UK     92               1              0              0
2      U3      US     78               0              1              0
3      U4      CA     88               0              0              1

This approach is useful when you want to preserve certain categorical columns for grouping operations or when some categories should be encoded differently (like target encoding for high-cardinality features).
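For instance, a preserved column still works as a groupby key after encoding. A small sketch, reusing the hypothetical user data above:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': ['U1', 'U2', 'U3', 'U4'],
    'country': ['US', 'UK', 'US', 'CA'],
    'device': ['mobile', 'desktop', 'mobile', 'tablet'],
    'score': [85, 92, 78, 88]
})

encoded = pd.get_dummies(df, columns=['device'], dtype=int)

# country survived encoding, so per-country device shares are one groupby away
device_share = encoded.groupby('country')[['device_mobile', 'device_desktop']].mean()
print(device_share)
```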

Working with Sparse Matrices

When dealing with high-cardinality categorical variables, one-hot encoding creates many columns, most filled with zeros. The sparse=True parameter backs each dummy column with a pandas SparseArray that stores only the non-fill values, dramatically reducing memory usage.

import numpy as np

# Create dataset with high-cardinality feature
categories = [f'cat_{i}' for i in range(100)]
df = pd.DataFrame({
    'id': range(1000),
    'category': np.random.choice(categories, 1000),
    'value': np.random.randn(1000)
})

# Dense encoding
dense = pd.get_dummies(df, columns=['category'], dtype=int)
print(f"Dense memory usage: {dense.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Sparse encoding
sparse = pd.get_dummies(df, columns=['category'], sparse=True, dtype=int)
print(f"Sparse memory usage: {sparse.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Memory reduction: {(1 - sparse.memory_usage(deep=True).sum() / dense.memory_usage(deep=True).sum()) * 100:.1f}%")

Sparse columns are particularly valuable when working with text data or user-item interactions where most values are zero. A DataFrame whose columns are all sparse can be converted to a scipy COO matrix via the DataFrame.sparse.to_coo() accessor, and many scikit-learn estimators accept scipy sparse input directly, so you can maintain this efficiency throughout your pipeline.
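You can inspect the sparsity of an encoded column through the Series.sparse accessor. A minimal sketch with deterministic data (each of 100 categories repeated 10 times, so every dummy column has density 0.01):

```python
import pandas as pd

# 1000 rows cycling through 100 categories -> each category appears 10 times
df = pd.DataFrame({'category': [f'cat_{i % 100}' for i in range(1000)]})

sparse = pd.get_dummies(df, columns=['category'], sparse=True, dtype=int)

col = sparse['category_cat_0']
print(col.dtype)           # a Sparse dtype with fill value 0
print(col.sparse.density)  # fraction of explicitly stored values: 0.01
```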

Encoding Entire DataFrames

When called without the columns parameter, get_dummies() encodes all object and categorical dtype columns automatically. This is convenient for quick prototyping but dangerous in production—adding a new categorical column to your data source will silently change your feature space.

df = pd.DataFrame({
    'numeric_id': [1, 2, 3, 4],
    'category_a': ['X', 'Y', 'X', 'Z'],
    'category_b': ['low', 'high', 'medium', 'low'],
    'amount': [100.5, 200.3, 150.7, 175.2]
})

# Encodes all non-numeric columns
encoded = pd.get_dummies(df, dtype=int)
print(encoded.columns.tolist())
['numeric_id', 'amount', 'category_a_X', 'category_a_Y', 'category_a_Z', 
 'category_b_high', 'category_b_low', 'category_b_medium']

For production systems, explicitly specify the columns parameter and validate that your expected columns exist before encoding. This prevents silent failures when data schemas change.
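One way to enforce that is a guard before the encoding call. A sketch, where EXPECTED_CATEGORICAL stands in for whatever schema your pipeline assumes:

```python
import pandas as pd

# Hypothetical schema: the categorical columns this pipeline expects
EXPECTED_CATEGORICAL = ['category_a', 'category_b']

df = pd.DataFrame({
    'numeric_id': [1, 2],
    'category_a': ['X', 'Y'],
    'category_b': ['low', 'high'],
    'amount': [100.5, 200.3]
})

# Fail loudly if the data source dropped or renamed a column
missing = set(EXPECTED_CATEGORICAL) - set(df.columns)
if missing:
    raise ValueError(f"Missing expected columns: {missing}")

encoded = pd.get_dummies(df, columns=EXPECTED_CATEGORICAL, dtype=int)
```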

Integration with Machine Learning Pipelines

One-hot encoding is typically applied after the train-test split, because the test set can contain a different set of categories than the training set, which would silently change the encoded columns. Here's a pattern that maintains consistency between training and test sets:

from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'feature_1': np.random.choice(['A', 'B', 'C'], 100),
    'feature_2': np.random.choice(['X', 'Y'], 100),
    'target': np.random.randint(0, 2, 100)
})

# Split first
train, test = train_test_split(df, test_size=0.2, random_state=42)

# Encode training data
train_encoded = pd.get_dummies(
    train, 
    columns=['feature_1', 'feature_2'],
    drop_first=True,
    dtype=int
)

# Encode test data with same structure
test_encoded = pd.get_dummies(
    test,
    columns=['feature_1', 'feature_2'],
    drop_first=True,
    dtype=int
)

# Align columns to match training set
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(f"Train shape: {train_encoded.shape}")
print(f"Test shape: {test_encoded.shape}")
print(f"Columns match: {train_encoded.columns.equals(test_encoded.columns)}")

The reindex() call ensures test data has the same columns as training data, adding missing columns with zeros. This handles cases where certain categories appear only in the training set.
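An alternative to reindexing is to fix the category set up front with pd.Categorical: get_dummies emits a column for every declared category, even ones absent from a given frame. A sketch with a hypothetical three-category feature:

```python
import pandas as pd

train = pd.DataFrame({'feature_1': ['A', 'B', 'A']})
test = pd.DataFrame({'feature_1': ['B', 'B']})  # 'A' and 'C' unseen here

# Declare the full category set on both frames before encoding
all_cats = ['A', 'B', 'C']
for frame in (train, test):
    frame['feature_1'] = pd.Categorical(frame['feature_1'], categories=all_cats)

train_enc = pd.get_dummies(train, columns=['feature_1'], dtype=int)
test_enc = pd.get_dummies(test, columns=['feature_1'], dtype=int)

# Identical columns, including an all-zero feature_1_C in both frames
print(train_enc.columns.equals(test_enc.columns))
```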
