Pandas - One-Hot Encoding with get_dummies
Key Insights
- pd.get_dummies() converts categorical variables into binary indicator columns, the numerical format most machine learning algorithms require
- The function handles multiple columns simultaneously, supports custom prefixes and separators, and can preserve or drop original columns based on your needs
- Understanding parameters like drop_first, dtype, and dummy_na prevents common pitfalls like multicollinearity and memory bloat in production systems
Understanding One-Hot Encoding Fundamentals
One-hot encoding transforms categorical data into a numerical format by creating binary columns for each unique category. If you have a “color” column with values [“red”, “blue”, “green”], pandas creates three new columns, sorted alphabetically: color_blue, color_green, and color_red, where each row gets a 1 in the appropriate column and 0s elsewhere.
import pandas as pd
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'A'],
'color': ['red', 'blue', 'red', 'green'],
'price': [10, 20, 15, 12]
})
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)
product price color_blue color_green color_red
0 A 10 False False True
1 B 20 True False False
2 C 15 False False True
3 A 12 False True False
By default, get_dummies() creates boolean columns. This works, but you’ll typically want integer types for machine learning pipelines.
Controlling Data Types and Column Names
The dtype parameter controls the output column type, while prefix and prefix_sep customize column naming conventions. These parameters matter when integrating with downstream systems that expect specific formats.
df = pd.DataFrame({
'city': ['NYC', 'LA', 'NYC', 'SF'],
'category': ['A', 'B', 'A', 'C']
})
# Integer encoding with custom prefixes
encoded = pd.get_dummies(
df,
columns=['city', 'category'],
prefix=['location', 'cat'],
prefix_sep='_',
dtype=int
)
print(encoded)
location_LA location_NYC location_SF cat_A cat_B cat_C
0 0 1 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 1 0 0
3 0 0 1 0 0 1
When working with multiple categorical columns, you can pass lists to prefix to assign different prefixes to each column. If you pass a string instead, pandas uses it for all columns.
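A small sketch of the single-string case: the same prefix is applied to every encoded column, which is concise but can produce name collisions if two columns share category values.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA'],
    'category': ['A', 'B']
})

# A single string prefix is applied to every encoded column
encoded = pd.get_dummies(df, columns=['city', 'category'], prefix='feat', dtype=int)
print(encoded.columns.tolist())
# ['feat_LA', 'feat_NYC', 'feat_A', 'feat_B']
```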
Handling the Dummy Variable Trap
The dummy variable trap occurs when encoded columns are perfectly collinear—if you know the values of n-1 dummy variables, you can deduce the nth. This causes problems in linear regression models. The drop_first parameter removes one category per column to prevent this.
df = pd.DataFrame({
'size': ['S', 'M', 'L', 'S', 'M'],
'quality': ['low', 'high', 'low', 'high', 'low']
})
# Without drop_first - creates dummy variable trap
full_encoding = pd.get_dummies(df, columns=['size'])
print(f"Full encoding shape: {full_encoding.shape}")
print(full_encoding.columns.tolist())
# With drop_first - removes one category per column
reduced_encoding = pd.get_dummies(df, columns=['size'], drop_first=True)
print(f"\nReduced encoding shape: {reduced_encoding.shape}")
print(reduced_encoding.columns.tolist())
Full encoding shape: (5, 4)
['quality', 'size_L', 'size_M', 'size_S']
Reduced encoding shape: (5, 3)
['quality', 'size_M', 'size_S']
For linear models, always use drop_first=True. For tree-based models like Random Forest or XGBoost, keeping all categories often performs better since these algorithms handle collinearity differently.
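To see the redundancy concretely, the dropped column can always be reconstructed from the remaining ones, which is exactly the collinearity that trips up linear models. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'size': ['S', 'M', 'L', 'S', 'M']})
full = pd.get_dummies(df, columns=['size'], dtype=int)

# size_L is fully determined by the other two dummy columns
recovered = 1 - full['size_M'] - full['size_S']
print((recovered == full['size_L']).all())  # True
```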
Processing Missing Values
The dummy_na parameter determines whether pandas creates a separate column for NaN values. This is critical when missing data carries information—for example, when “not answered” differs from “answered with option X.”
df = pd.DataFrame({
'status': ['active', 'inactive', None, 'active', None],
'value': [100, 200, 150, 175, 225]
})
# Without dummy_na - NaN rows get all zeros
without_na = pd.get_dummies(df, columns=['status'], dtype=int)
print("Without dummy_na:")
print(without_na)
# With dummy_na - creates explicit NaN column
with_na = pd.get_dummies(df, columns=['status'], dummy_na=True, dtype=int)
print("\nWith dummy_na:")
print(with_na)
Without dummy_na:
value status_active status_inactive
0 100 1 0
1 200 0 1
2 150 0 0
3 175 1 0
4 225 0 0
With dummy_na:
value status_active status_inactive status_nan
0 100 1 0 0
1 200 0 1 0
2 150 0 0 1
3 175 1 0 0
4 225 0 0 1
Selective Column Encoding
You don’t need to encode all categorical columns. The columns parameter lets you specify exactly which columns to transform, leaving others untouched.
df = pd.DataFrame({
'user_id': ['U1', 'U2', 'U3', 'U4'],
'country': ['US', 'UK', 'US', 'CA'],
'device': ['mobile', 'desktop', 'mobile', 'tablet'],
'score': [85, 92, 78, 88]
})
# Encode only device, keep country as-is
encoded = pd.get_dummies(df, columns=['device'], dtype=int)
print(encoded)
user_id country score device_desktop device_mobile device_tablet
0 U1 US 85 0 1 0
1 U2 UK 92 1 0 0
2 U3 US 78 0 1 0
3 U4 CA 88 0 0 1
This approach is useful when you want to preserve certain categorical columns for grouping operations or when some categories should be encoded differently (like target encoding for high-cardinality features).
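As an illustration of that mixed strategy, here is a hedged sketch that one-hot encodes the low-cardinality device column while applying a simple mean target encoding to country. For brevity the means are computed on the same frame; in practice you would fit them on training data only to avoid leakage.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['US', 'UK', 'US', 'CA'],
    'device': ['mobile', 'desktop', 'mobile', 'tablet'],
    'score': [85, 92, 78, 88]
})

# Mean target encoding for country: replace each value with the mean score
country_means = df.groupby('country')['score'].mean()
df['country_encoded'] = df['country'].map(country_means)

# One-hot encode only the low-cardinality device column
encoded = pd.get_dummies(df.drop(columns=['country']), columns=['device'], dtype=int)
print(encoded['country_encoded'].tolist())
# [81.5, 92.0, 81.5, 88.0]
```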
Working with Sparse Matrices
When dealing with high-cardinality categorical variables, one-hot encoding creates many columns, most filled with zeros. The sparse parameter returns sparse matrices that store only non-zero values, dramatically reducing memory usage.
import numpy as np
# Create dataset with high-cardinality feature
categories = [f'cat_{i}' for i in range(100)]
df = pd.DataFrame({
'id': range(1000),
'category': np.random.choice(categories, 1000),
'value': np.random.randn(1000)
})
# Dense encoding
dense = pd.get_dummies(df, columns=['category'], dtype=int)
print(f"Dense memory usage: {dense.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# Sparse encoding
sparse = pd.get_dummies(df, columns=['category'], sparse=True, dtype=int)
print(f"Sparse memory usage: {sparse.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Memory reduction: {(1 - sparse.memory_usage(deep=True).sum() / dense.memory_usage(deep=True).sum()) * 100:.1f}%")
Sparse matrices are particularly valuable when working with text data or user-item interactions where most values are zero. Many scikit-learn estimators accept sparse matrices directly, so you can maintain this efficiency throughout your pipeline.
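To hand the sparse result to scikit-learn without densifying it, the dummy columns can be converted to a scipy COO matrix via the .sparse accessor. A sketch, assuming scipy is installed; the accessor requires every column in the frame to be sparse, so select only the dummy columns first:

```python
import numpy as np
import pandas as pd

categories = [f'cat_{i}' for i in range(100)]
df = pd.DataFrame({'category': np.random.choice(categories, 1000)})

sparse_df = pd.get_dummies(df, columns=['category'], sparse=True, dtype=int)

# Every column here is a sparse dummy, so the accessor applies to the whole frame
coo = sparse_df.sparse.to_coo()
print(coo.shape)  # (1000, number of distinct categories seen)
```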
Encoding Entire DataFrames
When called without the columns parameter, get_dummies() encodes all object and categorical dtype columns automatically. This is convenient for quick prototyping but dangerous in production—adding a new categorical column to your data source will silently change your feature space.
df = pd.DataFrame({
'numeric_id': [1, 2, 3, 4],
'category_a': ['X', 'Y', 'X', 'Z'],
'category_b': ['low', 'high', 'medium', 'low'],
'amount': [100.5, 200.3, 150.7, 175.2]
})
# Encodes all non-numeric columns
encoded = pd.get_dummies(df, dtype=int)
print(encoded.columns.tolist())
['numeric_id', 'amount', 'category_a_X', 'category_a_Y', 'category_a_Z',
'category_b_high', 'category_b_low', 'category_b_medium']
For production systems, explicitly specify the columns parameter and validate that your expected columns exist before encoding. This prevents silent failures when data schemas change.
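A minimal fail-fast validation pattern (the column names here are illustrative, not from the original example):

```python
import pandas as pd

EXPECTED_CATEGORICAL = ['category_a', 'category_b']  # hypothetical schema

df = pd.DataFrame({
    'numeric_id': [1, 2],
    'category_a': ['X', 'Y'],
    'category_b': ['low', 'high'],
})

# Fail fast if the data source's schema drifted
missing = set(EXPECTED_CATEGORICAL) - set(df.columns)
if missing:
    raise ValueError(f"Missing expected columns: {missing}")

encoded = pd.get_dummies(df, columns=EXPECTED_CATEGORICAL, dtype=int)
print(encoded.columns.tolist())
```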
Integration with Machine Learning Pipelines
One-hot encoding typically occurs after train-test split to prevent data leakage. Here’s a pattern that maintains consistency between training and test sets:
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.DataFrame({
'feature_1': np.random.choice(['A', 'B', 'C'], 100),
'feature_2': np.random.choice(['X', 'Y'], 100),
'target': np.random.randint(0, 2, 100)
})
# Split first
train, test = train_test_split(df, test_size=0.2, random_state=42)
# Encode training data
train_encoded = pd.get_dummies(
train,
columns=['feature_1', 'feature_2'],
drop_first=True,
dtype=int
)
# Encode test data with same structure
test_encoded = pd.get_dummies(
test,
columns=['feature_1', 'feature_2'],
drop_first=True,
dtype=int
)
# Align columns to match training set
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(f"Train shape: {train_encoded.shape}")
print(f"Test shape: {test_encoded.shape}")
print(f"Columns match: {train_encoded.columns.equals(test_encoded.columns)}")
The reindex() call ensures test data has the same columns as training data, adding missing columns with zeros. This handles cases where certain categories appear only in the training set.
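An alternative to reindex() is to cast the column to a pandas Categorical with a fixed category list before encoding: get_dummies then emits a column for every declared category, even ones absent from a given split. A sketch with an assumed category set:

```python
import pandas as pd

known = ['A', 'B', 'C']  # assumed full category set, taken from training data

test = pd.DataFrame({'feature_1': ['A', 'A', 'B']})  # 'C' never appears here
test['feature_1'] = pd.Categorical(test['feature_1'], categories=known)

encoded = pd.get_dummies(test, columns=['feature_1'], dtype=int)
print(encoded.columns.tolist())
# ['feature_1_A', 'feature_1_B', 'feature_1_C']
```

Because the dtype carries the category list, both splits produce identical columns by construction, with the unseen category's column filled with zeros.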