How to Use Get Dummies in Pandas
Key Insights
- pd.get_dummies() transforms categorical variables into binary columns that machine learning algorithms can process, but you must use drop_first=True to avoid multicollinearity in regression models.
- Always specify the columns parameter explicitly to control which columns get encoded—relying on automatic detection leads to brittle pipelines that break when data types change.
- For production ML pipelines, consider using scikit-learn’s OneHotEncoder instead, as get_dummies() doesn’t handle unseen categories in test data gracefully.
Introduction to One-Hot Encoding
Machine learning algorithms work with numbers, not strings. When your dataset contains categorical variables like “red”, “blue”, or “green”, you need to convert them into a numerical format. One-hot encoding creates binary columns for each category, where a 1 indicates the presence of that category and 0 indicates absence.
Pandas provides pd.get_dummies() as a quick way to perform this transformation. It’s straightforward for exploratory analysis and prototyping, though it has limitations for production systems that we’ll address later.
Here’s what one-hot encoding looks like in practice:
import pandas as pd
# Original data with categorical column
df = pd.DataFrame({
'product': ['laptop', 'phone', 'tablet', 'laptop', 'phone'],
'price': [999, 699, 449, 1299, 899]
})
print("Before encoding:")
print(df)
# Apply one-hot encoding
df_encoded = pd.get_dummies(df, columns=['product'])
print("\nAfter encoding:")
print(df_encoded)
Output:
Before encoding:
product price
0 laptop 999
1 phone 699
2 tablet 449
3 laptop 1299
4 phone 899
After encoding:
price product_laptop product_phone product_tablet
0 999 True False False
1 699 False True False
2 449 False False True
3 1299 True False False
4 899 False True False
The categorical product column becomes three binary columns. Each row has exactly one True value across these new columns.
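You can verify this one-hot property directly as a sanity check after encoding: summing the new dummy columns across each row should always yield exactly 1.

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['laptop', 'phone', 'tablet', 'laptop', 'phone'],
    'price': [999, 699, 449, 1299, 899]
})
df_encoded = pd.get_dummies(df, columns=['product'])

# Sum the dummy columns row-wise; every row should total exactly 1
dummy_cols = [c for c in df_encoded.columns if c.startswith('product_')]
row_sums = df_encoded[dummy_cols].sum(axis=1)
print(row_sums.tolist())  # [1, 1, 1, 1, 1]
```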
Basic Syntax and Parameters
The function signature contains several parameters worth understanding:
pd.get_dummies(
data,
prefix=None,
prefix_sep='_',
dummy_na=False,
columns=None,
sparse=False,
drop_first=False,
dtype=None
)
The most important parameters:
- columns: List of column names to encode. If None, encodes all object/category dtype columns.
- prefix: String or list of strings to prepend to output column names.
- drop_first: Remove the first category to avoid multicollinearity.
- dtype: Output dtype for new columns. Defaults to bool in recent pandas versions.
Here’s basic usage on a single column:
import pandas as pd
df = pd.DataFrame({
'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA'],
'temperature': [75, 85, 70, 72, 88]
})
# Encode with integer output instead of boolean
encoded = pd.get_dummies(df, columns=['city'], dtype=int)
print(encoded)
Output:
temperature city_Chicago city_LA city_NYC
0 75 0 0 1
1 85 0 1 0
2 70 1 0 0
3 72 0 0 1
4 88 0 1 0
Using dtype=int produces 0/1 integers instead of True/False booleans, which some ML libraries prefer.
Handling Multiple Categorical Columns
Real datasets typically contain multiple categorical features. You can encode them all in one call:
import pandas as pd
df = pd.DataFrame({
'color': ['red', 'blue', 'green', 'red', 'blue'],
'size': ['S', 'M', 'L', 'M', 'S'],
'category': ['electronics', 'clothing', 'electronics', 'clothing', 'electronics'],
'price': [29.99, 49.99, 19.99, 39.99, 24.99]
})
# Encode all categorical columns
encoded = pd.get_dummies(df, columns=['color', 'size', 'category'], dtype=int)
print(encoded)
Output:
price color_blue color_green color_red size_L size_M size_S category_clothing category_electronics
0 29.99 0 0 1 0 0 1 0 1
1 49.99 1 0 0 0 1 0 1 0
2 19.99 0 1 0 1 0 0 0 1
3 39.99 0 0 1 0 1 0 1 0
4 24.99 1 0 0 0 0 1 0 1
I strongly recommend always specifying the columns parameter explicitly. Without it, get_dummies() automatically encodes all columns with object or category dtype. This creates problems when:
- A numeric column gets read as object due to a single malformed value
- You add new columns to your pipeline that shouldn’t be encoded
- Column dtypes change between environments
Explicit is better than implicit. Specify your columns.
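To illustrate the risk, here is a small sketch (with a made-up score column) where a single malformed value turns a numeric column into object dtype, and omitting the columns argument silently one-hot encodes it:

```python
import pandas as pd

# 'score' should be numeric, but one malformed entry makes it object dtype,
# as often happens when reading a messy CSV
df = pd.DataFrame({
    'color': ['red', 'blue', 'red'],
    'score': ['10', '20', 'N/A']
})

# Without columns=, get_dummies encodes every object column -- including score
implicit = pd.get_dummies(df)
print(implicit.columns.tolist())
# ['color_blue', 'color_red', 'score_10', 'score_20', 'score_N/A']

# Explicit column selection leaves score untouched so you can clean it properly
explicit = pd.get_dummies(df, columns=['color'])
print(explicit.columns.tolist())
# ['score', 'color_blue', 'color_red']
```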
Avoiding the Dummy Variable Trap
The dummy variable trap occurs when encoded columns are perfectly multicollinear—one column can be predicted from the others. If you know a product is not a laptop and not a phone, it must be a tablet. This redundancy causes problems in linear regression and similar models.
The solution is dropping one category per encoded variable:
import pandas as pd
df = pd.DataFrame({
'region': ['North', 'South', 'East', 'West', 'North'],
'sales': [100, 150, 120, 180, 110]
})
# Without drop_first - creates multicollinearity
full_encoding = pd.get_dummies(df, columns=['region'], dtype=int)
print("Full encoding (4 columns):")
print(full_encoding)
# With drop_first - avoids multicollinearity
reduced_encoding = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)
print("\nReduced encoding (3 columns):")
print(reduced_encoding)
Output:
Full encoding (4 columns):
sales region_East region_North region_South region_West
0 100 0 1 0 0
1 150 0 0 1 0
2 120 1 0 0 0
3 180 0 0 0 1
4 110 0 1 0 0
Reduced encoding (3 columns):
sales region_North region_South region_West
0 100 1 0 0
1 150 0 1 0
2 120 0 0 0
3 180 0 0 1
4 110 1 0 0
With drop_first=True, “East” becomes the reference category. A row with all zeros in the region columns represents East. This preserves all information while eliminating redundancy.
Use drop_first=True for regression models. Tree-based models like Random Forest and XGBoost don’t require this since they’re not affected by multicollinearity.
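Note that drop_first always removes the first category in sorted order. If you want a specific baseline instead, one option, sketched below, is to cast the column to a pandas Categorical whose category order puts your chosen baseline first (here "West" is an arbitrary choice for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West', 'North'],
    'sales': [100, 150, 120, 180, 110]
})

# Put 'West' first so drop_first=True removes it, making it the reference category
df['region'] = pd.Categorical(
    df['region'], categories=['West', 'East', 'North', 'South']
)

encoded = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)
print(encoded.columns.tolist())
# ['sales', 'region_East', 'region_North', 'region_South']
```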
Customizing Output with Prefixes
Default column names like region_North work fine, but you might want cleaner names for reporting or when column names are long:
import pandas as pd
df = pd.DataFrame({
'payment_method': ['credit_card', 'paypal', 'bank_transfer'],
'shipping_option': ['standard', 'express', 'overnight'],
'amount': [99.99, 149.99, 249.99]
})
# Custom prefixes for cleaner names
encoded = pd.get_dummies(
df,
columns=['payment_method', 'shipping_option'],
prefix=['pay', 'ship'],
prefix_sep='.',
dtype=int
)
print(encoded.columns.tolist())
Output:
['amount', 'pay.bank_transfer', 'pay.credit_card', 'pay.paypal', 'ship.express', 'ship.overnight', 'ship.standard']
Using a dot separator (prefix_sep='.') and short prefixes creates more readable feature names, especially when you have many categorical variables.
Working with NaN Values
Missing values in categorical columns need explicit handling. By default, get_dummies() ignores NaN values, leaving all dummy columns as 0 for that row:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'status': ['active', 'inactive', np.nan, 'active', np.nan],
'value': [100, 200, 150, 175, 125]
})
# Default behavior - NaN rows have all zeros
default_encoding = pd.get_dummies(df, columns=['status'], dtype=int)
print("Default (NaN ignored):")
print(default_encoding)
# With dummy_na - creates explicit NaN indicator
nan_encoding = pd.get_dummies(df, columns=['status'], dummy_na=True, dtype=int)
print("\nWith dummy_na=True:")
print(nan_encoding)
Output:
Default (NaN ignored):
value status_active status_inactive
0 100 1 0
1 200 0 1
2 150 0 0
3 175 1 0
4 125 0 0
With dummy_na=True:
value status_active status_inactive status_nan
0 100 1 0 0
1 200 0 1 0
2 150 0 0 1
3 175 1 0 0
4 125 0 0 1
Use dummy_na=True when missingness itself carries information. For example, a missing status might indicate incomplete registration, which could be predictive.
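An alternative, sketched here, is to fill missing values with an explicit label before encoding. This produces an indicator column with a more descriptive name than status_nan (the label "unknown" is an arbitrary choice):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'status': ['active', 'inactive', np.nan, 'active', np.nan],
    'value': [100, 200, 150, 175, 125]
})

# Replace NaN with a readable label before encoding
df['status'] = df['status'].fillna('unknown')
encoded = pd.get_dummies(df, columns=['status'], dtype=int)
print(encoded.columns.tolist())
# ['value', 'status_active', 'status_inactive', 'status_unknown']
```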
Practical Example: Preparing Data for Machine Learning
Let’s combine these concepts in a realistic workflow:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Create sample dataset
np.random.seed(42)
n_samples = 1000
df = pd.DataFrame({
'age': np.random.randint(18, 70, n_samples),
'income': np.random.randint(30000, 150000, n_samples),
'education': np.random.choice(['high_school', 'bachelors', 'masters', 'phd'], n_samples),
'employment': np.random.choice(['full_time', 'part_time', 'self_employed'], n_samples),
'region': np.random.choice(['urban', 'suburban', 'rural'], n_samples),
'purchased': np.random.choice([0, 1], n_samples, p=[0.7, 0.3])
})
# Introduce some missing values
df.loc[np.random.choice(n_samples, 50, replace=False), 'education'] = np.nan
print("Original data shape:", df.shape)
print("\nSample rows:")
print(df.head())
# Define categorical columns explicitly
categorical_cols = ['education', 'employment', 'region']
# Encode with best practices:
# - Explicit column selection
# - drop_first for linear model
# - dummy_na to handle missing education values
# - Integer dtype for sklearn compatibility
df_encoded = pd.get_dummies(
df,
columns=categorical_cols,
drop_first=True,
dummy_na=True,
dtype=int
)
print("\nEncoded data shape:", df_encoded.shape)
print("\nFeature columns:")
print([col for col in df_encoded.columns if col != 'purchased'])
# Prepare features and target
X = df_encoded.drop('purchased', axis=1)
y = df_encoded['purchased']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel accuracy: {accuracy:.3f}")
# Show feature importance
feature_importance = pd.DataFrame({
'feature': X.columns,
'coefficient': model.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)
print("\nTop features by coefficient magnitude:")
print(feature_importance.head(10))
This workflow demonstrates proper categorical encoding for a classification task. Note how we explicitly specify columns, use drop_first=True for the logistic regression model, and handle missing values with dummy_na=True.
A word of caution: get_dummies() works well for quick analysis, but it has a critical flaw for production ML pipelines. If your test set contains a category that wasn’t in the training set, you’ll get mismatched columns. For production systems, use scikit-learn’s OneHotEncoder with handle_unknown='ignore', which maintains consistent column structure regardless of input categories.