How to Use Get Dummies in Pandas

Key Insights

  • pd.get_dummies() transforms categorical variables into binary columns that machine learning algorithms can process, but you must use drop_first=True to avoid multicollinearity in regression models.
  • Always specify the columns parameter explicitly to control which columns get encoded—relying on automatic detection leads to brittle pipelines that break when data types change.
  • For production ML pipelines, consider using scikit-learn’s OneHotEncoder instead, as get_dummies() doesn’t handle unseen categories in test data gracefully.

Introduction to One-Hot Encoding

Machine learning algorithms work with numbers, not strings. When your dataset contains categorical variables like “red”, “blue”, or “green”, you need to convert them into a numerical format. One-hot encoding creates binary columns for each category, where a 1 indicates the presence of that category and 0 indicates absence.

Pandas provides pd.get_dummies() as a quick way to perform this transformation. It’s straightforward for exploratory analysis and prototyping, though it has limitations for production systems that we’ll address later.

Here’s what one-hot encoding looks like in practice:

import pandas as pd

# Original data with categorical column
df = pd.DataFrame({
    'product': ['laptop', 'phone', 'tablet', 'laptop', 'phone'],
    'price': [999, 699, 449, 1299, 899]
})

print("Before encoding:")
print(df)

# Apply one-hot encoding
df_encoded = pd.get_dummies(df, columns=['product'])

print("\nAfter encoding:")
print(df_encoded)

Output:

Before encoding:
  product  price
0  laptop    999
1   phone    699
2  tablet    449
3  laptop   1299
4   phone    899

After encoding:
   price  product_laptop  product_phone  product_tablet
0    999            True          False           False
1    699           False           True           False
2    449           False          False            True
3   1299            True          False           False
4    899           False           True           False

The categorical product column becomes three binary columns. Each row has exactly one True value across these new columns.
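The transformation is also reversible. As a quick sketch (assuming pandas 1.5 or later, which added pd.from_dummies()), you can reconstruct the original categorical column by passing the dummy columns and the separator back:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['laptop', 'phone', 'tablet', 'laptop', 'phone'],
    'price': [999, 699, 449, 1299, 899]
})
df_encoded = pd.get_dummies(df, columns=['product'])

# Select only the dummy columns and let from_dummies strip the 'product_' prefix
dummy_cols = [c for c in df_encoded.columns if c.startswith('product_')]
recovered = pd.from_dummies(df_encoded[dummy_cols], sep='_')
print(recovered['product'].tolist())
# ['laptop', 'phone', 'tablet', 'laptop', 'phone']
```

This round-trip is handy for sanity-checking an encoding or for turning model-ready data back into human-readable categories.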

Basic Syntax and Parameters

The function signature contains several parameters worth understanding:

pd.get_dummies(
    data,
    prefix=None,
    prefix_sep='_',
    dummy_na=False,
    columns=None,
    sparse=False,
    drop_first=False,
    dtype=None
)

The most important parameters:

  • columns: List of column names to encode. If None, encodes all object/category dtype columns.
  • prefix: String or list of strings to prepend to output column names.
  • drop_first: Remove the first category to avoid multicollinearity.
  • dtype: Output dtype for new columns. Defaults to bool in recent pandas versions.
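One detail worth knowing: data doesn't have to be a DataFrame. Passing a bare Series encodes it directly, with prefix naming the output columns (a minimal sketch):

```python
import pandas as pd

# get_dummies also accepts a Series; prefix names the resulting columns
s = pd.Series(['a', 'b', 'a'])
dummies = pd.get_dummies(s, prefix='cat', dtype=int)
print(dummies.columns.tolist())
# ['cat_a', 'cat_b']
```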

Here’s basic usage on a single column:

import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA'],
    'temperature': [75, 85, 70, 72, 88]
})

# Encode with integer output instead of boolean
encoded = pd.get_dummies(df, columns=['city'], dtype=int)
print(encoded)

Output:

   temperature  city_Chicago  city_LA  city_NYC
0           75             0        0         1
1           85             0        1         0
2           70             1        0         0
3           72             0        0         1
4           88             0        1         0

Using dtype=int produces 0/1 integers instead of True/False booleans, which some ML libraries prefer.

Handling Multiple Categorical Columns

Real datasets typically contain multiple categorical features. You can encode them all in one call:

import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'M', 'S'],
    'category': ['electronics', 'clothing', 'electronics', 'clothing', 'electronics'],
    'price': [29.99, 49.99, 19.99, 39.99, 24.99]
})

# Encode all categorical columns
encoded = pd.get_dummies(df, columns=['color', 'size', 'category'], dtype=int)
print(encoded)

Output:

   price  color_blue  color_green  color_red  size_L  size_M  size_S  category_clothing  category_electronics
0  29.99           0            0          1       0       0       1                  0                     1
1  49.99           1            0          0       0       1       0                  1                     0
2  19.99           0            1          0       1       0       0                  0                     1
3  39.99           0            0          1       0       1       0                  1                     0
4  24.99           1            0          0       0       0       1                  0                     1

I strongly recommend always specifying the columns parameter explicitly. Without it, get_dummies() automatically encodes all columns with object or category dtype. This creates problems when:

  1. A numeric column gets read as object due to a single malformed value
  2. You add new columns to your pipeline that shouldn’t be encoded
  3. Column dtypes change between environments

Explicit is better than implicit. Specify your columns.
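To see the failure mode from point 1, here is a small sketch: a single malformed value turns a numeric column into object dtype, and without an explicit columns argument, get_dummies() silently one-hot encodes it.

```python
import pandas as pd

# 'price' contains one malformed string, so pandas treats it as object dtype
df = pd.DataFrame({
    'color': ['red', 'blue', 'red'],
    'price': ['100', '200', 'N/A']   # should be numeric
})

# Automatic detection encodes BOTH columns, exploding 'price' into dummies
auto = pd.get_dummies(df)
print(auto.columns.tolist())
# ['color_blue', 'color_red', 'price_100', 'price_200', 'price_N/A']

# Explicit column selection leaves 'price' intact for proper cleaning
explicit = pd.get_dummies(df, columns=['color'])
print(explicit.columns.tolist())
# ['price', 'color_blue', 'color_red']
```

With explicit columns, the malformed price column survives as-is, where it can be caught and coerced with pd.to_numeric instead of being shredded into meaningless dummies.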

Avoiding the Dummy Variable Trap

The dummy variable trap occurs when encoded columns are perfectly multicollinear—one column can be predicted from the others. If you know a product is not a laptop and not a phone, it must be a tablet. This redundancy causes problems in linear regression and similar models.

The solution is dropping one category per encoded variable:

import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West', 'North'],
    'sales': [100, 150, 120, 180, 110]
})

# Without drop_first - creates multicollinearity
full_encoding = pd.get_dummies(df, columns=['region'], dtype=int)
print("Full encoding (4 columns):")
print(full_encoding)

# With drop_first - avoids multicollinearity
reduced_encoding = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)
print("\nReduced encoding (3 columns):")
print(reduced_encoding)

Output:

Full encoding (4 columns):
   sales  region_East  region_North  region_South  region_West
0    100            0             1             0            0
1    150            0             0             1            0
2    120            1             0             0            0
3    180            0             0             0            1
4    110            0             1             0            0

Reduced encoding (3 columns):
   sales  region_North  region_South  region_West
0    100             1             0            0
1    150             0             1            0
2    120             0             0            0
3    180             0             0            1
4    110             1             0            0

With drop_first=True, “East” becomes the reference category. A row with all zeros in the region columns represents East. This preserves all information while eliminating redundancy.

Use drop_first=True for regression models. Tree-based models like Random Forest and XGBoost don’t require this since they’re not affected by multicollinearity.
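The redundancy is easy to verify directly: in the full encoding, the dummy columns for one variable always sum to exactly 1 per row, so any one column is a perfect linear function of the others. A quick check:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West', 'North'],
    'sales': [100, 150, 120, 180, 110]
})
full = pd.get_dummies(df, columns=['region'], dtype=int)
region_cols = [c for c in full.columns if c.startswith('region_')]

# Every row sums to 1 across the dummy columns: perfect multicollinearity
print(full[region_cols].sum(axis=1).tolist())
# [1, 1, 1, 1, 1]
```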

Customizing Output with Prefixes

Default column names like region_North work fine, but you might want cleaner names for reporting or when column names are long:

import pandas as pd

df = pd.DataFrame({
    'payment_method': ['credit_card', 'paypal', 'bank_transfer'],
    'shipping_option': ['standard', 'express', 'overnight'],
    'amount': [99.99, 149.99, 249.99]
})

# Custom prefixes for cleaner names
encoded = pd.get_dummies(
    df,
    columns=['payment_method', 'shipping_option'],
    prefix=['pay', 'ship'],
    prefix_sep='.',
    dtype=int
)

print(encoded.columns.tolist())

Output:

['amount', 'pay.bank_transfer', 'pay.credit_card', 'pay.paypal', 'ship.express', 'ship.overnight', 'ship.standard']

Using a dot separator (prefix_sep='.') and short prefixes creates more readable feature names, especially when you have many categorical variables.

Working with NaN Values

Missing values in categorical columns need explicit handling. By default, get_dummies() ignores NaN values, leaving all dummy columns as 0 for that row:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'status': ['active', 'inactive', np.nan, 'active', np.nan],
    'value': [100, 200, 150, 175, 125]
})

# Default behavior - NaN rows have all zeros
default_encoding = pd.get_dummies(df, columns=['status'], dtype=int)
print("Default (NaN ignored):")
print(default_encoding)

# With dummy_na - creates explicit NaN indicator
nan_encoding = pd.get_dummies(df, columns=['status'], dummy_na=True, dtype=int)
print("\nWith dummy_na=True:")
print(nan_encoding)

Output:

Default (NaN ignored):
   value  status_active  status_inactive
0    100              1                0
1    200              0                1
2    150              0                0
3    175              1                0
4    125              0                0

With dummy_na=True:
   value  status_active  status_inactive  status_nan
0    100              1                0           0
1    200              0                1           0
2    150              0                0           1
3    175              1                0           0
4    125              0                0           1

Use dummy_na=True when missingness itself carries information. For example, a missing status might indicate incomplete registration, which could be predictive.

Practical Example: Preparing Data for Machine Learning

Let’s combine these concepts in a realistic workflow:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create sample dataset
np.random.seed(42)
n_samples = 1000

df = pd.DataFrame({
    'age': np.random.randint(18, 70, n_samples),
    'income': np.random.randint(30000, 150000, n_samples),
    'education': np.random.choice(['high_school', 'bachelors', 'masters', 'phd'], n_samples),
    'employment': np.random.choice(['full_time', 'part_time', 'self_employed'], n_samples),
    'region': np.random.choice(['urban', 'suburban', 'rural'], n_samples),
    'purchased': np.random.choice([0, 1], n_samples, p=[0.7, 0.3])
})

# Introduce some missing values
df.loc[np.random.choice(n_samples, 50, replace=False), 'education'] = np.nan

print("Original data shape:", df.shape)
print("\nSample rows:")
print(df.head())

# Define categorical columns explicitly
categorical_cols = ['education', 'employment', 'region']

# Encode with best practices:
# - Explicit column selection
# - drop_first for linear model
# - dummy_na to handle missing education values
# - Integer dtype for sklearn compatibility
df_encoded = pd.get_dummies(
    df,
    columns=categorical_cols,
    drop_first=True,
    dummy_na=True,
    dtype=int
)

print("\nEncoded data shape:", df_encoded.shape)
print("\nFeature columns:")
print([col for col in df_encoded.columns if col != 'purchased'])

# Prepare features and target
X = df_encoded.drop('purchased', axis=1)
y = df_encoded['purchased']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel accuracy: {accuracy:.3f}")

# Show feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'coefficient': model.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)

print("\nTop features by coefficient magnitude:")
print(feature_importance.head(10))

This workflow demonstrates proper categorical encoding for a classification task. Note how we explicitly specify columns, use drop_first=True for the logistic regression model, and handle missing values with dummy_na=True.

A word of caution: get_dummies() works well for quick analysis, but it has a critical flaw for production ML pipelines. If your test set contains a category that wasn’t in the training set, you’ll get mismatched columns. For production systems, use scikit-learn’s OneHotEncoder with handle_unknown='ignore', which maintains consistent column structure regardless of input categories.
