How to One-Hot Encode in Pandas

Key Insights

  • Use pd.get_dummies() for quick, exploratory one-hot encoding, but switch to scikit-learn’s OneHotEncoder when you need consistent transformations between training and test data.
  • Always set drop_first=True when feeding encoded data into linear models to avoid the dummy variable trap and multicollinearity issues.
  • Check column cardinality before encoding—high-cardinality categorical variables can explode your feature space and crash your memory.

Introduction

One-hot encoding transforms categorical variables into a numerical format that machine learning algorithms can process. Most algorithms expect numerical input, and simply converting categories to integers (1, 2, 3) implies an ordinal relationship that doesn’t exist. “Red” isn’t greater than “Blue.”

One-hot encoding creates binary columns for each category. If you have a “color” column with values [“red”, “blue”, “green”], you get three new columns: color_red, color_blue, and color_green. Each row gets a 1 in the column matching its original category and 0s elsewhere.

You’ll use this technique constantly when preparing data for logistic regression, neural networks, tree-based models, and clustering algorithms. Pandas makes it straightforward, but there are several gotchas that can bite you in production.

Understanding the Data

Categorical data comes in two flavors: nominal and ordinal. Nominal categories have no inherent order—colors, cities, product names. Ordinal categories have a meaningful sequence—education levels, satisfaction ratings, size categories.

One-hot encoding works best for nominal data. For ordinal data, you might prefer label encoding that preserves the order, or you might still one-hot encode depending on your model’s assumptions.
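For ordinal data, a simple order-preserving mapping often works better than one-hot encoding. A minimal sketch using pandas' ordered categoricals (the size ordering here is an assumption for illustration):

```python
import pandas as pd

# Ordinal encoding: map sizes to integers that preserve their natural order
sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'M', 'XL']})
size_order = ['S', 'M', 'L', 'XL']  # assumed ordering, smallest to largest

sizes['size_encoded'] = pd.Categorical(
    sizes['size'], categories=size_order, ordered=True
).codes

print(sizes)
#   size  size_encoded
# 0    S             0
# 1    M             1
# 2    L             2
# 3    M             1
# 4   XL             3
```

The `.codes` attribute yields the position of each value in the declared category order, so "M" > "S" survives the encoding.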

Let’s create a sample DataFrame to work with:

import pandas as pd
import numpy as np

# Sample e-commerce data
df = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105, 106],
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green'],
    'size': ['S', 'M', 'L', 'M', 'S', 'XL'],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA', 'NYC'],
    'price': [29.99, 34.99, 24.99, 31.99, 27.99, 39.99]
})

print(df)

Output:

   product_id  color size     city  price
0         101    red    S      NYC  29.99
1         102   blue    M       LA  34.99
2         103  green    L  Chicago  24.99
3         104    red    M      NYC  31.99
4         105   blue    S       LA  27.99
5         106  green   XL      NYC  39.99

Here, color, size, and city are categorical. product_id and price are numerical. We need to encode the categorical columns before feeding this data into most machine learning models.

Using pd.get_dummies() - The Quick Method

The pd.get_dummies() function is the fastest way to one-hot encode in pandas. It automatically detects categorical columns (object and category dtypes) and converts them.

Basic Usage

# Encode a single column (dtype=int gives 0/1; pandas 2.0+ defaults to booleans)
encoded_color = pd.get_dummies(df['color'], dtype=int)
print(encoded_color)

Output:

   blue  green  red
0     0      0    1
1     1      0    0
2     0      1    0
3     0      0    1
4     1      0    0
5     0      1    0

To encode and merge back into your DataFrame:

# Encode specific columns in the DataFrame
df_encoded = pd.get_dummies(df, columns=['color'], dtype=int)
print(df_encoded)

Output:

   product_id size     city  price  color_blue  color_green  color_red
0         101    S      NYC  29.99           0            0          1
1         102    M       LA  34.99           1            0          0
2         103    L  Chicago  24.99           0            1          0
3         104    M      NYC  31.99           0            0          1
4         105    S       LA  27.99           1            0          0
5         106   XL      NYC  39.99           0            1          0

Multiple Columns with Custom Prefixes

When encoding multiple columns, custom prefixes keep your feature names readable:

# Encode multiple columns with custom prefixes
df_encoded = pd.get_dummies(
    df, 
    columns=['color', 'size', 'city'],
    prefix={'color': 'c', 'size': 'sz', 'city': 'loc'},
    dtype=int  # Use integers instead of booleans
)

print(df_encoded.columns.tolist())

Output:

['product_id', 'price', 'c_blue', 'c_green', 'c_red', 'sz_L', 'sz_M', 'sz_S', 'sz_XL', 'loc_Chicago', 'loc_LA', 'loc_NYC']

The dtype=int parameter ensures you get 0s and 1s instead of True and False (the default in pandas 2.0+), which matters for some downstream processes.

Handling the Dummy Variable Trap

The dummy variable trap occurs when your encoded columns are perfectly multicollinear. If you know color_blue=0 and color_green=0, you can deduce color_red=1. This redundancy causes problems for linear regression and logistic regression—the model can’t compute unique coefficients.

The solution: drop one category per encoded variable using drop_first=True.

# Without drop_first (all categories kept)
df_full = pd.get_dummies(df, columns=['color'], dtype=int)
print("Full encoding:")
print(df_full[['color_blue', 'color_green', 'color_red']].head(3))

# With drop_first (first category dropped)
df_reduced = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)
print("\nReduced encoding:")
print(df_reduced[['color_green', 'color_red']].head(3))

Output:

Full encoding:
   color_blue  color_green  color_red
0           0            0          1
1           1            0          0
2           0            1          0

Reduced encoding:
   color_green  color_red
0            0          1
1            0          0
2            1          0

With drop_first=True, “blue” becomes the reference category (both color_green and color_red equal 0). The model interprets coefficients relative to this baseline.

When to use drop_first=True:

  • Linear regression, logistic regression, and other linear models
  • When you need to interpret coefficients

When to keep all categories:

  • Tree-based models (they don’t suffer from multicollinearity)
  • Neural networks
  • When interpretability isn’t a concern

Working with Unknown Categories (Production Considerations)

Here’s where pd.get_dummies() falls short: it creates columns based on the categories present in your data. If your test set contains a category that wasn’t in your training set, you’ll get mismatched columns. If a training category is missing from your test set, same problem.

Using pd.Categorical with Predefined Categories

One pandas-native solution uses pd.Categorical to define all possible categories upfront:

# Define all possible categories
all_colors = ['red', 'blue', 'green', 'yellow']  # yellow might appear in test data

# Training data
train_df = pd.DataFrame({'color': ['red', 'blue', 'green']})
train_df['color'] = pd.Categorical(train_df['color'], categories=all_colors)

# Test data with unseen category handled
test_df = pd.DataFrame({'color': ['yellow', 'red']})
test_df['color'] = pd.Categorical(test_df['color'], categories=all_colors)

# Now both encode consistently
train_encoded = pd.get_dummies(train_df, columns=['color'], dtype=int)
test_encoded = pd.get_dummies(test_df, columns=['color'], dtype=int)

print("Train columns:", train_encoded.columns.tolist())
print("Test columns:", test_encoded.columns.tolist())

Output:

Train columns: ['color_blue', 'color_green', 'color_red', 'color_yellow']
Test columns: ['color_blue', 'color_green', 'color_red', 'color_yellow']
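If you'd rather not predefine the category list, another pandas-native fix is to reindex the test frame's columns against the training columns after encoding. A sketch; fill_value=0 zeroes out any training-only columns, and test-only (unseen) columns are dropped, mirroring OneHotEncoder's handle_unknown='ignore' behavior:

```python
import pandas as pd

train_df = pd.DataFrame({'color': ['red', 'blue', 'green']})
test_df = pd.DataFrame({'color': ['red', 'red']})  # 'blue' and 'green' absent

train_encoded = pd.get_dummies(train_df, columns=['color'], dtype=int)
test_encoded = pd.get_dummies(test_df, columns=['color'], dtype=int)

# Align test columns to the training layout; missing columns become all zeros
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(test_aligned.columns.tolist())
# ['color_blue', 'color_green', 'color_red']
```

This keeps column order identical between train and test, which matters for models that consume raw arrays.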

Scikit-learn’s OneHotEncoder for Production

For production pipelines, scikit-learn’s OneHotEncoder is more robust. It fits on training data and applies the same transformation to test data:

from sklearn.preprocessing import OneHotEncoder

# Training data
train_df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red'],
    'size': ['S', 'M', 'L', 'M']
})

# Initialize and fit encoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_df[['color', 'size']])

# Transform training data
train_encoded = encoder.transform(train_df[['color', 'size']])
print("Feature names:", encoder.get_feature_names_out())

# Test data with unknown category
test_df = pd.DataFrame({
    'color': ['purple', 'red'],  # 'purple' is unknown
    'size': ['S', 'XL']          # 'XL' is unknown
})

test_encoded = encoder.transform(test_df[['color', 'size']])
print("\nTest encoded (unknowns become all zeros):")
print(test_encoded)

The handle_unknown='ignore' parameter ensures unknown categories result in all-zero rows for that feature rather than throwing an error.

Common Pitfalls and Best Practices

Check Cardinality Before Encoding

High-cardinality columns (many unique values) will explode your feature space. A column with 10,000 unique categories becomes 10,000 new columns.

def check_cardinality(df, threshold=50):
    """Identify columns that might cause memory issues when one-hot encoded."""
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    
    high_cardinality = []
    for col in categorical_cols:
        n_unique = df[col].nunique()
        if n_unique > threshold:
            high_cardinality.append((col, n_unique))
            print(f"⚠️  '{col}': {n_unique} unique values - consider alternative encoding")
        else:
            print(f"✓  '{col}': {n_unique} unique values - safe to one-hot encode")
    
    return high_cardinality

# Example with problematic data
df_problematic = pd.DataFrame({
    'user_id': [f'user_{i}' for i in range(1000)],  # High cardinality
    'country': ['US', 'UK', 'CA'] * 333 + ['US'],   # Low cardinality
    'product_sku': [f'SKU-{i:05d}' for i in range(1000)]  # High cardinality
})

check_cardinality(df_problematic)

For high-cardinality columns, consider:

  • Target encoding: Replace categories with the mean of the target variable
  • Frequency encoding: Replace categories with their occurrence frequency
  • Hashing: Use feature hashing to limit dimensions
  • Grouping: Combine rare categories into an “other” bucket
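The grouping approach is easy to sketch in pandas (the count threshold here is an arbitrary choice for illustration):

```python
import pandas as pd

s = pd.Series(['US', 'US', 'UK', 'US', 'CA', 'FR', 'UK', 'US', 'DE'])

# Keep categories that appear often enough; lump the rest into 'other'
min_count = 2  # assumed threshold
counts = s.value_counts()
keep = counts[counts >= min_count].index
s_grouped = s.where(s.isin(keep), 'other')

print(s_grouped.value_counts())
# US       4
# other    3
# UK       2
```

After grouping, one-hot encoding produces three columns instead of six, and rare categories no longer each get their own near-empty column.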

Handle NaN Values

By default, pd.get_dummies() ignores NaN values—they get 0s in all encoded columns. If NaN is meaningful, handle it explicitly:

df_with_nan = pd.DataFrame({'color': ['red', 'blue', None, 'red']})

# Default behavior: NaN becomes all zeros
print(pd.get_dummies(df_with_nan, columns=['color'], dtype=int))

# Treat NaN as a category
df_with_nan['color'] = df_with_nan['color'].fillna('unknown')
print(pd.get_dummies(df_with_nan, columns=['color'], dtype=int))
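Alternatively, pd.get_dummies() has a dummy_na parameter that adds an explicit indicator column for missing values, avoiding the fillna step:

```python
import pandas as pd

df_with_nan = pd.DataFrame({'color': ['red', 'blue', None, 'red']})

# dummy_na=True appends a color_nan column that flags missing values
encoded = pd.get_dummies(df_with_nan, columns=['color'], dummy_na=True, dtype=int)

print(encoded.columns.tolist())
# ['color_blue', 'color_red', 'color_nan']
```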

Memory Optimization

One-hot encoded DataFrames can consume significant memory. Use sparse matrices when possible:

from scipy import sparse

# For very wide encoded data, convert to sparse
df_encoded = pd.get_dummies(df, columns=['color', 'size', 'city'], dtype=np.int8)
sparse_matrix = sparse.csr_matrix(df_encoded.values)

print(f"Dense memory: {df_encoded.memory_usage(deep=True).sum() / 1024:.2f} KB")
# Count all three CSR arrays (values, column indices, row pointers), not just the data
sparse_bytes = sparse_matrix.data.nbytes + sparse_matrix.indices.nbytes + sparse_matrix.indptr.nbytes
print(f"Sparse memory: {sparse_bytes / 1024:.2f} KB")
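pd.get_dummies() can also produce pandas' own sparse columns directly via sparse=True, which keeps everything in a DataFrame rather than dropping down to scipy. A sketch with deliberately high cardinality so the sparsity pays off:

```python
import pandas as pd

# 1,000 rows, 100 distinct categories -> 100 dummy columns, 99% zeros
df = pd.DataFrame({'category': [f'cat_{i % 100}' for i in range(1000)]})

dense = pd.get_dummies(df, columns=['category'])
sparse_df = pd.get_dummies(df, columns=['category'], sparse=True)

dense_kb = dense.memory_usage(deep=True).sum() / 1024
sparse_kb = sparse_df.memory_usage(deep=True).sum() / 1024
print(f"Dense: {dense_kb:.1f} KB, Sparse: {sparse_kb:.1f} KB")
```

Sparse columns store only the nonzero entries (plus their positions), so the savings grow with the number of categories.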

Conclusion

One-hot encoding in pandas comes down to choosing the right tool for your context:

  • Use pd.get_dummies() for quick exploration and when your data pipeline is simple
  • Use drop_first=True for linear models to avoid multicollinearity
  • Use scikit-learn’s OneHotEncoder for production pipelines where train/test consistency matters
  • Check cardinality first—don’t blindly encode columns with thousands of categories

The method you choose matters less than understanding why you’re choosing it. Get the fundamentals right, and you’ll avoid the silent bugs that corrupt model performance downstream.
