How to Handle Categorical Features in Python

Key Insights

  • Most machine learning algorithms require numerical input, making categorical encoding a critical preprocessing step that directly impacts model performance and training efficiency.
  • One-hot encoding works well for low-cardinality features but creates sparse, high-dimensional data for features with many categories—use target encoding or frequency encoding for high-cardinality cases.
  • Always handle categorical encoding within a proper pipeline to prevent data leakage and ensure consistent transformations between training and production environments.

Introduction to Categorical Features

Categorical features represent discrete values or groups rather than continuous measurements. While numerical features like age or price can be used directly in machine learning models, categorical features like “color,” “city,” or “product category” need transformation into numerical representations.

The challenge isn’t just converting text to numbers: it’s preserving the meaningful relationships in your data without introducing false ones. Encoding “red” as 1, “blue” as 2, and “green” as 3 might seem reasonable, but you’ve just told your model that blue is twice red and green is three times red. For most algorithms, this creates nonsensical mathematical relationships.

Here’s a simple dataset that demonstrates the problem:

import pandas as pd
import numpy as np

# Sample customer dataset
data = {
    'age': [25, 34, 28, 42, 35],
    'city': ['New York', 'London', 'Paris', 'New York', 'Tokyo'],
    'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor'],
    'subscription_type': ['Basic', 'Premium', 'Basic', 'Enterprise', 'Premium'],
    'purchase_amount': [50, 150, 75, 300, 180]
}

df = pd.DataFrame(data)
print(df)

The age and purchase_amount columns are ready for modeling, but city, education, and subscription_type need encoding. The right approach depends on the nature of each categorical feature.

Label Encoding for Ordinal Categories

Label encoding assigns each unique category a sequential integer. This works perfectly for ordinal categories—those with a natural order or ranking. Education levels, satisfaction ratings, and temperature ranges (cold/warm/hot) are classic examples.

The key is that the numerical ordering must match the logical ordering. “High School” should receive a lower number than “PhD” because there’s a meaningful progression.

from sklearn.preprocessing import LabelEncoder

# Create ordinal mapping for education
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
df['education_encoded'] = df['education'].map({
    'High School': 0,
    'Bachelor': 1,
    'Master': 2,
    'PhD': 3
})

print(df[['education', 'education_encoded']])

# Using LabelEncoder (be careful - it assigns alphabetically by default)
le = LabelEncoder()
df['sub_type_encoded'] = le.fit_transform(df['subscription_type'])
print("\nSubscription encoding (alphabetical):")
print(df[['subscription_type', 'sub_type_encoded']])
print(f"Mapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")

Notice the problem: LabelEncoder sorted alphabetically, making “Basic” = 0, “Enterprise” = 1, “Premium” = 2. If your business logic says Enterprise > Premium > Basic, this encoding is wrong. For ordinal data, manual mapping or OrdinalEncoder with explicit categories is safer.

The limitation of label encoding is clear: use it only when order matters. For nominal categories like city names, label encoding creates false relationships that can severely degrade model performance.
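For the ordinal case, OrdinalEncoder lets you spell out the business ordering up front, so “Enterprise” lands above “Premium” instead of wherever the alphabet puts it. A small self-contained sketch reusing the subscription tiers from above:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'subscription_type': ['Basic', 'Premium', 'Basic', 'Enterprise', 'Premium']
})

# Spell out the business ordering explicitly: Basic < Premium < Enterprise
encoder = OrdinalEncoder(categories=[['Basic', 'Premium', 'Enterprise']])
df['sub_type_encoded'] = encoder.fit_transform(df[['subscription_type']])
print(df)
```

Unlike LabelEncoder, OrdinalEncoder is meant for feature matrices (it expects 2D input), which is why the column is passed as `df[['subscription_type']]`.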

One-Hot Encoding for Nominal Categories

One-hot encoding creates a binary column for each category. If you have three cities, you get three columns, each containing 1 or 0. This avoids implying any mathematical relationship between categories.

# Using pandas get_dummies
cities_dummies = pd.get_dummies(df['city'], prefix='city')
print("Pandas get_dummies output:")
print(cities_dummies)

# Using sklearn OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
cities_encoded = ohe.fit_transform(df[['city']])
cities_df = pd.DataFrame(
    cities_encoded,
    columns=ohe.get_feature_names_out(['city'])
)
print("\nSklearn OneHotEncoder output:")
print(cities_df)

The key difference: pd.get_dummies() is quick for exploratory analysis but doesn’t handle new categories in test data. OneHotEncoder with handle_unknown='ignore' creates a zero vector for unseen categories, preventing runtime errors in production.

The dummy variable trap is worth mentioning: if you have n categories, you only need n-1 binary columns because the last category is implied when all others are zero. For linear models, include drop='first' to avoid multicollinearity:

ohe_drop = OneHotEncoder(sparse_output=False, drop='first')
cities_encoded_dropped = ohe_drop.fit_transform(df[['city']])
print(f"Original categories: {len(ohe.categories_[0])}")
print(f"Encoded columns: {cities_encoded_dropped.shape[1]}")

High cardinality is one-hot encoding’s Achilles heel. A feature with 1,000 unique values creates 1,000 columns, leading to sparse matrices, memory issues, and overfitting. When you hit 10+ categories, consider alternatives.

Target Encoding for High-Cardinality Features

Target encoding (also called mean encoding) replaces each category with the mean of the target variable for that category. If customers from New York have an average purchase amount of $150, every “New York” entry becomes 150.

This dramatically reduces dimensionality while capturing the relationship between category and target. The risk is data leakage—you’re using the target variable to create features, which can lead to overfitting.

from category_encoders import TargetEncoder
from sklearn.model_selection import KFold

# Simulate high-cardinality feature
np.random.seed(42)
df['zip_code'] = np.random.randint(10000, 10050, size=len(df))

# Basic target encoding (DON'T do this on training data directly)
zip_means = df.groupby('zip_code')['purchase_amount'].mean()
df['zip_encoded_naive'] = df['zip_code'].map(zip_means)

# Proper target encoding with cross-validation
te = TargetEncoder(cols=['zip_code'], min_samples_leaf=2, smoothing=1.0)

# Simulate train/test split
train_df = df.iloc[:4].copy()  # .copy() avoids pandas SettingWithCopyWarning
test_df = df.iloc[4:].copy()

# Fit on training data only
te.fit(train_df['zip_code'], train_df['purchase_amount'])
train_df['zip_encoded'] = te.transform(train_df['zip_code'])
test_df['zip_encoded'] = te.transform(test_df['zip_code'])

print("Target encoding with proper train/test separation:")
print(train_df[['zip_code', 'purchase_amount', 'zip_encoded']])

The min_samples_leaf and smoothing parameters prevent overfitting on rare categories by blending category means with the global mean. Always use cross-validation or a holdout approach during training to prevent leakage.
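One way to implement that cross-validation idea by hand is out-of-fold encoding: each row’s encoding comes only from means computed on the other folds, so its own target never leaks into its feature. A minimal sketch on synthetic data (the column names and fold count here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic data: a high-cardinality categorical and a numeric target
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'zip_code': rng.integers(10000, 10010, size=100),
    'purchase_amount': rng.normal(150, 40, size=100),
})

global_mean = df['purchase_amount'].mean()
df['zip_oof'] = np.nan

# Each row is encoded with category means computed on the *other* folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    fold_means = df.iloc[fit_idx].groupby('zip_code')['purchase_amount'].mean()
    df.loc[enc_idx, 'zip_oof'] = df.loc[enc_idx, 'zip_code'].map(fold_means).values

# Categories absent from the fitting folds fall back to the global mean
df['zip_oof'] = df['zip_oof'].fillna(global_mean)
print(df[['zip_code', 'purchase_amount', 'zip_oof']].head())
```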

Target encoding shines for features like user IDs, product SKUs, or geographic codes with hundreds or thousands of unique values. It’s particularly powerful in gradient boosting models.

Advanced Techniques: Frequency and Binary Encoding

Frequency encoding replaces categories with their occurrence counts. It’s simple, doesn’t increase dimensionality, and works surprisingly well when frequency correlates with the target.

Binary encoding converts category indices to binary, then splits each bit into a column. For 100 categories, one-hot creates 100 columns, but binary encoding creates only 7 (since 2^7 = 128).

from category_encoders import BinaryEncoder

# Create high-cardinality feature
categories = [f'category_{i}' for i in range(50)]
large_df = pd.DataFrame({
    'product_category': np.random.choice(categories, 1000),
    'target': np.random.randn(1000)
})

# Frequency encoding
freq_encoding = large_df['product_category'].value_counts().to_dict()
large_df['freq_encoded'] = large_df['product_category'].map(freq_encoding)

# Binary encoding
be = BinaryEncoder(cols=['product_category'])
binary_encoded = be.fit_transform(large_df['product_category'])

# One-hot for comparison
ohe_large = OneHotEncoder(sparse_output=False)
onehot_encoded = ohe_large.fit_transform(large_df[['product_category']])

print(f"Original feature: 1 column")
print(f"Frequency encoding: 1 column")
print(f"Binary encoding: {binary_encoded.shape[1]} columns")
print(f"One-hot encoding: {onehot_encoded.shape[1]} columns")
print(f"\nDimensionality reduction: {onehot_encoded.shape[1] / binary_encoded.shape[1]:.1f}x")

Binary encoding is underutilized but extremely effective for tree-based models with high-cardinality features. It maintains some category separation while avoiding the curse of dimensionality.

Handling Missing Categories and Best Practices

Missing values in categorical features need explicit handling. You can create a separate “Unknown” category, use the mode, or apply model-based imputation.
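The first two options are one-liners in pandas. A quick sketch with a hypothetical education column; the trade-off is that mode imputation hides the fact that a value was ever missing:

```python
import pandas as pd

education = pd.Series(['Bachelor', None, 'PhD', 'Bachelor', None])

# Option 1: an explicit 'Unknown' level keeps missingness visible to the model
as_unknown = education.fillna('Unknown')

# Option 2: mode imputation fills gaps with the most frequent category,
# at the cost of erasing the missingness signal
as_mode = education.fillna(education.mode()[0])

print(as_unknown.tolist())
print(as_mode.tolist())
```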

Here’s a complete pipeline that handles everything properly:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor

# Define feature types
ordinal_features = ['education']
nominal_features = ['city']
numeric_features = ['age']

# Create transformers
ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('encoder', OrdinalEncoder(
        categories=[['High School', 'Bachelor', 'Master', 'PhD', 'Unknown']],
        handle_unknown='use_encoded_value',
        unknown_value=-1
    ))
])

nominal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('ord', ordinal_transformer, ordinal_features),
        ('nom', nominal_transformer, nominal_features),
        ('num', numeric_transformer, numeric_features)
    ])

# Full pipeline
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# This pipeline handles everything: missing values, encoding, and modeling
X = df[['education', 'city', 'age']]
y = df['purchase_amount']

model.fit(X, y)
predictions = model.predict(X)
print(f"Pipeline predictions: {predictions}")

This approach ensures that all transformations learned on training data apply consistently to test data. The ColumnTransformer applies different encoding strategies to different feature types, and the Pipeline prevents data leakage by fitting only on training data.

Choose your encoding method based on:

  • Cardinality: Low (< 10) → one-hot; High (> 50) → target or frequency encoding
  • Ordinality: Natural order → label/ordinal encoding; No order → one-hot or target encoding
  • Model type: Linear models prefer one-hot; tree-based models handle label encoding well
  • Target relationship: Strong correlation → target encoding; Weak → one-hot or frequency
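The cardinality and ordinality rules above can be condensed into a tiny helper. This hypothetical suggest_encoder function just restates the checklist; the thresholds are rules of thumb to validate per problem, not hard limits:

```python
def suggest_encoder(n_unique: int, is_ordinal: bool) -> str:
    """Rule-of-thumb encoder choice; thresholds are heuristics, not hard limits."""
    if is_ordinal:
        return 'ordinal'
    if n_unique < 10:
        return 'one-hot'
    if n_unique > 50:
        return 'target or frequency'
    return 'one-hot or binary'

print(suggest_encoder(4, is_ordinal=True))     # e.g. education levels
print(suggest_encoder(5, is_ordinal=False))    # e.g. a handful of cities
print(suggest_encoder(500, is_ordinal=False))  # e.g. zip codes, SKUs
```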

Categorical encoding isn’t just a preprocessing checkbox—it’s a modeling decision that affects performance, interpretability, and production reliability. Build pipelines, test multiple approaches, and validate on holdout data to find what works for your specific problem.
