How to Implement Target Encoding in Python
Key Insights
- Target encoding replaces categorical values with the mean of the target variable for each category, reducing dimensionality while preserving predictive power—especially valuable for high-cardinality features where one-hot encoding would create too many columns.
- Proper implementation requires cross-validation or holdout strategies to prevent target leakage; encoding the training set using its own target values without safeguards will cause severe overfitting and unrealistic performance metrics.
- Smoothing techniques blend category-specific means with global means to handle rare categories robustly, preventing extreme encoded values from dominating predictions when sample sizes are small.
Introduction to Target Encoding
Target encoding transforms categorical variables by replacing each category with a statistic derived from the target variable—typically the mean for regression or the probability for classification. Unlike one-hot encoding, which creates binary columns for each category, target encoding maintains a single numerical column regardless of cardinality.
Use target encoding when dealing with high-cardinality categorical features (dozens or hundreds of unique values) where one-hot encoding would explode your feature space. It’s particularly effective for tree-based models like XGBoost and LightGBM, which can split on the numeric ordering target encoding creates. Avoid it when a linear model needs each category to contribute its own independent coefficient, or when you have so few unique categories that one-hot encoding remains practical.
The primary risk is target leakage. If you naively encode your training data using the target variable without proper cross-validation, your model will memorize the encoding rather than learn generalizable patterns. This produces artificially high training scores that collapse on unseen data.
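A toy illustration of the failure mode (hypothetical data): when every category appears only once, the naive per-category mean is just a copy of the target.

```python
import pandas as pd

# Hypothetical toy data: each category appears exactly once, so the naive
# per-category mean equals that row's own target value.
df = pd.DataFrame({'cat': ['a', 'b', 'c'], 'y': [1.0, 0.0, 1.0]})
df['cat_encoded'] = df['cat'].map(df.groupby('cat')['y'].mean())

# The "feature" is now an exact copy of the target: a model trained on it
# will look perfect on the training set and collapse on new data.
print((df['cat_encoded'] == df['y']).all())  # True
```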
The Mathematics Behind Target Encoding
The basic formula replaces each category value with the mean target value for that category:
encoded_value(category_i) = mean(target | feature == category_i)
For a categorical feature “City” predicting house prices, if houses in “Seattle” have an average price of $650,000, every “Seattle” entry gets encoded as 650000.
Rare categories present a problem. If only one house in your dataset is from “Spokane” and it sold for $200,000, encoding all “Spokane” entries as 200000 gives excessive weight to a single observation. Smoothing addresses this by blending the category mean with the global mean:
smoothed_value = (n * category_mean + m * global_mean) / (n + m)
Where n is the count of observations in that category and m is the smoothing parameter controlling the blend strength.
Here’s a simple manual calculation:
import pandas as pd

# Sample dataset
data = pd.DataFrame({
    'city': ['Seattle', 'Seattle', 'Portland', 'Portland', 'Spokane', 'Seattle'],
    'price': [650000, 700000, 450000, 480000, 200000, 680000]
})

# Calculate category means
category_means = data.groupby('city')['price'].mean()
global_mean = data['price'].mean()

print("Category means:")
print(category_means)
print(f"\nGlobal mean: {global_mean:.2f}")

# Apply smoothing (m=10)
m = 10
category_counts = data.groupby('city').size()

smoothed_encoding = {}
for city in category_means.index:
    n = category_counts[city]
    smoothed = (n * category_means[city] + m * global_mean) / (n + m)
    smoothed_encoding[city] = smoothed
    print(f"{city}: count={n}, raw={category_means[city]:.2f}, smoothed={smoothed:.2f}")
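The effect is most dramatic for the single-observation Spokane category, and the arithmetic is easy to verify by hand:

```python
# Check the smoothing arithmetic for the single-observation "Spokane" category:
# n=1, raw mean 200000, global mean of the six prices, m=10.
n, raw_mean, m = 1, 200_000, 10
global_mean = (650_000 + 700_000 + 450_000 + 480_000 + 200_000 + 680_000) / 6
smoothed = (n * raw_mean + m * global_mean) / (n + m)
print(round(smoothed, 2))  # 496969.7, far closer to the global mean than to 200000
```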
Basic Implementation from Scratch
Implementing target encoding correctly requires careful handling of train/test splits. The cardinal rule: never use test set target values during encoding. Here’s a robust implementation using cross-validation:
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_cv(X_train, y_train, X_test, categorical_col, n_splits=5, smoothing=10):
    """
    Target encode a categorical column using cross-validation to prevent leakage.

    Parameters:
    - X_train: Training features DataFrame
    - y_train: Training target Series
    - X_test: Test features DataFrame
    - categorical_col: Name of column to encode
    - n_splits: Number of CV folds
    - smoothing: Smoothing parameter (higher = more regularization)
    """
    # Initialize encoded columns
    X_train_encoded = X_train.copy()
    X_test_encoded = X_test.copy()
    encoded_col = f'{categorical_col}_encoded'

    # Global mean for smoothing and unseen categories
    global_mean = y_train.mean()

    # Encode training data using CV: each fold is encoded with statistics
    # computed only from the other folds
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    X_train_encoded[encoded_col] = 0.0

    for train_idx, val_idx in kf.split(X_train):
        # Calculate means on the training fold
        X_fold_train = X_train.iloc[train_idx]
        y_fold_train = y_train.iloc[train_idx]

        # Compute smoothed means
        category_stats = pd.DataFrame({
            'sum': y_fold_train.groupby(X_fold_train[categorical_col]).sum(),
            'count': y_fold_train.groupby(X_fold_train[categorical_col]).count()
        })
        category_stats['smoothed_mean'] = (
            (category_stats['sum'] + smoothing * global_mean) /
            (category_stats['count'] + smoothing)
        )

        # Apply to the validation fold. kf.split yields positional indices,
        # so assign with iloc (loc would look up index labels instead)
        X_val = X_train.iloc[val_idx]
        encoded_values = (
            X_val[categorical_col]
            .map(category_stats['smoothed_mean'])
            .fillna(global_mean)
        )
        X_train_encoded.iloc[val_idx, X_train_encoded.columns.get_loc(encoded_col)] = encoded_values.values

    # Encode test data using the full training set
    category_stats_full = pd.DataFrame({
        'sum': y_train.groupby(X_train[categorical_col]).sum(),
        'count': y_train.groupby(X_train[categorical_col]).count()
    })
    category_stats_full['smoothed_mean'] = (
        (category_stats_full['sum'] + smoothing * global_mean) /
        (category_stats_full['count'] + smoothing)
    )
    X_test_encoded[encoded_col] = (
        X_test[categorical_col]
        .map(category_stats_full['smoothed_mean'])
        .fillna(global_mean)
    )

    return X_train_encoded, X_test_encoded
# Example usage
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
# Load dataset with categorical features
data = fetch_openml('titanic', version=1, as_frame=True, parser='auto')
df = data.frame.dropna(subset=['survived', 'embarked'])
X = df[['embarked', 'pclass']]
y = df['survived'].astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_enc, X_test_enc = target_encode_cv(X_train, y_train, X_test, 'embarked')
print(X_train_enc[['embarked', 'embarked_encoded']].head(10))
Using category_encoders Library
The category_encoders library provides production-ready implementations. Install it with pip install category-encoders.
from category_encoders import TargetEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Using the same Titanic dataset
X_train, X_test, y_train, y_test = train_test_split(
    df[['embarked', 'pclass', 'sex']],
    df['survived'].astype(float),
    test_size=0.2,
    random_state=42
)
# Initialize encoder with smoothing
encoder = TargetEncoder(cols=['embarked', 'sex'], smoothing=10)
# Fit on training data only
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)
print("Encoded training data:")
print(X_train_encoded.head())
# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_encoded, y_train)
y_pred = model.predict(X_test_encoded)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
The TargetEncoder handles smoothing, unseen categories, and multiple columns automatically. The smoothing parameter controls regularization—higher values pull rare categories toward the global mean more aggressively.
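To build intuition for smoothing strength, here is a pandas-only sketch using the additive formula from earlier on the toy housing data (category_encoders’ internal weighting differs in detail, but the direction of the effect is the same):

```python
import pandas as pd

data = pd.DataFrame({
    'city': ['Seattle', 'Seattle', 'Portland', 'Portland', 'Spokane', 'Seattle'],
    'price': [650000, 700000, 450000, 480000, 200000, 680000]
})
global_mean = data['price'].mean()
stats = data.groupby('city')['price'].agg(['mean', 'count'])

# Encode the rare 'Spokane' category under increasingly strong smoothing
spokane = {}
for m in (1, 10, 100):
    smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
    spokane[m] = smoothed['Spokane']
    print(f"m={m:>3}: Spokane -> {spokane[m]:.0f}")
# Larger m pulls the single-observation category toward the global mean
```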
Advanced Techniques and Best Practices
For maximum protection against overfitting, implement k-fold target encoding within your cross-validation strategy:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from category_encoders import TargetEncoder
from xgboost import XGBClassifier

# Create pipeline with target encoding
pipeline = Pipeline([
    ('target_encoder', TargetEncoder(cols=['embarked', 'sex'], smoothing=10)),
    ('classifier', XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'))
])

# Cross-validate; the encoder is refit inside each fold, so no target
# information from the held-out fold leaks into its encoding
scores = cross_val_score(
    pipeline,
    df[['embarked', 'pclass', 'sex']],
    df['survived'].astype(float),
    cv=5,
    scoring='accuracy'
)
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
For leave-one-out encoding, which excludes each sample’s target when computing its encoding:
from category_encoders import LeaveOneOutEncoder
loo_encoder = LeaveOneOutEncoder(cols=['embarked', 'sex'])
X_train_loo = loo_encoder.fit_transform(X_train, y_train)
X_test_loo = loo_encoder.transform(X_test)
Leave-one-out encoding excludes each sample’s own target from its encoding, but on its own it can still leak signal that tree-based models exploit; LeaveOneOutEncoder’s sigma parameter adds Gaussian noise during training to guard against this.
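Conceptually, each row is encoded with the mean of the other rows in its category; a minimal pandas sketch of that computation (not the library’s implementation, and without the optional noise term):

```python
import pandas as pd

# Toy binary-target data
df = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'b'],
                   'y':   [1.0, 0.0, 1.0, 0.0, 0.0]})
grp = df.groupby('cat')['y']

# Subtract each row's own target before averaging the remaining rows.
# Singleton categories would divide by zero and need a fallback in practice.
df['cat_loo'] = (grp.transform('sum') - df['y']) / (grp.transform('count') - 1)
print(df['cat_loo'].tolist())  # [0.5, 1.0, 0.5, 0.0, 0.0]
```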
Real-World Example and Performance Comparison
Let’s compare target encoding against one-hot encoding on a complete workflow:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import roc_auc_score, classification_report
import time
# Prepare data
categorical_cols = ['embarked', 'sex']
X = df[categorical_cols + ['pclass', 'age', 'fare']].copy()
X['age'] = X['age'].fillna(X['age'].median())
X['fare'] = X['fare'].fillna(X['fare'].median())
y = df['survived'].astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Approach 1: One-Hot Encoding
start = time.time()
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_train_ohe = ohe.fit_transform(X_train[categorical_cols])
X_test_ohe = ohe.transform(X_test[categorical_cols])
# Combine with numerical features, keeping row indices aligned
ohe_cols = ohe.get_feature_names_out(categorical_cols)
X_train_ohe_full = pd.concat([
    pd.DataFrame(X_train_ohe, index=X_train.index, columns=ohe_cols),
    X_train[['pclass', 'age', 'fare']]
], axis=1)
X_test_ohe_full = pd.concat([
    pd.DataFrame(X_test_ohe, index=X_test.index, columns=ohe_cols),
    X_test[['pclass', 'age', 'fare']]
], axis=1)
model_ohe = XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')
model_ohe.fit(X_train_ohe_full, y_train)
y_pred_ohe = model_ohe.predict_proba(X_test_ohe_full)[:, 1]
ohe_time = time.time() - start
# Approach 2: Target Encoding
start = time.time()
encoder = TargetEncoder(cols=categorical_cols, smoothing=10)
X_train_te = encoder.fit_transform(X_train, y_train)
X_test_te = encoder.transform(X_test)
model_te = XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')
model_te.fit(X_train_te, y_train)
y_pred_te = model_te.predict_proba(X_test_te)[:, 1]
te_time = time.time() - start
# Compare results
print(f"One-Hot Encoding - AUC: {roc_auc_score(y_test, y_pred_ohe):.4f}, Time: {ohe_time:.2f}s")
print(f"Target Encoding - AUC: {roc_auc_score(y_test, y_pred_te):.4f}, Time: {te_time:.2f}s")
print(f"\nFeature count - OHE: {X_train_ohe_full.shape[1]}, TE: {X_train_te.shape[1]}")
Common Pitfalls and Troubleshooting
Target leakage prevention checklist:
- Never fit the encoder on the full dataset before splitting
- Use cross-validation or leave-one-out encoding for training data
- Encode test data using statistics computed only from training data
- In production, save the fitted encoder and apply it to new data without refitting
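A minimal sketch of the last point, persisting hypothetical training-time statistics with pickle (joblib works equally well for fitted encoder objects):

```python
import pickle
import pandas as pd

# Hypothetical training-time artifact: smoothed mean per category plus the
# global mean as the fallback for categories never seen in training.
encoding = {'means': {'Seattle': 561282.05, 'Portland': 516388.89},
            'global_mean': 526666.67}
with open('city_encoding.pkl', 'wb') as f:
    pickle.dump(encoding, f)

# At inference time: load and apply without ever refitting on new data.
with open('city_encoding.pkl', 'rb') as f:
    loaded = pickle.load(f)
new_cities = pd.Series(['Seattle', 'Boise'])  # 'Boise' is unseen
encoded = new_cities.map(loaded['means']).fillna(loaded['global_mean'])
print(encoded.tolist())  # [561282.05, 526666.67]
```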
Handling unseen categories:
Always specify a default value (typically the global mean) for categories that appear in test data but not in training data. Both the manual implementation above and category_encoders fall back to a sensible default, but verify explicitly that your pipeline does so rather than producing NaNs.
When target encoding might hurt:
- Small datasets where cross-validation folds become too small for reliable statistics
- Low-cardinality features (2-5 categories) where one-hot encoding is simpler and equally effective
- When model interpretability requires understanding the impact of specific categories rather than their aggregate statistics
Target encoding is a powerful technique that bridges categorical and numerical feature engineering. When implemented correctly with proper cross-validation and smoothing, it enables models to leverage high-cardinality categorical features that would otherwise be impractical to encode. The key is vigilance against target leakage—always validate that your encoding strategy prevents information from the target variable leaking into your features in ways that won’t generalize to production data.