How to Implement Ordinal Encoding in Python


Key Insights

  • Ordinal encoding preserves natural order in categorical data (like “small < medium < large”), making it essential for features where the sequence matters—unlike one-hot encoding which treats categories as independent.
  • Python offers three main approaches: manual mapping with pandas for full control, scikit-learn’s OrdinalEncoder for production pipelines, and pandas categorical dtype for memory-efficient data analysis.
  • The biggest pitfall is inconsistent encoding between training and test sets; always fit your encoder on training data only and save it for deployment to ensure reproducibility.

Introduction to Ordinal Encoding

Ordinal encoding converts categorical variables with inherent order into numerical values while preserving their ranking. Unlike one-hot encoding, which creates binary columns for each category, ordinal encoding assigns a single integer to each category based on its position in the hierarchy.

Use ordinal encoding when your categories have a meaningful sequence: education levels (High School < Bachelor’s < Master’s < PhD), satisfaction ratings (Poor < Fair < Good < Excellent), or clothing sizes (XS < S < M < L < XL). The algorithm can then interpret the numerical relationship between values, which is crucial for tree-based models and some linear models.

Here’s what ordinal encoding looks like in practice:

import pandas as pd

# Raw categorical data
data = pd.DataFrame({
    'size': ['M', 'L', 'S', 'XL', 'M', 'S'],
    'satisfaction': ['Good', 'Excellent', 'Poor', 'Good', 'Fair', 'Excellent']
})

print("Before encoding:")
print(data)

# After ordinal encoding (we'll implement this shortly)
# size: S=0, M=1, L=2, XL=3
# satisfaction: Poor=0, Fair=1, Good=2, Excellent=3

The key difference from one-hot encoding: ordinal encoding maintains that XL (3) > L (2) > M (1), which is meaningful for ordered data but would be nonsensical for truly nominal categories like colors or country names.
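The contrast is easy to see side by side; here is a quick sketch using pandas' get_dummies for the one-hot case:

```python
import pandas as pd

sizes = pd.DataFrame({'size': ['S', 'M', 'L']})

# One-hot: one independent binary column per category, no notion of order
one_hot = pd.get_dummies(sizes['size'])

# Ordinal: a single integer column that preserves S < M < L
ordinal = sizes['size'].map({'S': 0, 'M': 1, 'L': 2})

print(one_hot.shape)      # three columns for three categories
print(ordinal.tolist())
```

Three categories cost three one-hot columns but only one ordinal column, and only the ordinal version lets a model exploit the ranking.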

Manual Ordinal Encoding with Pandas

For complete control over your encoding scheme, manual mapping with pandas is the most transparent approach. You define the exact order and use .map() or .replace() to transform your data.

import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'size': ['M', 'L', 'S', 'XL', 'M'],
    'education': ['Bachelor', 'Master', 'High School', 'PhD', 'Bachelor']
})

# Define ordinal mappings
size_mapping = {
    'S': 0,
    'M': 1,
    'L': 2,
    'XL': 3
}

education_mapping = {
    'High School': 0,
    'Bachelor': 1,
    'Master': 2,
    'PhD': 3
}

# Apply mappings
df['size_encoded'] = df['size'].map(size_mapping)
df['education_encoded'] = df['education'].map(education_mapping)

print(df)

This approach is excellent for small datasets or when you need domain-specific ordering that might not be alphabetical. The downside? You’re responsible for handling missing values and unknown categories manually:

# Handle unknown categories: unmapped values become NaN, so fill and cast back to int
df['size_encoded'] = df['size'].map(size_mapping).fillna(-1).astype(int)

# Or raise an error if unexpected values appear
def safe_map(series, mapping):
    result = series.map(mapping)
    if result.isna().any():
        unknown = series[result.isna()].unique()
        raise ValueError(f"Unknown categories found: {unknown}")
    return result

df['size_encoded'] = safe_map(df['size'], size_mapping)
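A caution on the .replace() alternative mentioned earlier: unlike .map(), which turns unmapped values into NaN (easy to detect), .replace() passes unknown categories through unchanged, which can silently hide bad data. A small sketch of the difference:

```python
import pandas as pd

s = pd.Series(['S', 'M', 'XXL'])  # 'XXL' is not in the mapping
mapping = {'S': 0, 'M': 1, 'L': 2, 'XL': 3}

mapped = s.map(mapping)        # unknown -> NaN, detectable with isna()
replaced = s.replace(mapping)  # unknown passes through as the string 'XXL'

print(mapped.tolist())
print(replaced.tolist())
```

If you use .replace(), validate afterwards that every value in the result is numeric.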

Using Scikit-learn’s OrdinalEncoder

For production machine learning pipelines, scikit-learn’s OrdinalEncoder provides a robust, reusable solution. It integrates seamlessly with other sklearn transformers and handles edge cases more gracefully.

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
    'size': ['M', 'L', 'S', 'XL', 'M', 'S'],
    'satisfaction': ['Good', 'Excellent', 'Poor', 'Good', 'Fair', 'Excellent']
})

# Define the order for each feature
size_order = ['S', 'M', 'L', 'XL']
satisfaction_order = ['Poor', 'Fair', 'Good', 'Excellent']

# Create encoder with explicit categories
encoder = OrdinalEncoder(
    categories=[size_order, satisfaction_order],
    handle_unknown='use_encoded_value',
    unknown_value=-1
)

# Fit and transform
encoded = encoder.fit_transform(df[['size', 'satisfaction']])

# Create DataFrame with encoded values
df_encoded = pd.DataFrame(
    encoded,
    columns=['size_encoded', 'satisfaction_encoded']
)

print("Encoded data:")
print(df_encoded)

# Inverse transform to get original values back
original = encoder.inverse_transform(encoded)
print("\nInverse transformed:")
print(original)

The handle_unknown='use_encoded_value' parameter is crucial for production systems. When your model encounters a new category (like ‘XXL’ for size), it assigns the specified unknown_value instead of crashing:

# Test with unknown categories
new_data = pd.DataFrame({
    'size': ['XXL', 'M'],
    'satisfaction': ['Outstanding', 'Good']
})

encoded_new = encoder.transform(new_data[['size', 'satisfaction']])
print(encoded_new)  # Unknown categories get -1

Using Category Dtype in Pandas

Pandas’ categorical dtype offers a memory-efficient alternative that’s particularly useful for data analysis workflows. It stores categories once and uses integer codes internally.

import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'education': ['Bachelor', 'Master', 'High School', 'PhD', 'Bachelor', 'Master']
})

# Convert to ordered categorical
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
df['education'] = pd.Categorical(
    df['education'],
    categories=education_order,
    ordered=True
)

# Extract numeric codes
df['education_code'] = df['education'].cat.codes

print(df)
print(f"\nData type: {df['education'].dtype}")
print(f"Is ordered: {df['education'].cat.ordered}")

# You can now do ordered comparisons
print("\nEducation level >= Bachelor:")
print(df[df['education'] >= 'Bachelor'])

This approach shines when you need to perform categorical operations while maintaining order:

# Create satisfaction ratings
df = pd.DataFrame({
    'rating': ['Good', 'Excellent', 'Poor', 'Good', 'Fair']
})

df['rating'] = pd.Categorical(
    df['rating'],
    categories=['Poor', 'Fair', 'Good', 'Excellent'],
    ordered=True
)

# Filter for ratings above Fair
high_ratings = df[df['rating'] > 'Fair']
print(high_ratings)

# Get numeric codes for modeling
df['rating_encoded'] = df['rating'].cat.codes
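The memory-efficiency claim is easy to verify yourself; a quick comparison on a repetitive string column (sizes here are illustrative and will vary by pandas version):

```python
import pandas as pd

# A long, repetitive string column -- the typical shape of categorical data
ratings = pd.Series(['Poor', 'Fair', 'Good', 'Excellent'] * 10_000)

# deep=True counts the actual string storage, not just the pointers
as_object = ratings.memory_usage(deep=True)
as_category = ratings.astype('category').memory_usage(deep=True)

print(f"object dtype:      {as_object:,} bytes")
print(f"categorical dtype: {as_category:,} bytes")
```

With only four distinct values, the categorical version stores each string once plus a small integer code per row, so it comes out dramatically smaller.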

Handling Edge Cases and Best Practices

The most common mistake in ordinal encoding is fitting your encoder on the entire dataset, including your test set. This causes data leakage and unrealistic performance estimates:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Sample data
df = pd.DataFrame({
    'size': ['S', 'M', 'L', 'XL'] * 25,
    'target': [0, 0, 1, 1] * 25
})

# WRONG: Fitting on all data (the encoder sees the test set's categories)
# Note: without explicit categories, OrdinalEncoder orders alphabetically
# (L=0, M=1, S=2, XL=3), which destroys the intended S < M < L < XL ranking
encoder_wrong = OrdinalEncoder()
df['size_encoded'] = encoder_wrong.fit_transform(df[['size']]).ravel()
X_train, X_test, y_train, y_test = train_test_split(
    df[['size_encoded']], df['target'], test_size=0.2
)

# CORRECT: Fit only on training data
X_train, X_test, y_train, y_test = train_test_split(
    df[['size']], df['target'], test_size=0.2
)

encoder_correct = OrdinalEncoder(
    categories=[['S', 'M', 'L', 'XL']],  # explicit order, not alphabetical
    handle_unknown='use_encoded_value',
    unknown_value=-1
)
X_train_encoded = encoder_correct.fit_transform(X_train)
X_test_encoded = encoder_correct.transform(X_test)  # Only transform, not fit_transform

Always save your encoders for production deployment:

from joblib import dump, load
from sklearn.preprocessing import OrdinalEncoder

# Train and save encoder
encoder = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
encoder.fit(df[['size']])

# Save to disk
dump(encoder, 'ordinal_encoder.joblib')

# Later, in production
loaded_encoder = load('ordinal_encoder.joblib')
new_data = pd.DataFrame({'size': ['M', 'L']})
encoded = loaded_encoder.transform(new_data)

Handle missing values explicitly before encoding:

# Strategy 1: Treat missing as a separate category
df['size'] = df['size'].fillna('Unknown')
categories = ['Unknown', 'S', 'M', 'L', 'XL']

# Strategy 2: Impute with most frequent value
df['size'] = df['size'].fillna(df['size'].mode()[0])

# Strategy 3: Fill with a placeholder that is absent from the encoder's
# categories, so handle_unknown maps it to -1. This only works if you pass
# explicit categories; with the default 'auto', the placeholder would be
# learned as a regular category at fit time.
encoder = OrdinalEncoder(
    categories=[['S', 'M', 'L', 'XL']],
    handle_unknown='use_encoded_value',
    unknown_value=-1
)
df['size'] = df['size'].fillna('MISSING_VALUE')

Complete Real-World Example

Let’s build a complete pipeline for a customer satisfaction survey dataset:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from joblib import dump

# Create sample customer survey data
np.random.seed(42)
n_samples = 1000

df = pd.DataFrame({
    'customer_id': range(n_samples),
    'service_quality': np.random.choice(['Poor', 'Fair', 'Good', 'Excellent'], n_samples),
    'product_satisfaction': np.random.choice(['Very Unsatisfied', 'Unsatisfied', 'Neutral', 'Satisfied', 'Very Satisfied'], n_samples),
    'support_response': np.random.choice(['Slow', 'Moderate', 'Fast'], n_samples),
    'will_recommend': np.random.choice([0, 1], n_samples)  # Target variable
})

# Define ordinal categories
service_quality_order = ['Poor', 'Fair', 'Good', 'Excellent']
satisfaction_order = ['Very Unsatisfied', 'Unsatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']
response_order = ['Slow', 'Moderate', 'Fast']

# Split data first
X = df[['service_quality', 'product_satisfaction', 'support_response']]
y = df['will_recommend']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit encoder on training data only
encoder = OrdinalEncoder(
    categories=[service_quality_order, satisfaction_order, response_order],
    handle_unknown='use_encoded_value',
    unknown_value=-1
)

X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_encoded, y_train)

# Evaluate
y_pred = model.predict(X_test_encoded)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.3f}")

# Save encoder and model for production
dump(encoder, 'survey_encoder.joblib')
dump(model, 'survey_model.joblib')

# Simulate production prediction
new_customer = pd.DataFrame({
    'service_quality': ['Good'],
    'product_satisfaction': ['Satisfied'],
    'support_response': ['Fast']
})

new_customer_encoded = encoder.transform(new_customer)
prediction = model.predict(new_customer_encoded)
print(f"\nWill customer recommend? {bool(prediction[0])}")

This complete example demonstrates the entire workflow: proper train-test splitting, fitting encoders only on training data, model training, and saving artifacts for production use. The ordinal encoding preserves the meaningful order in satisfaction ratings, allowing the model to learn that “Excellent” service quality has more positive impact than “Good,” which is more positive than “Fair.”
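One refinement worth considering: wrapping the encoder and model in a scikit-learn Pipeline, so a single saved artifact handles both encoding and prediction and the two can never drift apart. A minimal sketch with illustrative data:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier

# Illustrative training data with one ordinal feature
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'service_quality': rng.choice(['Poor', 'Fair', 'Good', 'Excellent'], 200),
})
y = rng.integers(0, 2, 200)

# Encoder and model travel together: one fit, one predict, one artifact to save
pipeline = Pipeline([
    ('encode', OrdinalEncoder(
        categories=[['Poor', 'Fair', 'Good', 'Excellent']],
        handle_unknown='use_encoded_value',
        unknown_value=-1
    )),
    ('model', RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X, y)

# Raw categorical input goes straight in -- no separate encoding step
pred = pipeline.predict(pd.DataFrame({'service_quality': ['Good']}))
print(pred)
```

Saving this pipeline with joblib replaces the two separate dump() calls above and removes the risk of loading a model with the wrong encoder.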

Remember to document your encoding schemes and validate that the ordinal relationships make sense for your domain. Not all categorical variables should be ordinally encoded—use one-hot encoding for truly nominal features like product categories or geographic regions where no inherent order exists.
