How to Label Encode in Pandas
Key Insights
- Label encoding converts categorical text values to integers, which is essential for machine learning algorithms that require numerical input, but it should primarily be used for ordinal data where the order matters.
- Pandas offers two built-in methods (factorize() and cat.codes) that work well for simple encoding tasks, while scikit-learn’s LabelEncoder integrates better with ML pipelines and provides inverse transformation capabilities.
- Always fit your encoder on training data only and handle unseen categories explicitly to prevent data leakage and runtime errors in production.
Introduction
Machine learning algorithms work with numbers, not text. When your dataset contains categorical columns like “color,” “size,” or “region,” you need to convert these string values into numerical representations before feeding them to a model. Label encoding is one of the simplest approaches: it assigns a unique integer to each category.
This article covers three practical methods for label encoding in Pandas, when to use each approach, and how to avoid common mistakes that can break your ML pipeline.
Understanding Categorical Data
Before encoding anything, you need to understand what type of categorical data you’re working with.
Nominal data has no inherent order. Colors (red, blue, green), countries, or product categories are nominal. The values are just labels with no mathematical relationship between them.
Ordinal data has a meaningful order. T-shirt sizes (S, M, L, XL), education levels (high school, bachelor’s, master’s, PhD), or satisfaction ratings (poor, fair, good, excellent) are ordinal. The sequence matters.
Label encoding works best for ordinal data because the resulting integers (0, 1, 2, 3) preserve the order. For nominal data, label encoding can mislead algorithms into thinking there’s a relationship between categories—the model might interpret “blue = 1” as being closer to “red = 0” than “green = 2.” In those cases, one-hot encoding is usually better.
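For nominal columns, pandas’ get_dummies() is the quickest one-hot alternative. A brief sketch for contrast (the rest of this article focuses on label encoding):

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# One indicator column per category; no artificial ordering between colors
one_hot = pd.get_dummies(colors, columns=['color'])
print(one_hot.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

Each color becomes its own column, so no category is numerically “closer” to another.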
Let’s create a sample DataFrame to work with throughout this article:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'color': ['red', 'blue', 'green', 'blue', 'red', 'green'],
'size': ['M', 'L', 'S', 'XL', 'M', 'L'],
'condition': ['good', 'excellent', 'fair', 'good', 'poor', 'excellent'],
'price': [29.99, 34.99, 19.99, 39.99, 24.99, 34.99]
})
print(df)
Output:
product_id color size condition price
0 101 red M good 29.99
1 102 blue L excellent 34.99
2 103 green S fair 19.99
3 104 blue XL good 39.99
4 105 red M poor 24.99
5 106 green L excellent 34.99
Method 1: Using Pandas factorize()
The factorize() function is Pandas’ built-in solution for label encoding. It returns a tuple containing the encoded values and an index of unique categories.
# Basic factorize usage
encoded_values, unique_categories = pd.factorize(df['color'])
print("Encoded values:", encoded_values)
print("Categories:", unique_categories)
Output:
Encoded values: [0 1 2 1 0 2]
Categories: Index(['red', 'blue', 'green'], dtype='object')
To add the encoded column to your DataFrame:
df['color_encoded'] = pd.factorize(df['color'])[0]
print(df[['color', 'color_encoded']])
Output:
color color_encoded
0 red 0
1 blue 1
2 green 2
3 blue 1
4 red 0
5 green 2
The factorize() function assigns integers based on the order of first appearance in the data. This is fine for nominal data but problematic for ordinal data where you need a specific order.
You can also use the sort parameter to assign codes alphabetically:
encoded_sorted, categories_sorted = pd.factorize(df['color'], sort=True)
print("Sorted encoding:", encoded_sorted)
print("Sorted categories:", categories_sorted)
Output:
Sorted encoding: [2 0 1 0 2 1]
Sorted categories: Index(['blue', 'green', 'red'], dtype='object')
When to use factorize(): Quick exploratory analysis, nominal categorical data, or when you don’t need to decode values later.
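That said, decoding factorize() results is still possible without scikit-learn: indexing the returned categories with the codes reconstructs the original values.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'blue', 'red', 'green'])
codes, uniques = pd.factorize(colors)

# uniques.take(codes) maps each integer code back to its label
decoded = uniques.take(codes)
print(decoded.tolist())  # ['red', 'blue', 'green', 'blue', 'red', 'green']
```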
Method 2: Using cat.codes with Category Dtype
Converting a column to Pandas’ category dtype gives you more control, especially for ordinal data where you need to specify the order.
# Basic category conversion
df['color_cat'] = df['color'].astype('category')
df['color_codes'] = df['color_cat'].cat.codes
print(df[['color', 'color_codes']])
For ordinal data, you can specify the exact order:
# Define custom order for sizes
size_order = ['S', 'M', 'L', 'XL']
df['size_ordered'] = pd.Categorical(df['size'], categories=size_order, ordered=True)
df['size_encoded'] = df['size_ordered'].cat.codes
print(df[['size', 'size_encoded']])
Output:
size size_encoded
0 M 1
1 L 2
2 S 0
3 XL 3
4 M 1
5 L 2
Now S=0, M=1, L=2, XL=3—exactly the order we want. This preserves the meaningful relationship between sizes.
Here’s the same approach for the condition column:
condition_order = ['poor', 'fair', 'good', 'excellent']
df['condition_ordered'] = pd.Categorical(
df['condition'],
categories=condition_order,
ordered=True
)
df['condition_encoded'] = df['condition_ordered'].cat.codes
print(df[['condition', 'condition_encoded']])
Output:
condition condition_encoded
0 good 2
1 excellent 3
2 fair 1
3 good 2
4 poor 0
5 excellent 3
When to use cat.codes: Ordinal data where order matters, memory optimization for large datasets (category dtype uses less memory), or when you want to leverage Pandas’ categorical functionality.
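One example of that categorical functionality: an ordered category dtype lets you compare and filter directly on the declared order, not just encode it. A minimal sketch:

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(['M', 'L', 'S', 'XL'],
                                 categories=['S', 'M', 'L', 'XL'],
                                 ordered=True))

# Ordered categoricals support comparisons against a category label
large = sizes[sizes > 'M']
print(large.tolist())  # ['L', 'XL']
```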
Method 3: Using Scikit-learn’s LabelEncoder
Scikit-learn’s LabelEncoder is the standard choice when building ML pipelines because it provides fit/transform semantics and inverse transformation.
from sklearn.preprocessing import LabelEncoder
# Create and fit the encoder
le = LabelEncoder()
df['color_sklearn'] = le.fit_transform(df['color'])
print("Classes:", le.classes_)
print(df[['color', 'color_sklearn']])
Output:
Classes: ['blue' 'green' 'red']
color color_sklearn
0 red 2
1 blue 0
2 green 1
3 blue 0
4 red 2
5 green 1
The real power of LabelEncoder is inverse transformation—converting integers back to original labels:
# Decode the values
original_labels = le.inverse_transform(df['color_sklearn'])
print("Decoded:", original_labels)
Output:
Decoded: ['red' 'blue' 'green' 'blue' 'red' 'green']
This is crucial for model interpretation. After your model makes predictions, you often need to convert encoded values back to human-readable labels.
When to use LabelEncoder: ML pipelines, when you need inverse transformation, or when consistency with scikit-learn’s API matters.
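In a real pipeline the decode step comes after prediction. The integer predictions below are hypothetical stand-ins for a classifier’s output:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['blue', 'green', 'red'])

# Suppose a classifier returned these encoded class predictions
predictions = np.array([2, 0, 1, 0])

labels = le.inverse_transform(predictions)
print(labels)  # ['red' 'blue' 'green' 'blue']
```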
Handling Multiple Columns
Real datasets have multiple categorical columns. Here’s how to encode them efficiently while keeping track of encoders for later decoding.
from sklearn.preprocessing import LabelEncoder
# Identify categorical columns
categorical_cols = ['color', 'size', 'condition']
# Store encoders for each column
encoders = {}
# Create a copy for encoded data
df_encoded = df.copy()
for col in categorical_cols:
le = LabelEncoder()
df_encoded[f'{col}_encoded'] = le.fit_transform(df_encoded[col])
encoders[col] = le
print(df_encoded[['color', 'color_encoded', 'size', 'size_encoded',
'condition', 'condition_encoded']])
Output:
color color_encoded size size_encoded condition condition_encoded
0 red 2 M 1 good 2
1 blue 0 L 0 excellent 0
2 green 1 S 2 fair 1
3 blue 0 XL 3 good 2
4 red 2 M 1 poor 3
5 green 1 L 0 excellent 0
To decode any column later:
# Decode the size column
decoded_sizes = encoders['size'].inverse_transform(df_encoded['size_encoded'])
print("Decoded sizes:", decoded_sizes)
To wrap this logic in a reusable function:
def encode_categorical(df, columns):
"""Encode multiple categorical columns and return encoders."""
df_encoded = df.copy()
encoders = {}
for col in columns:
le = LabelEncoder()
df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
encoders[col] = le
return df_encoded, encoders
# Usage
df_final, all_encoders = encode_categorical(df, ['color', 'size', 'condition'])
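A matching decode helper (a hypothetical companion to encode_categorical, redefined here so the sketch is self-contained) reverses every column at once:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categorical(df, columns):
    """Encode multiple categorical columns and return encoders."""
    df_encoded = df.copy()
    encoders = {}
    for col in columns:
        le = LabelEncoder()
        df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
        encoders[col] = le
    return df_encoded, encoders

def decode_categorical(df, encoders):
    """Reverse the encoding for every column in the encoders dict."""
    df_decoded = df.copy()
    for col, le in encoders.items():
        df_decoded[col] = le.inverse_transform(df_decoded[col])
    return df_decoded

df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': ['M', 'L', 'M']})
encoded, encs = encode_categorical(df, ['color', 'size'])
restored = decode_categorical(encoded, encs)
print(restored.equals(df))  # True
```

The round trip confirms that nothing is lost as long as you keep the encoders dictionary alongside the encoded DataFrame.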
Best Practices and Common Pitfalls
Handling NaN Values
Scikit-learn’s LabelEncoder doesn’t handle NaN values gracefully: fitting a column that mixes strings and NaN typically raises a TypeError. Always deal with missing data first:
# Create data with NaN
df_with_nan = df.copy()
df_with_nan.loc[2, 'color'] = np.nan
# Option 1: Fill with a placeholder
df_filled = df_with_nan.copy()
df_filled['color'] = df_filled['color'].fillna('unknown')
le = LabelEncoder()
df_filled['color_encoded'] = le.fit_transform(df_filled['color'])
# Option 2: Encode then replace NaN positions with -1
df_nan_handled = df_with_nan.copy()
nan_mask = df_nan_handled['color'].isna()
le = LabelEncoder()
df_nan_handled.loc[~nan_mask, 'color_encoded'] = le.fit_transform(
df_nan_handled.loc[~nan_mask, 'color']
)
df_nan_handled.loc[nan_mask, 'color_encoded'] = -1
# Partial .loc assignment yields a float column; cast back to int
df_nan_handled['color_encoded'] = df_nan_handled['color_encoded'].astype(int)
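Pandas’ own methods are more forgiving here: both factorize() and cat.codes use -1 as the sentinel for missing values by default, which gives you option 2’s behavior for free.

```python
import pandas as pd
import numpy as np

colors = pd.Series(['red', 'blue', np.nan, 'red'])

# factorize assigns -1 to NaN by default
codes, uniques = pd.factorize(colors)
print(codes)  # [ 0  1 -1  0]

# cat.codes does the same for the category dtype
cat_codes = colors.astype('category').cat.codes
print(cat_codes.tolist())  # [1, 0, -1, 1]
```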
Handling Unseen Categories in Test Data
This is where most pipelines break. If your test data contains categories that weren’t in the training data, transform() will raise an error.
# Training data
train_colors = ['red', 'blue', 'green']
le = LabelEncoder()
le.fit(train_colors)
# Test data with unseen category
test_colors = ['red', 'blue', 'yellow'] # 'yellow' is new
# This will fail:
# le.transform(test_colors) # ValueError!
# Solution: Handle unseen categories explicitly
def safe_transform(encoder, values, unknown_value=-1):
"""Transform values, assigning unknown_value to unseen categories."""
known_classes = set(encoder.classes_)
result = []
for val in values:
if val in known_classes:
result.append(encoder.transform([val])[0])
else:
result.append(unknown_value)
return np.array(result)
encoded_test = safe_transform(le, test_colors)
print("Safe encoded:", encoded_test) # [2 0 -1]
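The per-value loop above is clear but slow on large columns, since it calls transform() once per element. A vectorized sketch of the same idea (not part of scikit-learn’s API) builds a plain dict from the encoder’s classes and maps the whole column in one pass:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['red', 'blue', 'green'])

def safe_transform_vectorized(encoder, values, unknown_value=-1):
    """Map known categories to their codes; unseen ones get unknown_value."""
    mapping = {cls: code for code, cls in enumerate(encoder.classes_)}
    return pd.Series(values).map(mapping).fillna(unknown_value).astype(int).to_numpy()

encoded = safe_transform_vectorized(le, ['red', 'blue', 'yellow'])
print(encoded)  # [ 2  0 -1]
```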
Train/Test Consistency
Always fit encoders on training data only, then transform both train and test:
from sklearn.model_selection import train_test_split
# Split data
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)
# Fit on training data only
le = LabelEncoder()
le.fit(train_df['color'])
# Transform both sets
train_df = train_df.copy()
test_df = test_df.copy()
train_df['color_encoded'] = le.transform(train_df['color'])
test_df['color_encoded'] = le.transform(test_df['color'])
Note that with a sample this small, the split could leave a color only in the test set, in which case transform() raises the unseen-category error covered above. More importantly, fitting on the entire dataset before splitting causes data leakage: your encoder “sees” test data categories during training, which can inflate your model’s apparent performance.
Memory Considerations
For large datasets with high-cardinality categorical columns, consider using Pandas’ category dtype to reduce memory usage:
# Compare memory usage
df_regular = pd.DataFrame({'col': ['cat_' + str(i % 100) for i in range(1000000)]})
df_category = df_regular.copy()
df_category['col'] = df_category['col'].astype('category')
print(f"Regular: {df_regular.memory_usage(deep=True).sum() / 1e6:.2f} MB")
print(f"Category: {df_category.memory_usage(deep=True).sum() / 1e6:.2f} MB")
Label encoding is straightforward, but the details matter. Choose the right method for your use case, handle edge cases explicitly, and always maintain consistency between training and inference. Your future self debugging a production model will thank you.