How to Label Encode in Pandas
Key Insights
- Label encoding converts categorical text values to integers, which is essential for machine learning algorithms that require numerical input, but it should primarily be used for ordinal data where the order matters.
- Pandas offers two built-in methods (factorize() and cat.codes) that work well for simple encoding tasks, while scikit-learn’s LabelEncoder integrates better with ML pipelines and provides inverse transformation capabilities.
- Always fit your encoder on training data only and handle unseen categories explicitly to prevent data leakage and runtime errors in production.
Introduction
Machine learning algorithms work with numbers, not text. When your dataset contains categorical columns like “color,” “size,” or “region,” you need to convert these string values into numerical representations before feeding them to a model. Label encoding is one of the simplest approaches: it assigns a unique integer to each category.
This article covers three practical methods for label encoding in Pandas, when to use each approach, and how to avoid common mistakes that can break your ML pipeline.
Understanding Categorical Data
Before encoding anything, you need to understand what type of categorical data you’re working with.
Nominal data has no inherent order. Colors (red, blue, green), countries, or product categories are nominal. The values are just labels with no mathematical relationship between them.
Ordinal data has a meaningful order. T-shirt sizes (S, M, L, XL), education levels (high school, bachelor’s, master’s, PhD), or satisfaction ratings (poor, fair, good, excellent) are ordinal. The sequence matters.
Label encoding works best for ordinal data because the resulting integers (0, 1, 2, 3) preserve the order. For nominal data, label encoding can mislead algorithms into thinking there’s a relationship between categories—the model might interpret “blue = 1” as being closer to “red = 0” than “green = 2.” In those cases, one-hot encoding is usually better.
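For nominal columns, pandas’ get_dummies() is the quickest one-hot alternative. A brief sketch for contrast (the rest of this article focuses on label encoding):

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# One indicator column per category; no artificial ordering between colors
one_hot = pd.get_dummies(colors, columns=['color'])
print(one_hot.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

Each color becomes its own column, so no category is numerically “closer” to another.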
Let’s create a sample DataFrame to work with throughout this article:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'color': ['red', 'blue', 'green', 'blue', 'red', 'green'],
'size': ['M', 'L', 'S', 'XL', 'M', 'L'],
'condition': ['good', 'excellent', 'fair', 'good', 'poor', 'excellent'],
'price': [29.99, 34.99, 19.99, 39.99, 24.99, 34.99]
})
print(df)
Output:
product_id color size condition price
0 101 red M good 29.99
1 102 blue L excellent 34.99
2 103 green S fair 19.99
3 104 blue XL good 39.99
4 105 red M poor 24.99
5 106 green L excellent 34.99
Method 1: Using Pandas factorize()
The factorize() function is Pandas’ built-in solution for label encoding. It returns a tuple containing the encoded values and an index of unique categories.
# Basic factorize usage
encoded_values, unique_categories = pd.factorize(df['color'])
print("Encoded values:", encoded_values)
print("Categories:", unique_categories)
Output:
Encoded values: [0 1 2 1 0 2]
Categories: Index(['red', 'blue', 'green'], dtype='object')
To add the encoded column to your DataFrame:
df['color_encoded'] = pd.factorize(df['color'])[0]
print(df[['color', 'color_encoded']])
Output:
color color_encoded
0 red 0
1 blue 1
2 green 2
3 blue 1
4 red 0
5 green 2
The factorize() function assigns integers based on the order of first appearance in the data. This is fine for nominal data but problematic for ordinal data where you need a specific order.
You can also use the sort parameter to assign codes alphabetically:
encoded_sorted, categories_sorted = pd.factorize(df['color'], sort=True)
print("Sorted encoding:", encoded_sorted)
print("Sorted categories:", categories_sorted)
Output:
Sorted encoding: [2 0 1 0 2 1]
Sorted categories: Index(['blue', 'green', 'red'], dtype='object')
When to use factorize(): Quick exploratory analysis, nominal categorical data, or when you don’t need to decode values later.
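That said, decoding factorize() results is still possible without scikit-learn: indexing the returned categories with the codes reconstructs the original values.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'blue', 'red', 'green'])
codes, uniques = pd.factorize(colors)

# uniques.take(codes) maps each integer code back to its label
decoded = uniques.take(codes)
print(decoded.tolist())  # ['red', 'blue', 'green', 'blue', 'red', 'green']
```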
Method 2: Using cat.codes with Category Dtype
Converting a column to Pandas’ category dtype gives you more control, especially for ordinal data where you need to specify the order.
# Basic category conversion
df['color_cat'] = df['color'].astype('category')
df['color_codes'] = df['color_cat'].cat.codes
print(df[['color', 'color_codes']])
For ordinal data, you can specify the exact order:
# Define custom order for sizes
size_order = ['S', 'M', 'L', 'XL']
df['size_ordered'] = pd.Categorical(df['size'], categories=size_order, ordered=True)
df['size_encoded'] = df['size_ordered'].cat.codes
print(df[['size', 'size_encoded']])
Output:
size size_encoded
0 M 1
1 L 2
2 S 0
3 XL 3
4 M 1
5 L 2
Now S=0, M=1, L=2, XL=3—exactly the order we want. This preserves the meaningful relationship between sizes.
Here’s the same approach for the condition column:
condition_order = ['poor', 'fair', 'good', 'excellent']
df['condition_ordered'] = pd.Categorical(
df['condition'],
categories=condition_order,
ordered=True
)
df['condition_encoded'] = df['condition_ordered'].cat.codes
print(df[['condition', 'condition_encoded']])
Output:
condition condition_encoded
0 good 2
1 excellent 3
2 fair 1
3 good 2
4 poor 0
5 excellent 3
When to use cat.codes: Ordinal data where order matters, memory optimization for large datasets (category dtype uses less memory), or when you want to leverage Pandas’ categorical functionality.
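One example of that categorical functionality: an ordered category dtype lets you compare and filter directly on the declared order, not just encode it. A minimal sketch:

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(['M', 'L', 'S', 'XL'],
                                 categories=['S', 'M', 'L', 'XL'],
                                 ordered=True))

# Ordered categoricals support comparisons against a category label
large = sizes[sizes > 'M']
print(large.tolist())  # ['L', 'XL']
```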
Method 3: Using Scikit-learn’s LabelEncoder
Scikit-learn’s LabelEncoder is the standard choice when building ML pipelines because it provides fit/transform semantics and inverse transformation.
from sklearn.preprocessing import LabelEncoder
# Create and fit the encoder
le = LabelEncoder()
df['color_sklearn'] = le.fit_transform(df['color'])
print("Classes:", le.classes_)
print(df[['color', 'color_sklearn']])
Output:
Classes: ['blue' 'green' 'red']
color color_sklearn
0 red 2
1 blue 0
2 green 1
3 blue 0
4 red 2
5 green 1
The real power of LabelEncoder is inverse transformation—converting integers back to original labels:
# Decode the values
original_labels = le.inverse_transform(df['color_sklearn'])
print("Decoded:", original_labels)
Output:
Decoded: ['red' 'blue' 'green' 'blue' 'red' 'green']
This is crucial for model interpretation. After your model makes predictions, you often need to convert encoded values back to human-readable labels.
When to use LabelEncoder: ML pipelines, when you need inverse transformation, or when consistency with scikit-learn’s API matters.
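In a real pipeline the decode step comes after prediction. The integer predictions below are hypothetical stand-ins for a classifier’s output:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['blue', 'green', 'red'])

# Suppose a classifier returned these encoded class predictions
predictions = np.array([2, 0, 1, 0])

labels = le.inverse_transform(predictions)
print(labels)  # ['red' 'blue' 'green' 'blue']
```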
Handling Multiple Columns
Real datasets have multiple categorical columns. Here’s how to encode them efficiently while keeping track of encoders for later decoding.
from sklearn.preprocessing import LabelEncoder
# Identify categorical columns
categorical_cols = ['color', 'size', 'condition']
# Store encoders for each column
encoders = {}
# Create a copy for encoded data
df_encoded = df.copy()
for col in categorical_cols:
le = LabelEncoder()
df_encoded[f'{col}_encoded'] = le.fit_transform(df_encoded[col])
encoders[col] = le
print(df_encoded[['color', 'color_encoded', 'size', 'size_encoded',
'condition', 'condition_encoded']])
Output:
color color_encoded size size_encoded condition condition_encoded
0 red 2 M 1 good 2
1 blue 0 L 0 excellent 0
2 green 1 S 2 fair 1
3 blue 0 XL 3 good 2
4 red 2 M 1 poor 3
5 green 1 L 0 excellent 0
To decode any column later:
# Decode the size column
decoded_sizes = encoders['size'].inverse_transform(df_encoded['size_encoded'])
print("Decoded sizes:", decoded_sizes)
To wrap this logic in a reusable function:
def encode_categorical(df, columns):
"""Encode multiple categorical columns and return encoders."""
df_encoded = df.copy()
encoders = {}
for col in columns:
le = LabelEncoder()
df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
encoders[col] = le
return df_encoded, encoders
# Usage
df_final, all_encoders = encode_categorical(df, ['color', 'size', 'condition'])
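A matching decode helper (a hypothetical companion to encode_categorical, redefined here so the sketch is self-contained) reverses every column at once:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categorical(df, columns):
    """Encode multiple categorical columns and return encoders."""
    df_encoded = df.copy()
    encoders = {}
    for col in columns:
        le = LabelEncoder()
        df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
        encoders[col] = le
    return df_encoded, encoders

def decode_categorical(df, encoders):
    """Reverse the encoding for every column in the encoders dict."""
    df_decoded = df.copy()
    for col, le in encoders.items():
        df_decoded[col] = le.inverse_transform(df_decoded[col])
    return df_decoded

df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': ['M', 'L', 'M']})
encoded, encs = encode_categorical(df, ['color', 'size'])
restored = decode_categorical(encoded, encs)
print(restored.equals(df))  # True
```

The round trip confirms that nothing is lost as long as you keep the encoders dictionary alongside the encoded DataFrame.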
Best Practices and Common Pitfalls
Handling NaN Values
Scikit-learn’s LabelEncoder doesn’t handle NaN values gracefully: fitting a column that mixes strings and NaN typically raises a TypeError. Always deal with missing data first:
# Create data with NaN
df_with_nan = df.copy()
df_with_nan.loc[2, 'color'] = np.nan
# Option 1: Fill with a placeholder
df_filled = df_with_nan.copy()
df_filled['color'] = df_filled['color'].fillna('unknown')
le = LabelEncoder()
df_filled['color_encoded'] = le.fit_transform(df_filled['color'])
# Option 2: Encode then replace NaN positions with -1
df_nan_handled = df_with_nan.copy()
nan_mask = df_nan_handled['color'].isna()
le = LabelEncoder()
df_nan_handled.loc[~nan_mask, 'color_encoded'] = le.fit_transform(
df_nan_handled.loc[~nan_mask, 'color']
)
df_nan_handled.loc[nan_mask, 'color_encoded'] = -1
# Partial .loc assignment yields a float column; cast back to int
df_nan_handled['color_encoded'] = df_nan_handled['color_encoded'].astype(int)
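Pandas’ own methods are more forgiving here: both factorize() and cat.codes use -1 as the sentinel for missing values by default, which gives you option 2’s behavior for free.

```python
import pandas as pd
import numpy as np

colors = pd.Series(['red', 'blue', np.nan, 'red'])

# factorize assigns -1 to NaN by default
codes, uniques = pd.factorize(colors)
print(codes)  # [ 0  1 -1  0]

# cat.codes does the same for the category dtype
cat_codes = colors.astype('category').cat.codes
print(cat_codes.tolist())  # [1, 0, -1, 1]
```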
Handling Unseen Categories in Test Data
This is where most pipelines break. If your test data contains categories that weren’t in the training data, transform() will raise an error.
# Training data
train_colors = ['red', 'blue', 'green']
le = LabelEncoder()
le.fit(train_colors)
# Test data with unseen category
test_colors = ['red', 'blue', 'yellow'] # 'yellow' is new
# This will fail:
# le.transform(test_colors) # ValueError!
# Solution: Handle unseen categories explicitly
def safe_transform(encoder, values, unknown_value=-1):
"""Transform values, assigning unknown_value to unseen categories."""
known_classes = set(encoder.classes_)
result = []
for val in values:
if val in known_classes:
result.append(encoder.transform([val])[0])
else:
result.append(unknown_value)
return np.array(result)
encoded_test = safe_transform(le, test_colors)
print("Safe encoded:", encoded_test) # [2 0 -1]
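The per-value loop above is clear but slow on large columns, since it calls transform() once per element. A vectorized sketch of the same idea (not part of scikit-learn’s API) builds a plain dict from the encoder’s classes and maps the whole column in one pass:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['red', 'blue', 'green'])

def safe_transform_vectorized(encoder, values, unknown_value=-1):
    """Map known categories to their codes; unseen ones get unknown_value."""
    mapping = {cls: code for code, cls in enumerate(encoder.classes_)}
    return pd.Series(values).map(mapping).fillna(unknown_value).astype(int).to_numpy()

encoded = safe_transform_vectorized(le, ['red', 'blue', 'yellow'])
print(encoded)  # [ 2  0 -1]
```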
Train/Test Consistency
Always fit encoders on training data only, then transform both train and test:
from sklearn.model_selection import train_test_split
# Split data
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)
# Fit on training data only
le = LabelEncoder()
le.fit(train_df['color'])
# Transform both sets
train_df = train_df.copy()
test_df = test_df.copy()
train_df['color_encoded'] = le.transform(train_df['color'])
test_df['color_encoded'] = le.transform(test_df['color'])
Note that with a sample this small, the split could leave a color only in the test set, in which case transform() raises the unseen-category error covered above. More importantly, fitting on the entire dataset before splitting causes data leakage: your encoder “sees” test data categories during training, which can inflate your model’s apparent performance.
Memory Considerations
For large datasets with high-cardinality categorical columns, consider using Pandas’ category dtype to reduce memory usage:
# Compare memory usage
df_regular = pd.DataFrame({'col': ['cat_' + str(i % 100) for i in range(1000000)]})
df_category = df_regular.copy()
df_category['col'] = df_category['col'].astype('category')
print(f"Regular: {df_regular.memory_usage(deep=True).sum() / 1e6:.2f} MB")
print(f"Category: {df_category.memory_usage(deep=True).sum() / 1e6:.2f} MB")
Label encoding is straightforward, but the details matter. Choose the right method for your use case, handle edge cases explicitly, and always maintain consistency between training and inference. Your future self debugging a production model will thank you.