How to Handle Imbalanced Classes in Python


Key Insights

  • Class imbalance causes models to achieve high accuracy while completely failing to predict minority classes—a 99% accurate fraud detector that never catches fraud is useless
  • SMOTE and class weighting are your first-line defenses: SMOTE generates synthetic minority samples while class weighting penalizes misclassifications differently without changing your data
  • Always evaluate imbalanced models with precision, recall, and F1-score instead of accuracy, and use stratified cross-validation to ensure both classes appear in every fold

Understanding Class Imbalance

Class imbalance occurs when one class significantly outnumbers another in your training data. In fraud detection, legitimate transactions might outnumber fraudulent ones 99-to-1. In medical diagnosis, healthy patients vastly outnumber those with rare diseases. This creates a critical problem: machine learning models optimize for overall accuracy, so they learn to predict the majority class exclusively.

The accuracy paradox illustrates this perfectly. A model that predicts “no fraud” for every transaction in a dataset with 1% fraud achieves 99% accuracy while being completely useless. The model has learned nothing except that guessing the majority class yields high accuracy scores.

Let’s examine a real imbalanced dataset:

import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import seaborn as sns

# Create a synthetic imbalanced dataset
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.95, 0.05],  # 95% class 0, 5% class 1
    random_state=42
)

# Check class distribution
unique, counts = np.unique(y, return_counts=True)
class_dist = pd.Series(counts, index=unique)
print("Class Distribution:")
print(class_dist)
print(f"\nImbalance Ratio: {counts[0]/counts[1]:.2f}:1")

# Visualize distribution
plt.figure(figsize=(8, 5))
sns.barplot(x=class_dist.index, y=class_dist.values)
plt.title('Class Distribution')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()

This dataset has a 19:1 imbalance ratio—typical for many real-world problems.

Evaluation Metrics for Imbalanced Data

Accuracy is actively harmful when evaluating imbalanced datasets. You need metrics that focus on minority class performance.

Precision measures how many predicted positives are actually positive. High precision means few false alarms.

Recall (sensitivity) measures how many actual positives you caught. High recall means you’re not missing many cases.

F1-score is the harmonic mean of precision and recall, providing a single metric that balances both concerns.

ROC-AUC measures the model’s ability to distinguish between classes across all classification thresholds, though it can be optimistic for severe imbalances.

PR-AUC (Precision-Recall AUC) is often more informative for imbalanced data because it focuses on the minority class.
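To make these definitions concrete, here is a minimal sketch computing the first three metrics by hand from a hypothetical confusion matrix (the counts are made up for illustration):

```python
# Hypothetical confusion matrix counts for a 1,000-sample test set
tp, fp, fn, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + fp + fn + tn)          # fraction of all predictions that are right
precision = tp / (tp + fp)                          # of predicted positives, how many are real
recall = tp / (tp + fn)                             # of actual positives, how many we caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.970 precision=0.800 recall=0.667 f1=0.727
```

Notice how 97% accuracy coexists with missing a third of the actual positives.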

Here’s how these metrics expose the accuracy paradox:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, roc_auc_score, classification_report)
from sklearn.dummy import DummyClassifier

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Baseline: Always predict majority class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_pred_dummy = dummy.predict(X_test)

# Simple logistic regression
lr = LogisticRegression(random_state=42, max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Compare metrics
print("=== Dummy Classifier (Always Majority) ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_dummy):.3f}")
print(f"Precision: {precision_score(y_test, y_pred_dummy, zero_division=0):.3f}")
print(f"Recall: {recall_score(y_test, y_pred_dummy):.3f}")
print(f"F1-Score: {f1_score(y_test, y_pred_dummy):.3f}")

print("\n=== Logistic Regression ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.3f}")
print(f"Precision: {precision_score(y_test, y_pred_lr):.3f}")
print(f"Recall: {recall_score(y_test, y_pred_lr):.3f}")
print(f"F1-Score: {f1_score(y_test, y_pred_lr):.3f}")
# ROC-AUC needs probability scores, not hard 0/1 predictions
print(f"ROC-AUC: {roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]):.3f}")

The dummy classifier achieves 95% accuracy but 0% recall—it never identifies the minority class. The F1-score correctly shows this as a failure.

Resampling Techniques

Resampling modifies your training data to balance class distributions. Oversampling duplicates or synthesizes minority class examples. Undersampling removes majority class examples. Hybrid approaches combine both.
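Before reaching for anything sophisticated, it helps to see the simplest form of oversampling once: duplicating randomly chosen minority rows until the classes match. This sketch uses toy data; imbalanced-learn's RandomOverSampler does essentially the same thing:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)  # 95:5 toy labels

# Duplicate randomly chosen minority rows until both classes have 95 samples
minority_idx = np.where(y == 1)[0]
extra = rng.choice(minority_idx, size=95 - 5, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print(Counter(y_bal))  # both classes now have 95 samples
```

The drawback is that exact copies teach the model nothing new, which is what motivates SMOTE below.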

SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic examples by interpolating between existing minority class samples. It's more sophisticated than simple duplication and less prone to the overfitting that exact copies encourage.
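The interpolation at SMOTE's core fits in a few lines: each synthetic point lands somewhere on the segment between a minority sample and one of its nearest minority neighbors. The vectors here are toy values for illustration; the real algorithm selects the neighbor via k-NN over the minority class:

```python
import numpy as np

rng = np.random.default_rng(42)
x_i = np.array([1.0, 2.0])        # a minority-class sample
x_nn = np.array([3.0, 4.0])       # one of its k nearest minority neighbors
lam = rng.random()                # interpolation factor in [0, 1)

x_new = x_i + lam * (x_nn - x_i)  # synthetic sample on the segment between them
print(x_new)
```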

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

print("Original distribution:", Counter(y_train))

# Apply SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print("After SMOTE:", Counter(y_train_smote))

# Train on resampled data
lr_smote = LogisticRegression(random_state=42, max_iter=1000)
lr_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = lr_smote.predict(X_test)

print("\n=== Logistic Regression with SMOTE ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_smote):.3f}")
print(f"Precision: {precision_score(y_test, y_pred_smote):.3f}")
print(f"Recall: {recall_score(y_test, y_pred_smote):.3f}")
print(f"F1-Score: {f1_score(y_test, y_pred_smote):.3f}")
# ROC-AUC needs probability scores, not hard 0/1 predictions
print(f"ROC-AUC: {roc_auc_score(y_test, lr_smote.predict_proba(X_test)[:, 1]):.3f}")

# Hybrid approach: SMOTE + Undersampling
from imblearn.pipeline import Pipeline as ImbPipeline

pipeline = ImbPipeline([
    ('smote', SMOTE(sampling_strategy=0.5, random_state=42)),
    ('undersample', RandomUnderSampler(sampling_strategy=0.8, random_state=42))
])

X_train_hybrid, y_train_hybrid = pipeline.fit_resample(X_train, y_train)
print("\nAfter Hybrid (SMOTE + Undersampling):", Counter(y_train_hybrid))

SMOTE typically improves recall significantly while maintaining reasonable precision. Undersampling risks losing valuable majority class information, so hybrid approaches often work best.

Class Weighting

Instead of resampling, class weighting assigns higher misclassification costs to minority class errors. Most scikit-learn classifiers support a class_weight parameter.

Setting class_weight='balanced' automatically adjusts weights inversely proportional to class frequencies: n_samples / (n_classes * np.bincount(y)).
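You can verify that formula directly on toy 95:5 labels and check it against scikit-learn's own `compute_class_weight`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([0] * 95 + [1] * 5)

# The 'balanced' heuristic: n_samples / (n_classes * class count)
manual = len(y_toy) / (2 * np.bincount(y_toy))
print(manual)  # class 0 gets ~0.53, class 1 gets 10.0 -> minority errors weigh ~19x more

# scikit-learn computes the same numbers
sk = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_toy)
print(sk)
```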

from sklearn.ensemble import RandomForestClassifier

# Logistic Regression with class weighting
lr_weighted = LogisticRegression(
    class_weight='balanced',
    random_state=42,
    max_iter=1000
)
lr_weighted.fit(X_train, y_train)
y_pred_weighted = lr_weighted.predict(X_test)

print("=== Logistic Regression with Class Weighting ===")
print(f"Precision: {precision_score(y_test, y_pred_weighted):.3f}")
print(f"Recall: {recall_score(y_test, y_pred_weighted):.3f}")
print(f"F1-Score: {f1_score(y_test, y_pred_weighted):.3f}")

# Random Forest with class weighting
rf_weighted = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42
)
rf_weighted.fit(X_train, y_train)
y_pred_rf_weighted = rf_weighted.predict(X_test)

print("\n=== Random Forest with Class Weighting ===")
print(f"Precision: {precision_score(y_test, y_pred_rf_weighted):.3f}")
print(f"Recall: {recall_score(y_test, y_pred_rf_weighted):.3f}")
print(f"F1-Score: {f1_score(y_test, y_pred_rf_weighted):.3f}")

Class weighting is elegant because it doesn’t modify your data. It’s particularly effective with logistic regression and tree-based models.

Algorithm-Level Approaches

Some algorithms are specifically designed for imbalanced data. BalancedRandomForestClassifier combines random undersampling with ensemble learning, training each tree on a balanced bootstrap sample. Threshold adjustment moves the probability cutoff used to convert predicted scores into class labels.

from imblearn.ensemble import BalancedRandomForestClassifier

# Balanced Random Forest
brf = BalancedRandomForestClassifier(
    n_estimators=100,
    random_state=42
)
brf.fit(X_train, y_train)
y_pred_brf = brf.predict(X_test)

print("=== Balanced Random Forest ===")
print(f"Precision: {precision_score(y_test, y_pred_brf):.3f}")
print(f"Recall: {recall_score(y_test, y_pred_brf):.3f}")
print(f"F1-Score: {f1_score(y_test, y_pred_brf):.3f}")

# Threshold adjustment
y_proba = lr.predict_proba(X_test)[:, 1]

# Find optimal threshold
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# thresholds has one fewer element than precision/recall, so drop the final point
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-10)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]

print(f"\n=== Threshold Adjustment ===")
print(f"Default threshold: 0.5")
print(f"Optimal threshold: {optimal_threshold:.3f}")

y_pred_adjusted = (y_proba >= optimal_threshold).astype(int)
print(f"Precision: {precision_score(y_test, y_pred_adjusted):.3f}")
print(f"Recall: {recall_score(y_test, y_pred_adjusted):.3f}")
print(f"F1-Score: {f1_score(y_test, y_pred_adjusted):.3f}")

Threshold adjustment is powerful because it lets you tune the precision-recall tradeoff after training. Lower thresholds increase recall (catch more positives) at the cost of precision. In practice, select the threshold on a held-out validation set rather than the test set, or the reported scores will be optimistically biased.

Practical Comparison and Best Practices

Let’s compare all techniques systematically:

from sklearn.model_selection import cross_val_score, StratifiedKFold

models = {
    'Baseline LR': LogisticRegression(random_state=42, max_iter=1000),
    'LR + Class Weight': LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000),
    'LR + SMOTE': LogisticRegression(random_state=42, max_iter=1000),
    'Balanced RF': BalancedRandomForestClassifier(n_estimators=100, random_state=42)
}

results = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    if 'SMOTE' in name:
        # Wrap SMOTE in an imblearn pipeline so resampling happens
        # inside each training fold, never on the validation fold
        pipe = ImbPipeline([
            ('smote', SMOTE(random_state=42)),
            ('classifier', model)
        ])
        f1_scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1')
    else:
        f1_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1')
    
    results.append({
        'Model': name,
        'Mean F1': f1_scores.mean(),
        'Std F1': f1_scores.std()
    })

results_df = pd.DataFrame(results)
print("\n=== Cross-Validation Results ===")
print(results_df.to_string(index=False))

When to use each technique:

  • Class weighting: Start here. It’s fast, requires no data modification, and works well with most algorithms.
  • SMOTE: Use when you have limited minority class data and need to generate synthetic examples. Avoid with high-dimensional data where interpolation becomes unreliable.
  • Undersampling: Use when you have massive datasets and can afford to discard majority class data. Often combined with oversampling.
  • Balanced algorithms: Use when standard algorithms fail despite weighting. BalancedRandomForest is particularly robust.
  • Threshold adjustment: Always explore this as a post-processing step. It’s free performance tuning.

Always use stratified cross-validation to ensure both classes appear in every fold. Monitor multiple metrics—optimizing F1-score often provides the best balance, but your business requirements might prioritize precision (fewer false alarms) or recall (catch every case).
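With toy 95:5 labels you can check that stratification does what's claimed, giving each fold's held-out split its fair share of minority samples:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.zeros((100, 1))  # features are irrelevant to how the split is made

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, test_idx in cv.split(X_toy, y_toy):
    # 5 minority samples across 5 folds -> exactly 1 per held-out split
    assert (y_toy[test_idx] == 1).sum() == 1
print("every fold's test split contains a minority sample")
```

A plain (non-stratified) KFold on the same labels can easily produce folds with zero minority samples, making recall undefined for that fold.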

The best approach depends on your specific problem, data characteristics, and business constraints. Start with class weighting, add SMOTE if needed, and use ensemble methods as your heavy artillery.
