How to Use SMOTE in Python

Key Insights

  • SMOTE generates synthetic minority class samples by interpolating between existing examples, which typically outperforms simple duplication-based oversampling by reducing overfitting and improving model generalization on imbalanced datasets.
  • Always apply SMOTE only to your training data after the train-test split to avoid data leakage—applying it beforehand inflates performance metrics and creates models that fail in production.
  • Standard accuracy is misleading for imbalanced datasets; use precision, recall, F1-score, and ROC-AUC to properly evaluate models trained with SMOTE, and consider that SMOTE isn’t always the best solution for every imbalance problem.

Understanding Class Imbalance and SMOTE

Class imbalance occurs when one class significantly outnumbers others in your dataset. In fraud detection, for example, legitimate transactions might outnumber fraudulent ones by 1000:1. This creates a problem: machine learning algorithms optimized for accuracy will simply predict the majority class and achieve high accuracy while completely failing to identify the minority class you actually care about.

Traditional oversampling duplicates minority class examples, which causes models to overfit on those exact samples. SMOTE (Synthetic Minority Over-sampling Technique) takes a smarter approach by creating synthetic examples. It selects a minority class sample, finds its k nearest neighbors, and generates new samples along the line segments connecting them. This introduces controlled variation that helps models generalize better.
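
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the core idea only, not imbalanced-learn's actual implementation (which also performs the k-nearest-neighbor search):

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sample(x, neighbor):
    """Create one synthetic point on the segment between x and a neighbor."""
    lam = rng.random()               # random position in [0, 1)
    return x + lam * (neighbor - x)  # linear interpolation

x = np.array([1.0, 2.0])         # a minority class sample
neighbor = np.array([3.0, 4.0])  # one of its k nearest minority neighbors
synthetic = smote_sample(x, neighbor)
print(synthetic)  # lies somewhere on the segment between x and neighbor
```

Because lam is drawn from [0, 1), every synthetic point lands on the line segment between the original sample and its neighbor, which is what introduces the controlled variation.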

Let’s visualize a typical imbalanced dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Create an imbalanced dataset
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95, 0.05],  # 95% class 0, 5% class 1
    random_state=42
)

# Visualize class distribution
unique, counts = np.unique(y, return_counts=True)
plt.bar(['Majority Class', 'Minority Class'], counts)
plt.ylabel('Number of Samples')
plt.title('Class Distribution Before SMOTE')
plt.show()

print(f"Majority class: {counts[0]} samples")
print(f"Minority class: {counts[1]} samples")
print(f"Imbalance ratio: {counts[0]/counts[1]:.1f}:1")

This creates a dataset with approximately 950 majority class samples and 50 minority class samples—a 19:1 imbalance that will cause most classifiers to ignore the minority class.
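
To see why this matters, compare against a trivial baseline. A classifier that always predicts the majority class scores around 95% accuracy on this dataset while catching zero minority samples (DummyClassifier is scikit-learn's built-in baseline estimator):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Same imbalanced dataset as above
X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.95, 0.05], random_state=42
)

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)  # high, roughly the majority fraction
rec = recall_score(y, y_pred)    # 0.0 -- not a single minority sample caught
print(f"Accuracy: {acc:.3f}, minority recall: {rec:.3f}")
```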

Installing and Importing Required Libraries

SMOTE is implemented in the imbalanced-learn library, which integrates seamlessly with scikit-learn. Install it with pip:

pip install imbalanced-learn scikit-learn pandas numpy matplotlib

Import the necessary components:

from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import pandas as pd
import numpy as np

Note that imbalanced-learn provides its own Pipeline class that works with resampling techniques—use imblearn.pipeline.Pipeline instead of sklearn.pipeline.Pipeline when incorporating SMOTE.

Applying Basic SMOTE

The basic SMOTE implementation is straightforward. You instantiate the SMOTE object and call fit_resample() on your features and labels:

from imblearn.over_sampling import SMOTE
from collections import Counter

# Using the imbalanced dataset from earlier
print("Original dataset distribution:")
print(Counter(y))

# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("\nResampled dataset distribution:")
print(Counter(y_resampled))

# Visualize the synthetic samples
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[y==0][:, 0], X[y==0][:, 1], label='Majority', alpha=0.5)
plt.scatter(X[y==1][:, 0], X[y==1][:, 1], label='Minority', alpha=0.5)
plt.title('Original Dataset')
plt.legend()

plt.subplot(1, 2, 2)
plt.scatter(X_resampled[y_resampled==0][:, 0],
            X_resampled[y_resampled==0][:, 1],
            label='Majority', alpha=0.5)
plt.scatter(X_resampled[y_resampled==1][:, 0],
            X_resampled[y_resampled==1][:, 1],
            label='Minority (with synthetic)', alpha=0.5)
plt.title('After SMOTE')
plt.legend()
plt.tight_layout()
plt.show()

By default, SMOTE balances the classes completely. The minority class now has the same number of samples as the majority class.

SMOTE Variants and Parameters

SMOTE has several important parameters and variants that address different scenarios:

Key Parameters:

  • sampling_strategy: Controls the resampling ratio. Default is 'auto' (fully balance the classes), but you can pass a float (e.g., 0.5 to grow the minority class to half the size of the majority class) or a dict for multi-class problems.
  • k_neighbors: Number of nearest neighbors to use for generating synthetic samples. Default is 5. Lower values create samples closer to existing ones; higher values increase diversity but may introduce noise.

Variants:

from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE, ADASYN

# Standard SMOTE
smote_regular = SMOTE(random_state=42)
X_smote, y_smote = smote_regular.fit_resample(X, y)

# BorderlineSMOTE - focuses on samples near decision boundary
smote_borderline = BorderlineSMOTE(random_state=42)
X_borderline, y_borderline = smote_borderline.fit_resample(X, y)

# SVMSMOTE - uses SVM to identify boundary samples
smote_svm = SVMSMOTE(random_state=42)
X_svm, y_svm = smote_svm.fit_resample(X, y)

# ADASYN - adaptively generates more samples for harder-to-learn examples
adasyn = ADASYN(random_state=42)
X_adasyn, y_adasyn = adasyn.fit_resample(X, y)

# Compare sampling strategies
smote_partial = SMOTE(sampling_strategy=0.5, random_state=42)
X_partial, y_partial = smote_partial.fit_resample(X, y)

print(f"Full SMOTE: {Counter(y_smote)}")
print(f"Partial SMOTE (0.5): {Counter(y_partial)}")
print(f"BorderlineSMOTE: {Counter(y_borderline)}")

BorderlineSMOTE often outperforms standard SMOTE when misclassifications cluster near the decision boundary, because it concentrates synthetic samples there, though no variant wins on every dataset. ADASYN is useful when some minority samples are harder to learn than others.

Integration with Machine Learning Pipelines

The most critical aspect of using SMOTE is applying it correctly within your machine learning workflow. SMOTE must only be applied to training data, never to the entire dataset before splitting:

from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split data FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Create pipeline with SMOTE
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit pipeline (SMOTE only affects training data)
pipeline.fit(X_train, y_train)

# Predict on untouched test data
y_pred = pipeline.predict(X_test)

print("Test set predictions completed without data leakage")

This approach ensures that your test set remains completely independent. The SMOTE transformation occurs inside the pipeline, affecting only the training data during the fit process.

Evaluating Model Performance

Accuracy is meaningless for imbalanced datasets. A model that predicts all samples as the majority class achieves high accuracy but zero utility. Use these metrics instead:

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Train model without SMOTE
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model_no_smote = RandomForestClassifier(random_state=42)
model_no_smote.fit(X_train, y_train)
y_pred_no_smote = model_no_smote.predict(X_test)
y_proba_no_smote = model_no_smote.predict_proba(X_test)[:, 1]

# Train model with SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
model_with_smote = RandomForestClassifier(random_state=42)
model_with_smote.fit(X_train_smote, y_train_smote)
y_pred_with_smote = model_with_smote.predict(X_test)
y_proba_with_smote = model_with_smote.predict_proba(X_test)[:, 1]

# Compare performance (pass probabilities, not hard labels, to roc_auc_score)
print("WITHOUT SMOTE:")
print(classification_report(y_test, y_pred_no_smote))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba_no_smote):.3f}\n")

print("WITH SMOTE:")
print(classification_report(y_test, y_pred_with_smote))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba_with_smote):.3f}\n")

# Confusion matrices
print("Confusion Matrix (No SMOTE):")
print(confusion_matrix(y_test, y_pred_no_smote))
print("\nConfusion Matrix (With SMOTE):")
print(confusion_matrix(y_test, y_pred_with_smote))

Focus on recall for the minority class (how many actual positives you caught) and the F1-score (harmonic mean of precision and recall). ROC-AUC gives you a single metric for overall performance across different classification thresholds.

Best Practices and Common Pitfalls

Never apply SMOTE before splitting your data. This is the most common and dangerous mistake. Here’s why:

# WRONG - Data leakage example
smote_wrong = SMOTE(random_state=42)
X_wrong, y_wrong = smote_wrong.fit_resample(X, y)
X_train_wrong, X_test_wrong, y_train_wrong, y_test_wrong = train_test_split(
    X_wrong, y_wrong, test_size=0.3, random_state=42
)
# Your test set now contains synthetic samples derived from training data!

# CORRECT - No data leakage
X_train_right, X_test_right, y_train_right, y_test_right = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
smote_right = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote_right.fit_resample(X_train_right, y_train_right)
# Test set remains pristine

When to use SMOTE:

  • You have a genuine class imbalance (not just slightly unbalanced)
  • You have enough minority class samples (at least 50-100) for meaningful interpolation
  • Your features are continuous or mixed (SMOTE works poorly with purely categorical data)
  • You’ve tried adjusting class weights first and need better results

When to avoid SMOTE:

  • Your dataset is small (SMOTE can’t create information that doesn’t exist)
  • You have extreme outliers in the minority class (SMOTE will amplify them)
  • You’re using algorithms that handle imbalance well (e.g., XGBoost with scale_pos_weight)
  • Your problem requires high precision (SMOTE may increase false positives)

Alternative approaches to consider:

  • Adjusting class weights in your classifier (class_weight='balanced')
  • Using ensemble methods specifically designed for imbalanced data
  • Collecting more minority class samples if possible
  • Combining SMOTE with undersampling (SMOTEENN, SMOTETomek)

SMOTE is a powerful technique, but it’s not magic. It creates synthetic data based on existing patterns, so if your minority class samples don’t contain enough information to distinguish them from the majority class, SMOTE won’t help. Always validate that SMOTE improves your specific use case through cross-validation on the training data plus a final evaluation on a held-out test set.
