How to Handle Imbalanced Classes in R

Key Insights

  • Class imbalance breaks standard accuracy metrics—a model predicting only the majority class can achieve 95%+ accuracy while being completely useless for the minority class you actually care about.
  • SMOTE and class weighting are your first-line defenses: SMOTE generates synthetic minority examples while class weighting penalizes misclassifying rare cases without changing your data.
  • Always evaluate imbalanced models with precision-recall curves and F1-scores, not accuracy—ROC-AUC can be misleading when one class dominates, but PR-AUC tells the real story.

Understanding Class Imbalance

Class imbalance occurs when your target variable has significantly unequal representation across categories. In fraud detection, legitimate transactions might outnumber fraudulent ones 1000:1. In disease diagnosis, healthy patients vastly outnumber sick ones. This imbalance causes standard machine learning algorithms to develop a lazy bias toward the majority class—why bother learning complex patterns when you can achieve 99% accuracy by always predicting “not fraud”?

The accuracy paradox illustrates this perfectly. A naive model that always predicts the majority class achieves high accuracy while providing zero business value. If 98% of transactions are legitimate, a model that flags nothing as fraud gets 98% accuracy but catches zero fraudsters.
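To make the paradox concrete, here is a minimal standalone sketch (toy labels, not the credit data built below): a "model" that always predicts the majority class scores 98% accuracy with zero recall.

```r
# Toy labels: 980 legitimate (0), 20 fraudulent (1)
y_true <- rep(c(0, 1), times = c(980, 20))

# "Model" that always predicts the majority class
y_pred <- rep(0, length(y_true))

accuracy <- mean(y_pred == y_true)             # 0.98
recall   <- sum(y_pred == 1 & y_true == 1) /
            sum(y_true == 1)                   # 0 -- no fraud caught

cat(sprintf("Accuracy: %.2f, Recall: %.2f\n", accuracy, recall))
```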

Let’s examine a credit card fraud dataset:

library(dplyr)
library(ggplot2)

# Load credit card fraud data
# Using a simulated dataset for demonstration
set.seed(42)
n_total <- 10000
fraud_rate <- 0.02

credit_data <- data.frame(
  amount = c(rnorm(n_total * (1-fraud_rate), 50, 20),
             rnorm(n_total * fraud_rate, 200, 100)),
  time = runif(n_total, 0, 24),
  fraud = c(rep(0, n_total * (1-fraud_rate)), 
            rep(1, n_total * fraud_rate))
)

# Check class distribution
table(credit_data$fraud)
prop.table(table(credit_data$fraud))

# Visualize imbalance
ggplot(credit_data, aes(x = factor(fraud), fill = factor(fraud))) +
  geom_bar() +
  scale_fill_manual(values = c("steelblue", "coral"), guide = "none") +
  labs(title = "Class Distribution", 
       x = "Fraud (0=No, 1=Yes)", 
       y = "Count") +
  theme_minimal()

This 98:2 ratio represents a severe imbalance that will cripple standard classifiers.

Evaluation Metrics for Imbalanced Data

Accuracy is worthless here. Instead, focus on metrics that explicitly measure performance on the minority class:

  • Precision: Of predicted positives, how many are actually positive? Critical when false positives are costly.
  • Recall (Sensitivity): Of actual positives, how many did we catch? Essential when missing positives is dangerous.
  • F1-Score: Harmonic mean of precision and recall, balancing both concerns.
  • ROC-AUC: Area under the receiver operating characteristic curve, though less informative with severe imbalance.
  • PR-AUC: Area under precision-recall curve, superior for imbalanced datasets.

library(caret)
library(pROC)

# Split data (stratify on the class as a factor so both sets keep the 98:2 ratio;
# with a numeric 0/1 target, createDataPartition's quantile grouping may not stratify)
set.seed(123)
train_idx <- createDataPartition(factor(credit_data$fraud), p = 0.7, list = FALSE)
train_data <- credit_data[train_idx, ]
test_data <- credit_data[-train_idx, ]

# Train a baseline logistic regression
model_baseline <- glm(fraud ~ amount + time, 
                      data = train_data, 
                      family = binomial)

# Predictions
pred_prob <- predict(model_baseline, test_data, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# Confusion matrix -- pin the factor levels so the matrix is well-formed
# even if the model never predicts one of the classes
conf_matrix <- confusionMatrix(factor(pred_class, levels = c(0, 1)), 
                               factor(test_data$fraud, levels = c(0, 1)), 
                               positive = "1")
print(conf_matrix)

# Calculate metrics manually
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Recall"]
f1 <- conf_matrix$byClass["F1"]

cat(sprintf("Precision: %.3f\nRecall: %.3f\nF1-Score: %.3f\n", 
            precision, recall, f1))

# ROC and PR curves
roc_obj <- roc(test_data$fraud, pred_prob)
plot(roc_obj, main = paste("ROC Curve (AUC =", round(auc(roc_obj), 3), ")"))

# PR curve
library(PRROC)
pr_obj <- pr.curve(scores.class0 = pred_prob[test_data$fraud == 1],
                   scores.class1 = pred_prob[test_data$fraud == 0],
                   curve = TRUE)
plot(pr_obj)

Notice how ROC-AUC might look decent (>0.8) even when the model performs poorly on the minority class. PR-AUC provides a more honest assessment.

Resampling Techniques

Resampling modifies your training data to balance class representation:

Undersampling randomly removes majority class examples. Fast but discards potentially useful data.

Oversampling duplicates minority class examples. Simple but risks overfitting.
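Both random approaches need only base R. A minimal sketch, reusing the train_data frame built above:

```r
set.seed(99)
minority <- train_data[train_data$fraud == 1, ]
majority <- train_data[train_data$fraud == 0, ]

# Random undersampling: keep all minority rows, shrink the majority to match
under_idx   <- sample(nrow(majority), nrow(minority))
train_under <- rbind(minority, majority[under_idx, ])

# Random oversampling: duplicate minority rows until the classes match
over_idx   <- sample(nrow(minority), nrow(majority), replace = TRUE)
train_over <- rbind(majority, minority[over_idx, ])

table(train_under$fraud)  # balanced, but small
table(train_over$fraud)   # balanced, but with repeated minority rows
```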

SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic minority examples by interpolating between existing ones. This is the gold standard for many applications.
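The interpolation at SMOTE's core is simple: pick a minority point, pick one of its k nearest minority neighbors, and place a synthetic point a random fraction of the way between them. A base-R sketch of one synthetic example (the two feature vectors are hypothetical, and this is an illustration of the idea, not the library's implementation):

```r
x        <- c(amount = 210, time = 3.1)  # a minority example
neighbor <- c(amount = 185, time = 2.4)  # one of its k nearest minority neighbors

gap       <- runif(1)                    # random fraction in [0, 1]
synthetic <- x + gap * (neighbor - x)    # lies on the segment between them
synthetic
```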

library(smotefamily)

# Apply SMOTE
train_features <- train_data[, c("amount", "time")]
train_labels <- train_data$fraud

smote_result <- SMOTE(train_features, train_labels, K = 5, dup_size = 10)
train_smote <- smote_result$data
# SMOTE returns the label as a character column named "class"; convert it
# back to numeric 0/1 so glm() accepts it as a binomial response
train_smote$fraud <- as.numeric(train_smote$class)
train_smote$class <- NULL

# Check new distribution
table(train_smote$fraud)
prop.table(table(train_smote$fraud))

# Train model on SMOTE data
model_smote <- glm(fraud ~ amount + time, 
                   data = train_smote, 
                   family = binomial)

pred_prob_smote <- predict(model_smote, test_data, type = "response")
pred_class_smote <- ifelse(pred_prob_smote > 0.5, 1, 0)

# Evaluate
conf_matrix_smote <- confusionMatrix(factor(pred_class_smote), 
                                     factor(test_data$fraud), 
                                     positive = "1")
print(conf_matrix_smote$byClass[c("Precision", "Recall", "F1")])

SMOTE typically improves recall substantially—you’ll catch more fraud cases—though precision may decrease slightly as you’ll also flag more false positives.

Algorithm-Level Approaches

Rather than changing your data, adjust how algorithms learn from it. Class weights penalize misclassifying minority examples more heavily.

# Calculate class weights (inverse of frequency)
class_weights <- 1 / prop.table(table(train_data$fraud))
weights <- ifelse(train_data$fraud == 1, 
                  class_weights["1"], 
                  class_weights["0"])

# Weighted logistic regression (glm warns about "non-integer #successes"
# with fractional case weights; the fitted coefficients are still valid)
model_weighted <- glm(fraud ~ amount + time, 
                      data = train_data, 
                      family = binomial,
                      weights = weights)

# For random forest
library(randomForest)
model_rf_weighted <- randomForest(factor(fraud) ~ amount + time,
                                  data = train_data,
                                  classwt = c("0" = 1, "1" = 50),
                                  ntree = 500)

# For XGBoost
library(xgboost)
train_matrix <- xgb.DMatrix(
  data = as.matrix(train_data[, c("amount", "time")]),
  label = train_data$fraud
)

# Calculate scale_pos_weight
scale_pos_weight <- sum(train_data$fraud == 0) / sum(train_data$fraud == 1)

model_xgb <- xgboost(
  data = train_matrix,
  max_depth = 3,
  eta = 0.3,
  nrounds = 100,
  objective = "binary:logistic",
  scale_pos_weight = scale_pos_weight,
  verbose = 0
)

# Evaluate weighted model
pred_prob_weighted <- predict(model_weighted, test_data, type = "response")
pred_class_weighted <- ifelse(pred_prob_weighted > 0.5, 1, 0)

conf_matrix_weighted <- confusionMatrix(factor(pred_class_weighted), 
                                        factor(test_data$fraud), 
                                        positive = "1")
print(conf_matrix_weighted$byClass[c("Precision", "Recall", "F1")])

Class weighting is computationally efficient and doesn’t artificially inflate your dataset size. It’s my go-to first approach.

Ensemble and Hybrid Methods

Balanced bagging combines ensemble methods with resampling. Each tree in your forest trains on a balanced bootstrap sample.

library(ranger)

# Balanced random forest using ranger: a class-wise sample.fraction
# (fractions of total n, in factor-level order) draws roughly equal
# numbers of each class for every tree
minority_frac <- mean(train_data$fraud == 1)
model_balanced_rf <- ranger(
  factor(fraud) ~ amount + time,
  data = train_data,
  num.trees = 500,
  sample.fraction = c(minority_frac, minority_frac),
  replace = TRUE,
  class.weights = c(1, 50)
)

# Predictions
pred_balanced_rf <- predict(model_balanced_rf, test_data)
conf_matrix_balanced <- confusionMatrix(pred_balanced_rf$predictions, 
                                        factor(test_data$fraud), 
                                        positive = "1")
print(conf_matrix_balanced$byClass[c("Precision", "Recall", "F1")])

# Hybrid: SMOTE + Random Forest
model_hybrid <- randomForest(factor(fraud) ~ amount + time,
                             data = train_smote,
                             ntree = 500,
                             classwt = c("0" = 1, "1" = 5))

pred_hybrid <- predict(model_hybrid, test_data)
conf_matrix_hybrid <- confusionMatrix(pred_hybrid, 
                                      factor(test_data$fraud), 
                                      positive = "1")
print(conf_matrix_hybrid$byClass[c("Precision", "Recall", "F1")])

Practical Comparison and Best Practices

Let’s compare all approaches systematically:

# Function to evaluate models
evaluate_model <- function(predictions, actual, method_name) {
  # Pin the factor levels so every method yields a 2x2 confusion matrix
  cm <- confusionMatrix(factor(predictions, levels = c(0, 1)), 
                        factor(actual, levels = c(0, 1)), 
                        positive = "1")
  data.frame(
    Method = method_name,
    Precision = cm$byClass["Precision"],
    Recall = cm$byClass["Recall"],
    F1 = cm$byClass["F1"],
    Accuracy = cm$overall["Accuracy"]
  )
}

# Collect results
results <- rbind(
  evaluate_model(pred_class, test_data$fraud, "Baseline"),
  evaluate_model(pred_class_smote, test_data$fraud, "SMOTE"),
  evaluate_model(pred_class_weighted, test_data$fraud, "Class Weights"),
  evaluate_model(pred_balanced_rf$predictions, test_data$fraud, "Balanced RF"),
  evaluate_model(pred_hybrid, test_data$fraud, "SMOTE + RF")
)

print(results)

# Visualize comparison
library(tidyr)
results_long <- results %>%
  pivot_longer(cols = c(Precision, Recall, F1), 
               names_to = "Metric", 
               values_to = "Value")

ggplot(results_long, aes(x = Method, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Performance Comparison Across Methods",
       y = "Score")

Decision Framework:

  • Start with class weights: Computationally cheap, no data inflation, works well with most algorithms.
  • Add SMOTE if recall is insufficient: When you absolutely must catch more minority cases and can tolerate more false positives.
  • Use ensemble methods for production: Random forests with balanced sampling provide robust performance with less hyperparameter sensitivity.
  • Combine techniques cautiously: SMOTE + class weights can be powerful but risks overfitting. Always validate on held-out data.
  • Optimize threshold: Don’t default to 0.5. Plot precision-recall curves and choose a threshold that matches your business requirements.
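Threshold tuning takes only a few lines. A sketch using the baseline model's predicted probabilities from above (pred_prob and test_data), scanning thresholds for the best F1:

```r
thresholds <- seq(0.01, 0.99, by = 0.01)

f1_at <- function(t) {
  pred <- as.numeric(pred_prob > t)
  tp <- sum(pred == 1 & test_data$fraud == 1)
  fp <- sum(pred == 1 & test_data$fraud == 0)
  fn <- sum(pred == 0 & test_data$fraud == 1)
  if (tp == 0) return(0)
  prec <- tp / (tp + fp)
  rec  <- tp / (tp + fn)
  2 * prec * rec / (prec + rec)
}

f1_scores <- sapply(thresholds, f1_at)
best <- thresholds[which.max(f1_scores)]
cat(sprintf("Best threshold by F1: %.2f (F1 = %.3f)\n", best, max(f1_scores)))
```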

The “best” technique depends on your cost function. In fraud detection, missing fraud (false negatives) might cost $1000 per case while false alarms (false positives) cost $5 in review time. Optimize for your specific economics, not generic metrics.
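With those example costs ($1000 per missed fraud, $5 per false alarm, both illustrative), you can choose the threshold that minimizes expected cost directly, again reusing pred_prob and test_data from above:

```r
cost_fn <- 1000  # cost of a missed fraud (false negative)
cost_fp <- 5     # cost of a needless review (false positive)

expected_cost <- function(t) {
  pred <- as.numeric(pred_prob > t)
  fn <- sum(pred == 0 & test_data$fraud == 1)
  fp <- sum(pred == 1 & test_data$fraud == 0)
  fn * cost_fn + fp * cost_fp
}

thresholds <- seq(0.01, 0.99, by = 0.01)
costs <- sapply(thresholds, expected_cost)
best_t <- thresholds[which.min(costs)]
cat(sprintf("Cost-optimal threshold: %.2f (total cost: $%.0f)\n",
            best_t, min(costs)))
```

Because false negatives are 200x more expensive here, the cost-optimal threshold will generally sit well below 0.5, trading extra review work for fewer missed fraud cases.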
