How to Create a Confusion Matrix in R

Key Insights

  • Confusion matrices provide the foundation for all classification metrics by showing true positives, true negatives, false positives, and false negatives in a simple table format
  • The caret package’s confusionMatrix() function automatically calculates a full suite of performance metrics, including sensitivity, specificity, and balanced accuracy, saving significant manual calculation effort
  • Visualizing confusion matrices as heatmaps reveals misclassification patterns that raw numbers obscure, especially critical when dealing with imbalanced datasets or multi-class problems

Introduction to Confusion Matrices

A confusion matrix is a table that summarizes how well your classification model performs by comparing predicted values against actual values. Every prediction falls into one of four categories: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Understanding these four values unlocks every classification metric you’ll ever need.

True positives are instances your model correctly identified as positive. True negatives are correctly identified negatives. False positives are negatives incorrectly labeled as positive (Type I errors). False negatives are positives incorrectly labeled as negative (Type II errors). The confusion matrix arranges these values in a 2x2 grid that makes model performance immediately apparent.

Let’s work with a practical example using a binary classification problem. We’ll predict whether customers will churn based on their behavior:

# Load required libraries
library(caret)
library(ggplot2)

# Create a sample customer churn dataset
set.seed(123)
n <- 500
customer_data <- data.frame(
  tenure = rnorm(n, 30, 15),
  monthly_charges = rnorm(n, 65, 20),
  total_charges = rnorm(n, 2000, 1000),
  support_calls = rpois(n, 2)
)

# Generate churn outcome with some relationship to features
churn_prob <- plogis(-2 + 0.05 * customer_data$support_calls - 
                     0.02 * customer_data$tenure)
customer_data$churn <- factor(rbinom(n, 1, churn_prob), 
                              levels = c(0, 1), 
                              labels = c("No", "Yes"))

# Split into training and test sets
train_idx <- createDataPartition(customer_data$churn, p = 0.7, list = FALSE)
train_data <- customer_data[train_idx, ]
test_data <- customer_data[-train_idx, ]

Building a Simple Classification Model

Before creating a confusion matrix, we need predictions from a classification model. We’ll use logistic regression for its simplicity and interpretability:

# Train logistic regression model
model <- glm(churn ~ tenure + monthly_charges + support_calls, 
             data = train_data, 
             family = binomial)

# Generate predictions on test data
predicted_probs <- predict(model, newdata = test_data, type = "response")

# Convert probabilities to class predictions (threshold = 0.5)
predicted_classes <- factor(ifelse(predicted_probs > 0.5, "Yes", "No"),
                           levels = c("No", "Yes"))

# Extract actual values
actual_classes <- test_data$churn

The threshold of 0.5 is standard but not always optimal. For imbalanced datasets or when false positives and false negatives have different costs, you’ll want to adjust this threshold based on business requirements.
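As a quick sketch of what adjusting the threshold does, the snippet below lowers the cutoff so more customers are flagged as churners. The 0.3 value is hypothetical, not a recommendation; predicted_probs and test_data come from the code above:

```r
# Sketch: a lower threshold flags more customers as churners.
# Assumes predicted_probs and test_data from the code above;
# 0.3 is a hypothetical cutoff chosen for illustration.
low_threshold <- 0.3
pred_low <- factor(ifelse(predicted_probs > low_threshold, "Yes", "No"),
                   levels = c("No", "Yes"))

# Compared with the 0.5 cutoff, expect more predicted "Yes":
# fewer false negatives, at the cost of more false positives
table(Actual = test_data$churn, Predicted = pred_low)
```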

Creating a Confusion Matrix with Base R

The simplest approach uses base R’s table() function. This creates a cross-tabulation of actual versus predicted values:

# Create basic confusion matrix
cm_base <- table(Actual = actual_classes, Predicted = predicted_classes)
print(cm_base)

#         Predicted
# Actual   No Yes
#    No   95  15
#    Yes  22  18

# Calculate accuracy manually
accuracy <- sum(diag(cm_base)) / sum(cm_base)
print(paste("Accuracy:", round(accuracy, 3)))

The diagonal elements (top-left and bottom-right) represent correct predictions. Off-diagonal elements are errors. This matrix shows 95 true negatives (correctly predicted non-churners), 18 true positives (correctly predicted churners), 15 false positives (incorrectly predicted churners), and 22 false negatives (missed churners).

Manual calculation of metrics from this matrix is straightforward but tedious:

# Extract values
TN <- cm_base[1, 1]  # True Negatives
FP <- cm_base[1, 2]  # False Positives
FN <- cm_base[2, 1]  # False Negatives
TP <- cm_base[2, 2]  # True Positives

# Calculate metrics
sensitivity <- TP / (TP + FN)  # Also called recall or true positive rate
specificity <- TN / (TN + FP)  # True negative rate
precision <- TP / (TP + FP)    # Positive predictive value

print(paste("Sensitivity:", round(sensitivity, 3)))
print(paste("Specificity:", round(specificity, 3)))
print(paste("Precision:", round(precision, 3)))

Using the caret Package

The caret package’s confusionMatrix() function automates all these calculations and provides extensive additional metrics:

# Create comprehensive confusion matrix
cm_caret <- confusionMatrix(data = predicted_classes, 
                           reference = actual_classes,
                           positive = "Yes")
print(cm_caret)

This function outputs the confusion matrix plus accuracy, sensitivity, specificity, positive predictive value (precision), negative predictive value, prevalence, detection rate, detection prevalence, balanced accuracy, and more. The positive parameter specifies which class is considered the positive case—critical for correct metric interpretation.
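For example, the overall accuracy, its 95% confidence interval, and kappa can be read directly from the object's overall component:

```r
# Accuracy with its confidence interval, plus kappa,
# from the caret object created above
cm_caret$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper", "Kappa")]
```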

Pay special attention to these metrics:

  • Sensitivity (Recall): Of all actual churners, what percentage did we catch? Critical when missing positives is costly.
  • Specificity: Of all actual non-churners, what percentage did we correctly identify? Important when false alarms are expensive.
  • Balanced Accuracy: Average of sensitivity and specificity, useful for imbalanced datasets where overall accuracy is misleading.
  • Kappa: Agreement between predictions and actuals adjusted for chance agreement; ranges from -1 to 1.

Individual metrics can be pulled out of the byClass component:

# Extract specific metrics
metrics <- cm_caret$byClass
cat("Sensitivity:", round(metrics["Sensitivity"], 3), "\n")
cat("Specificity:", round(metrics["Specificity"], 3), "\n")
cat("F1 Score:", round(metrics["F1"], 3), "\n")
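To demystify two of those numbers, balanced accuracy and kappa can be reproduced by hand from the base-R table. This sketch assumes cm_base, sensitivity, and specificity from the earlier manual calculations:

```r
# Balanced accuracy: the mean of sensitivity and specificity
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
p_observed <- sum(diag(cm_base)) / sum(cm_base)
p_chance <- sum(rowSums(cm_base) * colSums(cm_base)) / sum(cm_base)^2
kappa <- (p_observed - p_chance) / (1 - p_chance)

round(c(balanced_accuracy = balanced_accuracy, kappa = kappa), 3)
```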

Visualizing the Confusion Matrix

Numbers are precise, but visualizations reveal patterns. A heatmap makes misclassification patterns immediately obvious:

# Prepare data for visualization
cm_df <- as.data.frame(cm_base)
colnames(cm_df) <- c("Actual", "Predicted", "Frequency")

# Create heatmap
ggplot(cm_df, aes(x = Predicted, y = Actual, fill = Frequency)) +
  geom_tile(color = "white", linewidth = 1.5) +
  geom_text(aes(label = Frequency), size = 8, color = "white") +
  scale_fill_gradient(low = "#3575b5", high = "#d73027") +
  labs(title = "Confusion Matrix - Customer Churn Prediction",
       x = "Predicted Class",
       y = "Actual Class") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12, face = "bold"))

For a more polished visualization with percentage annotations, use the cvms package:

library(cvms)

# Compute the confusion matrix with cvms
cm_cvms <- confusion_matrix(targets = actual_classes,
                            predictions = predicted_classes)

# Plot with counts, normalized percentages, and row/column percentages
plot_confusion_matrix(cm_cvms$`Confusion Matrix`[[1]],
                      add_normalized = TRUE,
                      add_row_percentages = TRUE,
                      add_col_percentages = TRUE)

Interpreting Metrics and Best Practices

Different classification problems require different metric priorities. For medical diagnosis, high sensitivity prevents missing sick patients. For spam filtering, high precision prevents legitimate emails from being blocked. The F1-score balances both:

# Calculate F1-score manually
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1_score <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", round(precision, 3), "\n")
cat("Recall:", round(recall, 3), "\n")
cat("F1-Score:", round(f1_score, 3), "\n")

For multi-class problems, confusion matrices expand beyond 2x2. Here’s an example with the iris dataset:

# Multi-class example
data(iris)
set.seed(456)

train_idx_multi <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_iris <- iris[train_idx_multi, ]
test_iris <- iris[-train_idx_multi, ]

# Train random forest model
library(randomForest)
model_multi <- randomForest(Species ~ ., data = train_iris)

# Predictions
pred_multi <- predict(model_multi, test_iris)

# Multi-class confusion matrix
cm_multi <- confusionMatrix(pred_multi, test_iris$Species)
print(cm_multi$table)

# Visualize multi-class matrix
cm_multi_df <- as.data.frame(cm_multi$table)

ggplot(cm_multi_df, aes(x = Prediction, y = Reference, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 6) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Multi-class Confusion Matrix - Iris Species") +
  theme_minimal()

Multi-class matrices require per-class metrics. The caret package calculates sensitivity and specificity for each class using one-vs-all comparisons. Check cm_multi$byClass for these breakdowns.
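A quick way to inspect those one-vs-all breakdowns, using the cm_multi object from above:

```r
# Per-class sensitivity, specificity, and balanced accuracy;
# rows are named "Class: setosa", "Class: versicolor", "Class: virginica"
round(cm_multi$byClass[, c("Sensitivity", "Specificity", "Balanced Accuracy")], 3)
```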

Key best practices:

  1. Always specify the positive class explicitly in confusionMatrix() to avoid metric misinterpretation
  2. For imbalanced datasets, prioritize balanced accuracy, F1-score, or Matthews correlation coefficient over raw accuracy
  3. Visualize confusion matrices when presenting to non-technical stakeholders—heatmaps communicate performance instantly
  4. Calculate confidence intervals for metrics when sample sizes are small using confusionMatrix()’s built-in intervals
  5. Store confusion matrices for different threshold values to create ROC curves and find optimal operating points

Confusion matrices are the foundation of classification evaluation. Master their creation, interpretation, and visualization in R, and you’ll diagnose model weaknesses and communicate performance effectively to any audience.
