How to Create a Confusion Matrix in Python
Key Insights
- Confusion matrices reveal the complete picture of classifier performance by showing not just accuracy but the specific types of errors your model makes—critical for imbalanced datasets where a 95% accurate model might miss every minority class instance.
- Scikit-learn provides three approaches to confusion matrices: the raw confusion_matrix() function for calculations, ConfusionMatrixDisplay for quick visualization, and classification_report() for derived metrics—use them in combination for comprehensive model evaluation.
- Multi-class confusion matrices expose per-class weaknesses that aggregate metrics hide; a model with 80% overall accuracy might have 95% accuracy on common classes but only 40% on rare ones, making the matrix essential for production readiness assessment.
Introduction to Confusion Matrices
A confusion matrix is a table that describes the complete performance of a classification model by comparing predicted labels against actual labels. Unlike simple accuracy scores that hide critical information, confusion matrices expose exactly where and how your model fails.
For binary classification, the matrix has four components:
- True Positives (TP): Correctly predicted positive cases
- True Negatives (TN): Correctly predicted negative cases
- False Positives (FP): Negative cases incorrectly predicted as positive (Type I error)
- False Negatives (FN): Positive cases incorrectly predicted as negative (Type II error)
This breakdown matters enormously in real applications. A medical diagnosis model with 95% accuracy sounds impressive until you discover it achieves this by predicting “healthy” for everyone, missing all disease cases. The confusion matrix would immediately reveal this failure through its false negative cell, which would contain every disease case.
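To make the four components concrete, they can be tallied directly from label arrays with boolean masks; this minimal sketch uses small synthetic labels chosen purely for illustration:

```python
import numpy as np

# Tiny synthetic example: 1 = positive, 0 = negative
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # correctly predicted positives
tn = np.sum((y_true == 0) & (y_pred == 0))  # correctly predicted negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # Type I errors
fn = np.sum((y_true == 1) & (y_pred == 0))  # Type II errors

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

These four counts are exactly what scikit-learn's confusion_matrix() arranges into its 2x2 grid.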
Setting Up Your Environment
Install the required libraries for creating and visualizing confusion matrices:
pip install scikit-learn matplotlib seaborn numpy pandas
For this tutorial, we’ll use the breast cancer dataset from scikit-learn, which provides a realistic binary classification scenario:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Scale features for better model performance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print(f"Class distribution: {np.bincount(y_test)}")
Creating a Basic Confusion Matrix with Scikit-learn
Train a classifier and generate predictions:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# Train classifier
clf = LogisticRegression(max_iter=10000, random_state=42)
clf.fit(X_train_scaled, y_train)
# Generate predictions
y_pred = clf.predict(X_test_scaled)
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
The output is a 2x2 NumPy array for binary classification:
[[ 59   4]
 [  3 105]]
Reading this matrix: rows represent actual classes, columns represent predictions. The top-left (59) shows true negatives, bottom-right (105) shows true positives, top-right (4) shows false positives, and bottom-left (3) shows false negatives.
You can extract individual values programmatically:
tn, fp, fn, tp = cm.ravel()
print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")
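The (tn, fp, fn, tp) unpacking order follows scikit-learn's convention for binary labels sorted as [0, 1]; a quick sanity check on hand-countable toy labels (illustrative, not from the tutorial dataset) confirms the ordering:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 3 actual negatives, 3 actual positives
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 0]

# ravel() flattens row by row: [tn, fp, fn, tp]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 1 2 1 2
```

Counting by hand matches: one negative predicted correctly, two negatives flagged as positive, one positive missed, two positives caught.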
Visualizing the Confusion Matrix
Raw numbers are hard to interpret. Visualization makes patterns obvious:
from sklearn.metrics import ConfusionMatrixDisplay
# Method 1: Using ConfusionMatrixDisplay (recommended)
disp = ConfusionMatrixDisplay(
confusion_matrix=cm,
display_labels=data.target_names
)
disp.plot(cmap='Blues', values_format='d')
plt.title('Confusion Matrix - Logistic Regression')
plt.show()
For more control over aesthetics, use seaborn:
# Method 2: Custom seaborn heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
cm,
annot=True,
fmt='d',
cmap='RdYlGn',
xticklabels=data.target_names,
yticklabels=data.target_names,
cbar_kws={'label': 'Count'}
)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix Heatmap')
plt.tight_layout()
plt.show()
Add percentage annotations for better context:
# Show both counts and percentages
plt.figure(figsize=(8, 6))
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
# Create annotations combining counts and percentages
annotations = np.array([[f'{count}\n({pct:.1%})'
for count, pct in zip(row_counts, row_pcts)]
for row_counts, row_pcts in zip(cm, cm_normalized)])
sns.heatmap(
cm,
annot=annotations,
fmt='',
cmap='Blues',
xticklabels=data.target_names,
yticklabels=data.target_names
)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix with Percentages')
plt.show()
Calculating Performance Metrics from the Matrix
Derive standard classification metrics directly from confusion matrix elements:
# Manual calculation
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * (precision * recall) / (precision + recall)
specificity = tn / (tn + fp)
print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall (Sensitivity): {recall:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"F1-Score: {f1_score:.3f}")
For automated calculation with additional metrics:
from sklearn.metrics import classification_report
print("\nClassification Report:")
print(classification_report(
y_test,
y_pred,
target_names=data.target_names
))
This produces a formatted table with precision, recall, F1-score, and support for each class, plus macro and weighted averages.
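When you need these numbers programmatically (for logging or dashboards) rather than as printed text, classification_report can return a nested dictionary instead. A small sketch with illustrative labels; the class names mirror the tutorial's dataset but the data here is synthetic:

```python
import pandas as pd
from sklearn.metrics import classification_report

# Illustrative labels: 0 = malignant, 1 = benign
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# output_dict=True returns nested dicts instead of a formatted string
report = classification_report(
    y_true, y_pred,
    target_names=['malignant', 'benign'],
    output_dict=True
)

# Convenient for logging: one row per class plus the averages
df = pd.DataFrame(report).transpose()
print(df.round(3))
```

Each per-class entry exposes precision, recall, f1-score, and support as plain floats you can threshold or track over time.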
Handling Multi-class Classification
Confusion matrices scale to multiple classes. Let’s use the iris dataset with three flower species:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
# Load multi-class dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3, random_state=42
)
# Train classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)
# Create multi-class confusion matrix
cm_multi = confusion_matrix(y_test, y_pred)
# Visualize
disp = ConfusionMatrixDisplay(
confusion_matrix=cm_multi,
display_labels=iris.target_names
)
# Pass an explicit Axes so the figsize is respected
# (calling plt.figure() before disp.plot() would create a second, empty figure)
fig, ax = plt.subplots(figsize=(10, 8))
disp.plot(ax=ax, cmap='viridis', values_format='d')
ax.set_title('Multi-class Confusion Matrix - Iris Dataset')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
For multi-class scenarios, extract per-class metrics:
from sklearn.metrics import precision_recall_fscore_support
precision, recall, f1, support = precision_recall_fscore_support(
y_test, y_pred, average=None
)
# Create results DataFrame
results_df = pd.DataFrame({
'Class': iris.target_names,
'Precision': precision,
'Recall': recall,
'F1-Score': f1,
'Support': support
})
print(results_df.to_string(index=False))
Analyze where the model confuses classes:
# Find most confused class pairs
# Zero the diagonal first so correct predictions don't crowd out real confusions
cm_offdiag = cm_multi.copy()
np.fill_diagonal(cm_offdiag, 0)
n_classes = len(iris.target_names)
indices = np.argsort(cm_offdiag.flatten())[::-1]
print("\nMost common confusions:")
for idx in indices[:3]:  # Top 3 off-diagonal cells
i, j = idx // n_classes, idx % n_classes
if cm_offdiag[i, j] > 0:
print(f"{iris.target_names[i]} misclassified as "
f"{iris.target_names[j]}: {cm_multi[i, j]} times")
Best Practices and Common Pitfalls
Always normalize when comparing across datasets. Raw counts are meaningless without context. Use normalize='true' to show percentages per actual class:
cm_norm = confusion_matrix(y_test, y_pred, normalize='true')
Watch for class imbalance. If you have 95 samples of class A and 5 of class B, a model predicting “A” for everything achieves 95% accuracy but 0% recall on class B. The confusion matrix reveals this immediately.
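The 95/5 scenario above is easy to reproduce synthetically; this sketch shows the always-predict-A model scoring 95% accuracy while the matrix and recall expose its total failure on class B:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score

# 95 samples of class 0 ("A"), 5 samples of class 1 ("B")
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # the model always predicts "A"

print(accuracy_score(y_true, y_pred))   # 0.95 -- looks great
print(recall_score(y_true, y_pred))     # 0.0  -- every "B" is missed
print(confusion_matrix(y_true, y_pred)) # bottom-left cell holds all 5 misses
```

Accuracy alone would pass this model; the zeroed second column of the matrix fails it at a glance.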
Use confusion matrices alongside ROC curves and precision-recall curves. Confusion matrices show performance at a single threshold. For probabilistic classifiers, explore different thresholds:
from sklearn.metrics import roc_curve
# Get probability predictions
y_proba = clf.predict_proba(X_test_scaled)[:, 1]
# Try different thresholds
for threshold in [0.3, 0.5, 0.7]:
y_pred_custom = (y_proba >= threshold).astype(int)
cm_custom = confusion_matrix(y_test, y_pred_custom)
print(f"\nThreshold {threshold}:")
print(cm_custom)
Don’t ignore the confusion matrix’s off-diagonal patterns. Systematic misclassifications indicate feature engineering opportunities or class definition problems. If your fraud detector consistently misclassifies high-value transactions as fraud, you need features that distinguish legitimate high-value activity.
For production systems, monitor confusion matrices over time. Model drift often appears as gradual changes in specific matrix cells before overall accuracy degrades. Set up alerts on false negative rates for critical applications.
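A minimal monitoring check along these lines might recompute the false negative rate from each new batch's matrix and compare it to an alert threshold. This is only a sketch: the helper function name, the batch labels, and the 10% threshold are all illustrative choices, not a prescribed API:

```python
from sklearn.metrics import confusion_matrix

def false_negative_rate(y_true, y_pred):
    """FN / (FN + TP): the fraction of actual positives the model missed."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn / (fn + tp) if (fn + tp) else 0.0

# Illustrative production batch: 8 actual positives, 2 of them missed
y_true = [1] * 8 + [0] * 4
y_pred = [1] * 6 + [0] * 2 + [0] * 4

fnr = false_negative_rate(y_true, y_pred)
ALERT_THRESHOLD = 0.10  # arbitrary illustration; tune per application
if fnr > ALERT_THRESHOLD:
    print(f"ALERT: false negative rate {fnr:.1%} exceeds {ALERT_THRESHOLD:.0%}")
```

Tracking this one cell-derived rate per batch catches drift in the failure mode you care about before aggregate accuracy moves.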
Confusion matrices are the foundation of classification evaluation. Master them, and you’ll build better models by understanding not just that they fail, but precisely how and why they fail.