How to Calculate Precision and Recall in Python
Key Insights
- Precision measures how many of your positive predictions were correct, while recall measures how many actual positives you found—choose based on whether false positives or false negatives are more costly for your use case.
- Scikit-learn provides built-in functions for calculating these metrics, but understanding the underlying confusion matrix (TP, FP, TN, FN) is essential for debugging model performance and making informed decisions.
- The F1-score combines precision and recall into a single metric, but blindly optimizing for it can hurt your model—always consider your business requirements when selecting thresholds and evaluation metrics.
Introduction to Precision and Recall
Accuracy is a terrible metric for most real-world classification problems. If 99% of your emails are legitimate, a model that labels everything as “not spam” achieves 99% accuracy while being completely useless.
Precision and recall solve this problem by focusing on how well your model performs on the class you actually care about. Precision answers: “Of all the items I predicted as positive, how many were actually positive?” Recall answers: “Of all the actual positive items, how many did I correctly identify?”
These metrics derive from the confusion matrix, which breaks down your predictions into four categories:
- True Positives (TP): Correctly predicted positive cases
- True Negatives (TN): Correctly predicted negative cases
- False Positives (FP): Incorrectly predicted as positive (Type I error)
- False Negatives (FN): Incorrectly predicted as negative (Type II error)
Here’s a simple visualization:
import matplotlib.pyplot as plt
import numpy as np

# Example confusion matrix values
confusion = np.array([[50, 10],   # TN=50, FP=10
                      [5, 35]])   # FN=5,  TP=35

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(confusion, cmap='Blues')

# Add labels
ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xticklabels(['Predicted Negative', 'Predicted Positive'])
ax.set_yticklabels(['Actually Negative', 'Actually Positive'])

# Add count annotations
for i in range(2):
    for j in range(2):
        ax.text(j, i, confusion[i, j],
                ha="center", va="center", color="black", fontsize=20)

# Add TN/FP/FN/TP labels above each count
labels = [['TN', 'FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        ax.text(j, i - 0.3, labels[i][j],
                ha="center", va="center", color="gray", fontsize=12)

plt.title('Confusion Matrix')
plt.colorbar(im)
plt.tight_layout()
plt.show()
Understanding the Mathematical Formulas
The formulas for precision and recall are straightforward:
Precision = TP / (TP + FP)
Precision tells you the proportion of positive predictions that were correct. High precision means few false alarms.
Recall = TP / (TP + FN)
Recall tells you the proportion of actual positives you successfully identified. High recall means you’re catching most of the positive cases.
The critical insight: you can’t optimize both simultaneously. Predicting everything as positive gives you perfect recall (you caught all positives!) but terrible precision (mostly false alarms). Predicting only your most confident cases gives high precision but poor recall (you missed most positives).
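A toy numeric sketch makes the trade-off concrete (all counts here are hypothetical):

```python
# Suppose the test set has 40 actual positives and 60 actual negatives.
pos, neg = 40, 60

# Strategy 1: predict everything as positive.
tp, fp, fn = pos, neg, 0
precision_all = tp / (tp + fp)   # 40 / 100 = 0.40
recall_all = tp / (tp + fn)      # 40 / 40  = 1.00

# Strategy 2: predict positive only for the 10 most confident cases,
# of which 9 turn out to be truly positive.
tp, fp, fn = 9, 1, pos - 9
precision_conf = tp / (tp + fp)  # 9 / 10 = 0.90
recall_conf = tp / (tp + fn)     # 9 / 40 = 0.225

print(f"Predict all positive: precision={precision_all:.2f}, recall={recall_all:.2f}")
print(f"Confident cases only: precision={precision_conf:.2f}, recall={recall_conf:.2f}")
```

Pushing one metric to its extreme collapses the other; real models live somewhere between these two strategies.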
Let’s implement these from scratch:
def calculate_precision(tp, fp):
    """Calculate precision from true positives and false positives."""
    if tp + fp == 0:
        return 0.0
    return tp / (tp + fp)

def calculate_recall(tp, fn):
    """Calculate recall from true positives and false negatives."""
    if tp + fn == 0:
        return 0.0
    return tp / (tp + fn)

# Example: Medical diagnosis
# 35 correct disease predictions (TP)
# 10 false alarms (FP)
# 5 missed cases (FN)
tp, fp, fn = 35, 10, 5

precision = calculate_precision(tp, fp)
recall = calculate_recall(tp, fn)

print(f"Precision: {precision:.2f}")  # 0.78 - 78% of positive predictions were correct
print(f"Recall: {recall:.2f}")        # 0.88 - caught 88% of actual positive cases
Calculating Precision and Recall with Scikit-learn
Don’t reinvent the wheel. Scikit-learn provides robust implementations that handle edge cases and support multi-class classification:
from sklearn.metrics import precision_score, recall_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Generate sample dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=15, n_redundant=5,
                           n_classes=2, weights=[0.7, 0.3],
                           random_state=42)
# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Calculate metrics
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
# Get comprehensive report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
For multi-class problems, specify the averaging strategy:
# Multi-class example (random labels, just to demonstrate the API)
y_multiclass = np.random.randint(0, 3, 100)
y_pred_multi = np.random.randint(0, 3, 100)
# Macro: Calculate metric for each class, then average (treats all classes equally)
precision_macro = precision_score(y_multiclass, y_pred_multi, average='macro')
# Weighted: Average weighted by class support (better for imbalanced datasets)
precision_weighted = precision_score(y_multiclass, y_pred_multi, average='weighted')
# Micro: Calculate globally by counting total TP, FP, FN
precision_micro = precision_score(y_multiclass, y_pred_multi, average='micro')
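To see what macro averaging actually does, compare it against the per-class scores (`average=None` returns one score per class). The labels below are hand-made purely for illustration:

```python
import numpy as np
from sklearn.metrics import precision_score

# Small hand-made 3-class example
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0, 2])

# average=None returns one precision per class
per_class = precision_score(y_true, y_pred, average=None)
print(per_class)  # per-class precisions: 1/2, 2/3, 4/5

# Macro is the unweighted mean of those per-class scores
macro = precision_score(y_true, y_pred, average='macro')
print(round(macro, 3))  # 0.656
```

Weighted averaging works the same way, except each class's score is weighted by how many true samples it has.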
Working with Confusion Matrices
The confusion matrix is your debugging tool. When precision or recall is low, the confusion matrix shows you exactly where your model fails:
from sklearn.metrics import confusion_matrix
import seaborn as sns
# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Extract values for binary classification
tn, fp, fn, tp = cm.ravel()
print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")
# Manual calculation to verify
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
print(f"\nManual Precision: {manual_precision:.3f}")
print(f"Manual Recall: {manual_recall:.3f}")
# Visualize with seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()
The F1-Score: Balancing Precision and Recall
The F1-score is the harmonic mean of precision and recall. It’s useful when you need a single metric but care about both false positives and false negatives:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean punishes extreme values—an F1 of 0.9 requires both precision and recall to be high.
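A quick hand-computed check of that claim:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Balanced scores: harmonic and arithmetic means agree
print(round(f1(0.8, 0.8), 3))  # 0.8

# Extreme imbalance: the arithmetic mean would be 0.55,
# but the harmonic mean collapses toward the weaker score
print(round(f1(1.0, 0.1), 3))  # 0.182
```

A model with perfect precision but 10% recall scores an F1 of only 0.18, which is exactly the behavior you want from a metric that demands both.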
from sklearn.metrics import f1_score
# Calculate F1
f1 = f1_score(y_test, y_pred)
print(f"F1-Score: {f1:.3f}")
# Compare across different thresholds
from sklearn.metrics import precision_recall_curve
y_scores = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_scores)
# Calculate F1 for each threshold
f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-10)
# Find optimal threshold
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx] if optimal_idx < len(thresholds) else 0.5
print(f"Optimal threshold: {optimal_threshold:.3f}")
print(f"Best F1-Score: {f1_scores[optimal_idx]:.3f}")
The F-beta score lets you weight precision or recall more heavily:
from sklearn.metrics import fbeta_score
# Beta > 1 favors recall, Beta < 1 favors precision
f2_score = fbeta_score(y_test, y_pred, beta=2) # Recall is 2x more important
f05_score = fbeta_score(y_test, y_pred, beta=0.5) # Precision is 2x more important
print(f"F2-Score (recall-focused): {f2_score:.3f}")
print(f"F0.5-Score (precision-focused): {f05_score:.3f}")
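Behind `fbeta_score` is the general formula F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), which reduces to F1 when beta = 1. Here's a from-scratch sanity check with hand-picked precision and recall values:

```python
def fbeta(precision, recall, beta):
    """General F-beta: beta > 1 weights recall more, beta < 1 weights precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.9, 0.6
print(round(fbeta(p, r, 1.0), 3))  # 0.72  - beta=1 reduces to F1
print(round(fbeta(p, r, 2.0), 3))  # 0.643 - pulled toward recall (0.6)
print(round(fbeta(p, r, 0.5), 3))  # 0.818 - pulled toward precision (0.9)
```

As beta grows, the score tracks recall more closely; as it shrinks toward zero, the score tracks precision.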
Practical Example: Spam Detection
Let’s build a spam classifier where the costs of false positives and false negatives differ significantly:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# Sample email data
emails = [
    "Win free money now!", "Meeting at 3pm tomorrow",
    "Claim your prize today!", "Project deadline reminder",
    "You've won the lottery!", "Lunch plans?",
    "Free viagra online", "Can you review my code?",
    "Make money from home", "Team standup at 10am"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1=spam, 0=ham
# Vectorize and train
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)
# Train classifier
clf = MultinomialNB()
clf.fit(X, labels)
# Predictions (on the training data, which is fine for this toy demo;
# always evaluate on held-out data in practice)
y_pred = clf.predict(X)
y_scores = clf.predict_proba(X)[:, 1]
# Evaluate
precision = precision_score(labels, y_pred)
recall = recall_score(labels, y_pred)
f1 = f1_score(labels, y_pred)
print("Spam Detection Performance:")
print(f"Precision: {precision:.3f} - {precision*100:.1f}% of spam predictions are correct")
print(f"Recall: {recall:.3f} - Catching {recall*100:.1f}% of actual spam")
print(f"F1-Score: {f1:.3f}")
# Business interpretation
print("\nBusiness Impact:")
if precision < 0.9:
print("⚠️ Low precision means legitimate emails are being marked as spam")
if recall < 0.8:
print("⚠️ Low recall means spam is reaching users' inboxes")
Best Practices and Common Pitfalls
Choose metrics based on cost. In fraud detection, missing fraud (low recall) is expensive. In content moderation, false accusations (low precision) damage user trust. Don’t blindly optimize F1.
Imbalanced datasets skew metrics. If only 1% of cases are positive, even terrible models can have decent precision. Always examine both metrics and the confusion matrix.
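A minimal sketch with made-up counts shows how misleading accuracy gets at 1% positive rate:

```python
# 1000 samples, 10 actual positives (1%)
# Hypothetical model: finds 2 of the 10 positives, with 1 false alarm
tp, fp, fn, tn = 2, 1, 8, 989

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"Accuracy:  {accuracy:.3f}")   # 0.991 - looks great
print(f"Precision: {precision:.3f}")  # 0.667 - looks decent
print(f"Recall:    {recall:.3f}")     # 0.200 - the model misses 80% of positives
```

Accuracy and precision both look respectable here, and only recall exposes that the model misses most of the cases you care about.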
Threshold tuning is critical. Most classifiers output probabilities—the 0.5 default threshold is arbitrary:
from sklearn.metrics import precision_recall_curve

# Recompute probability scores for the random forest from earlier
# (y_scores was overwritten by the spam example above)
y_scores = model.predict_proba(X_test)[:, 1]

# Plot precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_scores)
plt.figure(figsize=(10, 6))
plt.plot(thresholds, precisions[:-1], label='Precision', linewidth=2)
plt.plot(thresholds, recalls[:-1], label='Recall', linewidth=2)
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall vs Threshold')
plt.legend()
plt.grid(True)
plt.show()
# Choose threshold based on business requirements
# Example: Require 95% precision for spam detection
target_precision = 0.95
# precision_recall_curve appends a final precision of 1.0 that has no
# matching threshold, so clamp the index to stay inside the thresholds array
idx = min(np.argmax(precisions >= target_precision), len(thresholds) - 1)
selected_threshold = thresholds[idx]
selected_recall = recalls[idx]
print(f"To achieve {target_precision:.0%} precision:")
print(f"Use threshold: {selected_threshold:.3f}")
print(f"Resulting recall: {selected_recall:.3f}")
The precision-recall curve shows the trade-off across all thresholds. Use it to make informed decisions about where to operate your model based on your specific business constraints.
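If you also want a single threshold-independent summary of that curve, scikit-learn's `average_precision_score` approximates the area under it. A self-contained sketch with synthetic labels and scores (not the dataset from earlier):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
# Scores correlated with the labels, plus uniform noise
y_score = y_true * 0.5 + rng.random(200)

ap = average_precision_score(y_true, y_score)
print(f"Average precision: {ap:.3f}")

# Baseline: a no-skill classifier's average precision equals the positive rate
print(f"Positive-rate baseline: {y_true.mean():.3f}")
```

Comparing average precision against the positive-rate baseline (rather than 0.5, as with ROC AUC) keeps the summary honest on imbalanced data.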