How to Calculate F1 Score in Python
Accuracy is a liar. When 95% of your dataset belongs to one class, a model that blindly predicts that class achieves 95% accuracy while learning nothing. This is where F1 score becomes essential.
Key Insights
- F1 score is the harmonic mean of precision and recall, making it superior to accuracy for imbalanced datasets where one class dominates
- The choice of averaging method (micro, macro, weighted) dramatically affects F1 scores in multi-class problems—macro treats all classes equally while weighted accounts for class imbalance
- F1 score punishes extreme trade-offs between precision and recall more harshly than arithmetic mean, forcing you to build balanced classifiers
Introduction to F1 Score
F1 score is the harmonic mean of precision and recall, providing a single metric that balances both false positives and false negatives. Unlike accuracy, which treats all errors equally, F1 score focuses specifically on your positive class performance. This makes it invaluable for imbalanced datasets in domains like fraud detection, medical diagnosis, spam filtering, and anomaly detection.
The harmonic mean penalizes extreme values. If your model has 100% precision but 10% recall, the arithmetic mean would give you 55%, but F1 score delivers a more honest 18%. This forces you to build models that perform well on both metrics simultaneously.
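You can verify those numbers in a couple of lines:

```python
# Arithmetic vs harmonic mean for an extreme precision/recall split
precision, recall = 1.0, 0.1

arithmetic = (precision + recall) / 2
harmonic = 2 * (precision * recall) / (precision + recall)  # the F1 formula

print(f"Arithmetic mean: {arithmetic:.0%}")   # 55%
print(f"F1 (harmonic mean): {harmonic:.0%}")  # 18%
```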
Understanding the Components: Precision and Recall
Before calculating F1 score, you need to understand its building blocks: precision and recall. Both metrics derive from the confusion matrix, which categorizes predictions into four groups.
```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Example predictions and true labels
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Calculate confusion matrix components
true_positives = np.sum((y_true == 1) & (y_pred == 1))
false_positives = np.sum((y_true == 0) & (y_pred == 1))
true_negatives = np.sum((y_true == 0) & (y_pred == 0))
false_negatives = np.sum((y_true == 1) & (y_pred == 0))

# Create confusion matrix
conf_matrix = np.array([[true_negatives, false_positives],
                        [false_negatives, true_positives]])

print(f"True Positives: {true_positives}")
print(f"False Positives: {false_positives}")
print(f"True Negatives: {true_negatives}")
print(f"False Negatives: {false_negatives}")

# Visualize
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
```
Precision answers: “Of all instances I predicted as positive, how many were actually positive?” It’s calculated as TP / (TP + FP). High precision means few false alarms.
Recall (also called sensitivity) answers: “Of all actual positive instances, how many did I correctly identify?” It’s calculated as TP / (TP + FN). High recall means you’re catching most positive cases.
The trade-off is fundamental: a more conservative classifier (predicting positive less often) gains precision at the cost of recall, while a more liberal one does the opposite. You can’t improve both simultaneously without improving the underlying model.
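From the confusion-matrix counts, both metrics fall out directly; a self-contained recap using the same labels as above:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # 4
fp = np.sum((y_true == 0) & (y_pred == 1))  # 1
fn = np.sum((y_true == 1) & (y_pred == 0))  # 1

precision = tp / (tp + fp)  # 4 / 5
recall = tp / (tp + fn)     # 4 / 5

print(f"Precision: {precision:.2f}")  # 0.80
print(f"Recall: {recall:.2f}")        # 0.80
```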
The F1 Score Formula
F1 score combines precision and recall using the harmonic mean:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Why harmonic mean instead of arithmetic? The harmonic mean is always less than or equal to the arithmetic mean, and it severely penalizes imbalanced values. When one metric is low, F1 score will be low regardless of how high the other metric is.
```python
# Manual F1 score calculation from scratch
def calculate_metrics(y_true, y_pred):
    """Calculate precision, recall, and F1 score manually."""
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    # Calculate confusion matrix components
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    # Calculate metrics
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return precision, recall, f1

# Example usage
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
precision, recall, f1 = calculate_metrics(y_true, y_pred)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 Score: {f1:.3f}")

# Compare different scenarios (negatives must be present for
# precision to drop below 1.0)
scenarios = [
    ("Balanced", [1, 1, 0, 0], [1, 0, 1, 0]),
    ("High Precision, Low Recall", [1, 1, 1, 1, 0], [1, 0, 0, 0, 0]),
    ("Low Precision, High Recall", [1, 0, 0, 0], [1, 1, 1, 1])
]
for name, true, pred in scenarios:
    p, r, f = calculate_metrics(true, pred)
    print(f"\n{name}:")
    print(f"  Precision: {p:.3f}, Recall: {r:.3f}, F1: {f:.3f}")
```
Calculating F1 Score with Scikit-learn
In practice, use scikit-learn’s optimized implementation rather than rolling your own.
```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Binary classification
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

f1 = f1_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Binary Classification:")
print(f"F1 Score: {f1:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
```
For multi-class problems, the averaging method matters significantly:
```python
# Multi-class classification
y_true_multi = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
y_pred_multi = [0, 2, 1, 0, 1, 2, 0, 2, 2, 0]

# Macro: Calculate F1 for each class, then average (treats classes equally)
f1_macro = f1_score(y_true_multi, y_pred_multi, average='macro')

# Micro: Calculate globally by counting total TP, FP, FN
f1_micro = f1_score(y_true_multi, y_pred_multi, average='micro')

# Weighted: Calculate F1 for each class, average weighted by support
f1_weighted = f1_score(y_true_multi, y_pred_multi, average='weighted')

# Per-class F1 scores
f1_per_class = f1_score(y_true_multi, y_pred_multi, average=None)

print(f"\nMulti-class F1 Scores:")
print(f"Macro (unweighted average): {f1_macro:.3f}")
print(f"Micro (global): {f1_micro:.3f}")
print(f"Weighted (by support): {f1_weighted:.3f}")
print(f"Per-class: {f1_per_class}")
```
Macro averaging treats all classes equally, which is what you want when rare classes matter as much as common ones. Micro averaging aggregates contributions globally by counting total TP, FP, and FN; for single-label multi-class problems it is identical to accuracy. Weighted averaging weights each class’s F1 by its support, which is useful for reporting overall performance while acknowledging imbalance.
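To see the micro/accuracy relationship concretely, here is a quick check reusing the multi-class labels above:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true_multi = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
y_pred_multi = [0, 2, 1, 0, 1, 2, 0, 2, 2, 0]

micro = f1_score(y_true_multi, y_pred_multi, average='micro')
acc = accuracy_score(y_true_multi, y_pred_multi)  # 7 of 10 correct

print(f"Micro F1: {micro:.3f}")  # 0.700
print(f"Accuracy: {acc:.3f}")    # 0.700
```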
F1 Score in Real-World Scenarios
Let’s build a fraud detection model where F1 score reveals what accuracy hides.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Create imbalanced dataset (1% fraud rate)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15,
                           n_redundant=5, weights=[0.99, 0.01],
                           flip_y=0.01, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Compare metrics
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score: {f1:.3f}")
print(f"\nDetailed Report:")
print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))

# Show why accuracy is misleading
print(f"\nFraud cases in test set: {np.sum(y_test)}")
print(f"Fraud cases detected: {np.sum(y_pred)}")
print(f"True positives: {np.sum((y_test == 1) & (y_pred == 1))}")
```
This example demonstrates why accuracy alone is insufficient. A model could achieve 99% accuracy by predicting everything as legitimate, but its F1 score would be 0% because it catches no fraud.
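That failure mode is easy to reproduce with an always-negative baseline. A minimal sketch on simulated labels (the sample size and exact fraud rate here are illustrative, not taken from the dataset above):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Simulated test labels with roughly a 1% positive (fraud) rate
rng = np.random.default_rng(42)
y_test_sim = (rng.random(3000) < 0.01).astype(int)

# Baseline that predicts "legitimate" (0) for every transaction
y_baseline = np.zeros_like(y_test_sim)

acc = accuracy_score(y_test_sim, y_baseline)
f1_baseline = f1_score(y_test_sim, y_baseline, zero_division=0)

print(f"Accuracy: {acc:.3f}")          # close to 0.99
print(f"F1 Score: {f1_baseline:.3f}")  # 0.000
```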
Variations and Related Metrics
F1 score weighs precision and recall equally, but sometimes you need to prioritize one over the other. F-beta score generalizes F1 with a configurable beta parameter.
```python
from sklearn.metrics import fbeta_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# F1 score (beta=1): Equal weight
f1 = fbeta_score(y_true, y_pred, beta=1)

# F2 score (beta=2): Emphasizes recall over precision
f2 = fbeta_score(y_true, y_pred, beta=2)

# F0.5 score (beta=0.5): Emphasizes precision over recall
f0_5 = fbeta_score(y_true, y_pred, beta=0.5)

print(f"F1 Score (β=1.0): {f1:.3f}")
print(f"F2 Score (β=2.0): {f2:.3f}")
print(f"F0.5 Score (β=0.5): {f0_5:.3f}")

# Manual F-beta calculation
precision, recall, _ = calculate_metrics(y_true, y_pred)
beta = 2
f_beta_manual = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)
print(f"\nManual F2 calculation: {f_beta_manual:.3f}")
```
Use F2 score when false negatives are more costly (medical diagnosis—missing a disease is worse than a false alarm). Use F0.5 score when false positives are more costly (spam detection—blocking legitimate emails is worse than letting spam through).
Best Practices and Common Pitfalls
Choose the right averaging method. For imbalanced multi-class problems, macro averaging reveals per-class performance issues that weighted averaging might hide. If class 2 has terrible F1 but represents only 1% of data, weighted F1 will look fine while macro F1 will expose the problem.
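Here is a contrived illustration of that hiding effect (hypothetical labels in which the rare class is never predicted correctly):

```python
from sklearn.metrics import f1_score

# 99 class-0 instances predicted perfectly; the single class-2 instance is missed
y_true = [0] * 99 + [2]
y_pred = [0] * 99 + [0]

weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0)
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f"Weighted F1: {weighted:.3f}")  # looks healthy
print(f"Macro F1: {macro:.3f}")        # exposes the missed class
```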
Handle edge cases. When a class has no positive predictions or no actual positives, precision or recall becomes undefined (a 0/0 division). Scikit-learn returns 0 and emits a warning by default, but you should investigate why your model completely misses a class.
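Scikit-learn’s metric functions expose a `zero_division` parameter so you can make that behavior explicit and silence the warning:

```python
from sklearn.metrics import f1_score

# No positive predictions at all: precision is 0/0, which is undefined
y_true = [1, 1, 0, 0]
y_pred = [0, 0, 0, 0]

score = f1_score(y_true, y_pred, zero_division=0)  # explicitly treat 0/0 as 0
print(score)  # 0.0
```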
Don’t optimize for F1 blindly. F1 score is a reporting metric, not always the best optimization target. Consider optimizing for precision at a specific recall threshold, or use ROC-AUC during training and F1 for evaluation.
Report multiple metrics. Always show precision and recall alongside F1 score. An F1 of 0.7 could mean balanced 70/70 or imbalanced 95/55—the interpretation differs dramatically.
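A quick check confirms that both profiles really do land at (roughly) the same F1:

```python
def f1_from(p, r):
    """F1 as the harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

print(f"Balanced   (P=0.70, R=0.70): F1 = {f1_from(0.70, 0.70):.2f}")  # 0.70
print(f"Imbalanced (P=0.95, R=0.55): F1 = {f1_from(0.95, 0.55):.2f}")  # 0.70
```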
Set appropriate thresholds. For probabilistic classifiers, the default 0.5 threshold is arbitrary. Use precision-recall curves to find the threshold that achieves your desired precision-recall trade-off, then calculate F1 at that threshold.
```python
from sklearn.metrics import precision_recall_curve

# Get prediction probabilities
y_proba = model.predict_proba(X_test)[:, 1]

# Calculate precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# precisions/recalls have one more entry than thresholds; drop the final
# (precision=1, recall=0) endpoint so the indices line up
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-10)

# Find the threshold that maximizes F1
best_threshold_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_threshold_idx]

print(f"Best threshold: {best_threshold:.3f}")
print(f"Best F1 score: {f1_scores[best_threshold_idx]:.3f}")
```
F1 score is a powerful tool for evaluating classification models, particularly on imbalanced datasets. Master its calculation, understand its limitations, and always interpret it in context with precision and recall. Your models—and your stakeholders—will thank you.