How to Calculate AUC-ROC in Python
Key Insights
- AUC-ROC measures a classifier’s ability to distinguish between classes across all classification thresholds, making it threshold-independent unlike accuracy or F1-score
- For imbalanced datasets, precision-recall curves often provide more meaningful insights than ROC curves since ROC can be overly optimistic when negative classes dominate
- Understanding manual AUC calculation reveals it’s simply the area under the TPR vs FPR curve, computed using the trapezoidal rule across all decision thresholds
Introduction to AUC-ROC
AUC-ROC (Area Under the Receiver Operating Characteristic Curve) is one of the most widely used metrics for evaluating binary classification models. Unlike accuracy, which depends on a single classification threshold, AUC-ROC evaluates model performance across all possible thresholds, giving you a comprehensive view of how well your model separates positive and negative classes.
The metric ranges from 0 to 1, where 0.5 represents random guessing and 1.0 represents perfect classification. In practice, you’ll rarely see values below 0.5—if you do, your model is performing worse than random chance, and you should invert your predictions.
Use AUC-ROC when you need a single metric that captures overall model discrimination ability, particularly when you’re comparing multiple models or when the optimal threshold isn’t predetermined. However, don’t rely on it exclusively for imbalanced datasets, where precision-recall metrics often tell a more complete story.
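The inversion point is easy to verify: flipping the scores (1 - score) flips the AUC to 1 - AUC. A minimal sketch with deliberately backwards scores (the data here is made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 0, 1, 1, 0])
# Deliberately bad scores: every positive gets a LOWER score than every negative
y_scores = np.array([0.9, 0.2, 0.8, 0.1, 0.3, 0.7])

auc = roc_auc_score(y_true, y_scores)
inverted_auc = roc_auc_score(y_true, 1 - y_scores)

print(f"Original AUC: {auc:.2f}")    # 0.00 -- worse than random
print(f"Inverted AUC: {inverted_auc:.2f}")  # 1.00
```

An AUC far below 0.5 usually signals flipped labels or a sign error in the scores, not a genuinely "anti-predictive" model.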
Understanding the ROC Curve
The ROC curve plots True Positive Rate (TPR) against False Positive Rate (FPR) at various classification thresholds. TPR, also called sensitivity or recall, measures the proportion of actual positives correctly identified:
TPR = True Positives / (True Positives + False Negatives)
FPR measures the proportion of actual negatives incorrectly classified as positive:
FPR = False Positives / (False Positives + True Negatives)
A perfect classifier has an ROC curve that goes straight up the y-axis to (0,1) and then across to (1,1), creating a right angle. A random classifier produces a diagonal line from (0,0) to (1,1). The further your curve bows toward the top-left corner, the better your model performs.
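To ground the two formulas, here is a small sketch that computes TPR and FPR at one fixed threshold (0.5) from a confusion matrix; the ROC curve simply repeats this calculation at every threshold. The toy labels and scores are illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.75, 0.6, 0.45])

# Binarize at a single threshold
y_pred = (y_scores >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # sensitivity / recall
fpr = fp / (fp + tn)  # fall-out

print(f"TPR at threshold 0.5: {tpr:.2f}")
print(f"FPR at threshold 0.5: {fpr:.2f}")
```

Lowering the threshold moves you up and to the right along the curve (more positives caught, more false alarms); raising it does the opposite.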
Here’s how to visualize a basic ROC curve:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
# Simulate predictions and true labels
np.random.seed(42)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.75, 0.6, 0.45])
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--',
         label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(alpha=0.3)
plt.show()
Calculating AUC-ROC with scikit-learn
The most straightforward approach uses scikit-learn’s built-in functions. Here’s a complete example using a real dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Train model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
# Get predicted probabilities for the positive class
y_pred_proba = model.predict_proba(X_test)[:, 1]
# Calculate AUC-ROC
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"AUC-ROC Score: {auc_score:.4f}")
# Get ROC curve components
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {auc_score:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Breast Cancer Classification')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
The key is using predict_proba() instead of predict(). You need probability scores, not binary predictions, because AUC-ROC evaluates performance across all thresholds.
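The difference is easy to demonstrate: passing hard 0/1 predictions to roc_auc_score collapses the curve to a single operating point, which typically understates the model's discrimination. A self-contained sketch on the same breast cancer data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# Continuous scores: AUC evaluated across all thresholds
auc_proba = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
# Hard labels: the "curve" degenerates to one threshold
auc_hard = roc_auc_score(y_test, model.predict(X_test))

print(f"AUC from probabilities: {auc_proba:.4f}")
print(f"AUC from hard labels:   {auc_hard:.4f}")
```

The hard-label score is still a valid number, which makes this mistake easy to miss; it just answers a narrower question.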
Manual AUC Calculation
Understanding how AUC is calculated demystifies the metric. The area under the curve is computed using the trapezoidal rule—essentially summing up trapezoids formed between consecutive points on the ROC curve:
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
def calculate_auc_manual(y_true, y_scores):
    """
    Calculate AUC manually using the trapezoidal rule.
    """
    # Get FPR and TPR values at every threshold
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    # Calculate AUC using the trapezoidal rule
    auc = 0.0
    for i in range(1, len(fpr)):
        # Width of trapezoid
        width = fpr[i] - fpr[i - 1]
        # Average height of trapezoid
        height = (tpr[i] + tpr[i - 1]) / 2
        # Add area of trapezoid
        auc += width * height
    return auc
# Test with example data
np.random.seed(42)
y_true = np.random.randint(0, 2, 100)
y_scores = np.random.random(100)
# Compare manual calculation with sklearn
manual_auc = calculate_auc_manual(y_true, y_scores)
sklearn_auc = roc_auc_score(y_true, y_scores)
print(f"Manual AUC: {manual_auc:.6f}")
print(f"Sklearn AUC: {sklearn_auc:.6f}")
print(f"Difference: {abs(manual_auc - sklearn_auc):.10f}")
You can also use NumPy’s trapezoidal integration for a more concise implementation (note that np.trapz was renamed np.trapezoid in NumPy 2.0):
def calculate_auc_numpy(y_true, y_scores):
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    return np.trapz(tpr, fpr)  # use np.trapezoid(tpr, fpr) on NumPy >= 2.0
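There is also an equivalent, arguably more intuitive definition: AUC is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counting as half. A brute-force sketch of that pairwise view, checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_pairwise(y_true, y_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = y_scores[y_true == 1]
    neg = y_scores[y_true == 0]
    correct = (pos[:, None] > neg[None, :]).sum()   # positive outranks negative
    ties = (pos[:, None] == neg[None, :]).sum()     # ties count as half
    return (correct + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
y_scores = rng.random(200)

print(f"Pairwise AUC: {auc_pairwise(y_true, y_scores):.6f}")
print(f"Sklearn AUC:  {roc_auc_score(y_true, y_scores):.6f}")
```

The two definitions agree because the trapezoidal area under the ROC curve is mathematically identical to this rank statistic (the Mann-Whitney U statistic, normalized).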
Multi-class AUC-ROC
Extending AUC-ROC to multi-class problems requires treating it as multiple binary classification problems. The two main approaches are One-vs-Rest (OvR) and One-vs-One (OvO).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score
import numpy as np
# Load iris dataset (3 classes)
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Binarize labels for multi-class ROC
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
# Train One-vs-Rest classifier
classifier = OneVsRestClassifier(LogisticRegression(max_iter=10000))
classifier.fit(X_train, y_train)
# Get probability predictions
y_score = classifier.predict_proba(X_test)
# Calculate macro-averaged AUC (average of per-class AUC)
macro_auc = roc_auc_score(y_test_bin, y_score, average='macro')
# Calculate micro-averaged AUC (aggregate all classes)
micro_auc = roc_auc_score(y_test_bin, y_score, average='micro')
# Calculate weighted AUC (weighted by class support)
weighted_auc = roc_auc_score(y_test_bin, y_score, average='weighted')
print(f"Macro-averaged AUC: {macro_auc:.4f}")
print(f"Micro-averaged AUC: {micro_auc:.4f}")
print(f"Weighted AUC: {weighted_auc:.4f}")
# Per-class AUC
for i in range(3):
    class_auc = roc_auc_score(y_test_bin[:, i], y_score[:, i])
    print(f"Class {i} AUC: {class_auc:.4f}")
Macro-averaging treats all classes equally regardless of size, while micro-averaging weights each class by its number of samples, so larger classes dominate. Choose macro when every class matters equally, and micro (or weighted) when performance on the majority classes should drive the score.
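For One-vs-One, and as a shortcut that skips manual binarization for OvR too, roc_auc_score accepts the raw integer labels directly through its multi_class parameter. A sketch on the same iris setup:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # shape (n_samples, 3)

# No label_binarize needed: pass integer labels plus the probability matrix
ovr_auc = roc_auc_score(y_test, proba, multi_class='ovr', average='macro')
ovo_auc = roc_auc_score(y_test, proba, multi_class='ovo', average='macro')

print(f"OvR macro AUC: {ovr_auc:.4f}")
print(f"OvO macro AUC: {ovo_auc:.4f}")
```

OvO averages AUC over every pair of classes, which makes it less sensitive to class imbalance than OvR at the cost of more pairwise comparisons.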
Interpreting Results and Best Practices
AUC-ROC interpretation follows these general guidelines:
- 0.90-1.00: Excellent discrimination
- 0.80-0.90: Good discrimination
- 0.70-0.80: Fair discrimination
- 0.60-0.70: Poor discrimination
- 0.50-0.60: Failure (barely better than random)
However, these thresholds aren’t universal. In medical diagnostics, you might need AUC > 0.95, while in marketing applications, 0.75 might be acceptable.
For imbalanced datasets, compare AUC-ROC with precision-recall curves:
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.datasets import make_classification
# Create imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
                           n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_scores = model.predict_proba(X_test)[:, 1]
# Calculate both metrics
roc_auc = roc_auc_score(y_test, y_scores)
avg_precision = average_precision_score(y_test, y_scores)
print(f"ROC-AUC: {roc_auc:.4f}")
print(f"Average Precision: {avg_precision:.4f}")
# Plot both curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_scores)
ax1.plot(fpr, tpr, label=f'AUC = {roc_auc:.3f}')
ax1.plot([0, 1], [0, 1], 'k--')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('ROC Curve (Imbalanced Data)')
ax1.legend()
ax1.grid(alpha=0.3)
# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_scores)
ax2.plot(recall, precision, label=f'AP = {avg_precision:.3f}')
ax2.set_xlabel('Recall')
ax2.set_ylabel('Precision')
ax2.set_title('Precision-Recall Curve (Imbalanced Data)')
ax2.legend()
ax2.grid(alpha=0.3)
plt.tight_layout()
plt.show()
On imbalanced datasets, ROC-AUC can appear deceptively high because the large number of true negatives inflates the metric. Precision-recall curves focus on the positive class and provide a more realistic assessment.
Common pitfalls to avoid:
- Using predictions instead of probabilities: Always use predict_proba(), not predict()
- Ignoring class imbalance: Supplement with precision-recall analysis
- Comparing AUC across different datasets: AUC is dataset-dependent
- Treating AUC as the only metric: Combine with confusion matrices and domain-specific metrics
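One more safeguard worth folding into that strategy: a single train/test split yields a noisy AUC estimate, and cross-validation with scoring='roc_auc' is a cheap way to see the variance. A sketch using the breast cancer data from earlier:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=10000)

# cross_val_score uses stratified 5-fold CV for classifiers by default;
# the 'roc_auc' scorer calls predict_proba internally
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')

print(f"AUC per fold: {scores.round(4)}")
print(f"Mean AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the fold-to-fold spread is large relative to the differences between candidate models, the single-split comparison you started with may not be trustworthy.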
AUC-ROC remains a powerful tool for model evaluation, but use it as part of a comprehensive evaluation strategy, not as your sole decision criterion.