How to Plot the ROC Curve in Python


Key Insights

  • The ROC curve plots True Positive Rate against False Positive Rate across all classification thresholds, providing a threshold-independent view of classifier performance where AUC scores closer to 1.0 indicate better models.
  • Use sklearn.metrics.roc_curve() and roc_auc_score() to calculate ROC components, then plot with matplotlib—always include the diagonal reference line representing random guessing for context.
  • ROC curves excel at evaluating balanced datasets but can be misleading with severe class imbalance; in those cases, switch to Precision-Recall curves to get a more accurate picture of model performance.

Introduction to ROC Curves

The ROC (Receiver Operating Characteristic) curve is one of the most important tools for evaluating binary classification models. It visualizes the trade-off between a model’s ability to correctly identify positive cases (sensitivity) and its tendency to incorrectly flag negative cases as positive.

At its core, the ROC curve plots two metrics:

  • True Positive Rate (TPR): Also called sensitivity or recall, calculated as TP / (TP + FN). This represents the proportion of actual positives correctly identified.
  • False Positive Rate (FPR): Calculated as FP / (FP + TN). This represents the proportion of actual negatives incorrectly classified as positive.

The curve is generated by sweeping the classification threshold across the range of predicted scores and calculating TPR and FPR at each value. A perfect classifier would have an ROC curve that goes straight up to the top-left corner (100% TPR, 0% FPR), while a random classifier produces a diagonal line.
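These two rates fall straight out of a confusion matrix. Here's a minimal sketch (toy labels and illustrative variable names) using sklearn.metrics.confusion_matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions at one fixed threshold
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # sensitivity / recall
fpr = fp / (fp + tn)  # fall-out

print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")  # TPR = 0.75, FPR = 0.25
```

An ROC curve is simply this pair of numbers recomputed at every threshold.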

The Area Under the Curve (AUC) summarizes the ROC curve into a single metric ranging from 0 to 1. An AUC of 0.5 indicates random guessing, while 1.0 represents perfect classification. In practice, AUC scores above 0.8 are generally considered good, though this depends heavily on your domain.
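AUC also has a handy probabilistic reading: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting half). A small sketch on synthetic scores verifies this (data and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
scores = y + rng.normal(scale=1.0, size=y.size)  # informative but noisy scores

auc = roc_auc_score(y, scores)

# Pairwise estimate: fraction of (positive, negative) pairs ranked correctly
pos, neg = scores[y == 1], scores[y == 0]
diff = pos[:, None] - neg[None, :]
pairwise = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

print(f"AUC = {auc:.4f}, pairwise estimate = {pairwise:.4f}")
```

The two numbers agree because the area under the ROC curve is mathematically identical to this ranking probability (the normalized Mann-Whitney U statistic).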

Prerequisites and Setup

You’ll need three core libraries: scikit-learn for machine learning utilities, matplotlib for visualization, and numpy for numerical operations.

# Installation
# pip install scikit-learn matplotlib numpy

# Imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score

These imports give you everything needed to generate data, train models, calculate ROC metrics, and create visualizations.

Building a Simple Binary Classifier

Before plotting ROC curves, you need a trained classifier and probability predictions. Here’s a complete example using a synthetic dataset:

# Generate synthetic binary classification dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    random_state=42
)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Generate probability predictions (crucial for ROC)
y_pred_proba = model.predict_proba(X_test)[:, 1]

The key here is using predict_proba() instead of predict(). ROC curves require probability scores, not binary predictions, because they evaluate performance across all possible thresholds. The [:, 1] indexing extracts probabilities for the positive class.
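To make the difference concrete, compare the two prediction methods on a few rows; a minimal self-contained sketch (dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

hard = model.predict(X[:3])        # hard 0/1 labels, no ranking information
soft = model.predict_proba(X[:3])  # one column per class; each row sums to 1

print(hard)
print(soft.shape)   # (3, 2)
print(soft[:, 1])   # probability of the positive class -- what roc_curve() needs
```

Passing the hard labels to roc_curve() would collapse the curve to a single point, which is why the probability column matters.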

Calculating ROC Curve Components

Scikit-learn makes ROC calculation straightforward with the roc_curve() function:

# Calculate ROC curve components
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Calculate AUC score
auc_score = roc_auc_score(y_test, y_pred_proba)

print(f"AUC Score: {auc_score:.3f}")
print(f"Number of thresholds: {len(thresholds)}")

The function returns three arrays:

  • fpr: False Positive Rates at each threshold
  • tpr: True Positive Rates at each threshold
  • thresholds: The decision thresholds used

With drop_intermediate=False, the number of thresholds equals the number of unique predicted probabilities plus one sentinel value. By default, roc_curve() drops thresholds at points that wouldn't change the plotted curve's shape, so you'll often see fewer. For a 300-sample test set, you might get anywhere from a few dozen to roughly 300 thresholds, depending on how many unique probability values your model produces.
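To see the threshold count in action, compare scikit-learn's drop_intermediate option (enabled by default) against the full threshold set; a sketch on synthetic scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=300)
scores = y * 0.3 + rng.random(300)  # 300 continuous, effectively unique scores

_, _, thr_default = roc_curve(y, scores)                          # thinned
_, _, thr_full = roc_curve(y, scores, drop_intermediate=False)    # complete

print(len(thr_default), len(thr_full))
# Full set: one threshold per unique score, plus one sentinel at the start
```

The thinning is purely cosmetic: the dropped thresholds lie on straight segments of the curve, so the plot is unchanged.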

Plotting the ROC Curve

Now create a professional-looking ROC curve visualization:

def plot_roc_curve(fpr, tpr, auc_score, label=None):
    """
    Plot a single ROC curve with AUC score.
    
    Parameters:
    -----------
    fpr : array
        False positive rates
    tpr : array
        True positive rates
    auc_score : float
        Area under the curve score
    label : str, optional
        Label for the curve
    """
    plt.figure(figsize=(8, 6))
    
    # Plot ROC curve
    if label:
        plt.plot(fpr, tpr, linewidth=2, 
                label=f'{label} (AUC = {auc_score:.3f})')
    else:
        plt.plot(fpr, tpr, linewidth=2, 
                label=f'ROC Curve (AUC = {auc_score:.3f})')
    
    # Plot diagonal reference line (random classifier)
    plt.plot([0, 1], [0, 1], 'k--', linewidth=1, 
            label='Random Classifier (AUC = 0.5)')
    
    # Formatting
    plt.xlabel('False Positive Rate', fontsize=12)
    plt.ylabel('True Positive Rate', fontsize=12)
    plt.title('ROC Curve', fontsize=14, fontweight='bold')
    plt.legend(loc='lower right', fontsize=10)
    plt.grid(alpha=0.3)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    
    plt.tight_layout()
    plt.show()

# Use the function
plot_roc_curve(fpr, tpr, auc_score, label='Logistic Regression')

The diagonal reference line is critical—it represents a classifier that makes random predictions. Any useful model should have a curve well above this line. The closer your curve hugs the top-left corner, the better your classifier performs across all thresholds.

Comparing Multiple Models

Comparing ROC curves from different models on the same plot reveals which performs best:

# Train multiple models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

plt.figure(figsize=(10, 7))

# Plot ROC curve for each model
for name, model in models.items():
    # Train model
    model.fit(X_train, y_train)
    
    # Get probability predictions
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate ROC curve and AUC
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    auc_score = roc_auc_score(y_test, y_pred_proba)
    
    # Plot
    plt.plot(fpr, tpr, linewidth=2, 
            label=f'{name} (AUC = {auc_score:.3f})')

# Add reference line
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, 
        label='Random Classifier')

# Formatting
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve Comparison', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=10)
plt.grid(alpha=0.3)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

plt.tight_layout()
plt.show()

When comparing models, the one with the highest AUC and the curve closest to the top-left corner generally performs best. However, consider the specific region of the curve that matters for your application. If minimizing false positives is critical, focus on the left side of the curve where FPR is low.
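When only the low-FPR region matters, roc_auc_score() accepts a max_fpr argument that computes a standardized partial AUC over just that region. A quick sketch, with a setup mirroring the earlier synthetic example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

full_auc = roc_auc_score(y_te, proba)
# Standardized partial AUC restricted to FPR <= 0.1 (0.5 = random, 1.0 = perfect)
partial_auc = roc_auc_score(y_te, proba, max_fpr=0.1)

print(f"Full AUC: {full_auc:.3f}, partial AUC (FPR <= 0.1): {partial_auc:.3f}")
```

Two models with similar full AUCs can rank quite differently on partial AUC, so this is a useful tiebreaker when false positives dominate your costs.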

Best Practices and Interpretation

Interpreting AUC Scores:

  • 0.9-1.0: Excellent
  • 0.8-0.9: Good
  • 0.7-0.8: Fair
  • 0.6-0.7: Poor
  • 0.5-0.6: Fail (barely better than random)

An AUC below 0.5 suggests your model is performing worse than random guessing—you might have swapped your class labels.
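A quick toy check of this symmetry: AUC on a reversed ranking is one minus the original, so flipping the score direction repairs it.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])  # ranks the classes backwards

print(roc_auc_score(y, scores))   # 0.0 -- every negative outranks every positive
print(roc_auc_score(y, -scores))  # 1.0 -- negating the scores fixes the ranking
```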

When ROC Curves Mislead:

ROC curves can be deceptive with imbalanced datasets. If you have 95 negative samples and 5 positive samples, an operating point at 90% TPR and 10% FPR might look strong. But with 95 negatives, a 10% FPR means roughly 9 or 10 false positives—nearly double your total of 5 actual positives.

For imbalanced datasets, use Precision-Recall curves instead:

from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
ap_score = average_precision_score(y_test, y_pred_proba)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, linewidth=2, 
        label=f'AP = {ap_score:.3f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

Common Pitfalls:

  1. Using predicted classes instead of probabilities: Always use predict_proba(), not predict().
  2. Ignoring class imbalance: ROC curves don’t account for class distribution—use PR curves for imbalanced data.
  3. Overfitting to AUC: A high AUC doesn’t guarantee good performance at the threshold you’ll actually use in production.
  4. Comparing across different datasets: AUC scores are only comparable when models are evaluated on the same test set.

Choose your operating point on the ROC curve based on business requirements. If false positives are costly (like in fraud detection), select a threshold with low FPR even if it means lower TPR. If missing positives is worse (like in disease screening), optimize for high TPR.
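One way to pick that operating point programmatically is to scan the arrays roc_curve() returns; here's a sketch that selects the highest TPR subject to an FPR cap (the 5% cap and the synthetic setup are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, proba)

# Highest TPR among all thresholds whose FPR stays at or below 5%
mask = fpr <= 0.05
best = np.argmax(tpr[mask])
chosen = thresholds[mask][best]
print(f"threshold={chosen:.3f}, TPR={tpr[mask][best]:.3f}, FPR={fpr[mask][best]:.3f}")

# Apply the chosen threshold to get hard predictions at that operating point
y_pred = (proba >= chosen).astype(int)
```

Swap the FPR cap for a TPR floor (e.g. `tpr >= 0.95`, minimizing FPR within it) when missing positives is the costlier error.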

The ROC curve gives you the full picture of your classifier’s performance, but you still need domain knowledge to select the right threshold for your specific use case.
