How to Perform the Hosmer-Lemeshow Test in Python

Key Insights

  • The Hosmer-Lemeshow test evaluates logistic regression calibration by comparing observed outcomes to predicted probabilities across risk groups, with a high p-value (typically > 0.05) indicating good model fit.
  • Python lacks a built-in implementation, but you can create a reliable custom function using pandas, NumPy, and scipy.stats in a few dozen lines of code.
  • While useful, the test has known limitations—it’s sensitive to sample size and the number of groups chosen, so always pair it with calibration plots and other metrics like the Brier score.

Introduction to the Hosmer-Lemeshow Test

When you build a logistic regression model, accuracy alone doesn’t tell the whole story. A model might correctly classify 85% of cases but still produce poorly calibrated probability estimates. If your model predicts a 70% probability of default for a loan applicant, you need confidence that roughly 70% of similar applicants actually default.

The Hosmer-Lemeshow test addresses this calibration question directly. Developed by David Hosmer and Stanley Lemeshow in 1980, it’s a goodness-of-fit test that answers: “Do the predicted probabilities from my logistic regression model match the observed outcomes?”

The test works by grouping observations into bins (typically 10 deciles) based on predicted probabilities, then comparing the expected number of events in each group to the actual observed count. Large discrepancies signal poor calibration. Small discrepancies—reflected in a high p-value—suggest your model’s probability estimates are trustworthy.

Use this test when you need probability estimates for decision-making, not just classifications. Credit scoring, medical diagnosis, and churn prediction all require well-calibrated probabilities.

Understanding the Test Mechanics

The Hosmer-Lemeshow test follows a straightforward process. First, sort all observations by their predicted probability. Then divide them into g groups (usually 10) of roughly equal size. For each group, calculate the expected number of positive outcomes (sum of predicted probabilities) and compare it to the observed count.

The test statistic follows a chi-square distribution:

$$H = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{E_g(1 - E_g/n_g)}$$

Where O_g is observed positives, E_g is expected positives, and n_g is the group size.

Here’s a manual calculation to build intuition:

import numpy as np
from scipy import stats

# Simulated data: 5 groups for simplicity
observed_positives = np.array([2, 5, 8, 12, 18])
expected_positives = np.array([2.1, 4.8, 7.5, 11.2, 17.9])
group_sizes = np.array([20, 20, 20, 20, 20])

# Calculate H-L statistic
observed_negatives = group_sizes - observed_positives
expected_negatives = group_sizes - expected_positives

# Chi-square components for positives and negatives
chi_sq_pos = ((observed_positives - expected_positives) ** 2) / expected_positives
chi_sq_neg = ((observed_negatives - expected_negatives) ** 2) / expected_negatives

hl_statistic = np.sum(chi_sq_pos + chi_sq_neg)
degrees_of_freedom = len(group_sizes) - 2  # g - 2

p_value = 1 - stats.chi2.cdf(hl_statistic, degrees_of_freedom)

print(f"H-L Statistic: {hl_statistic:.4f}")
print(f"Degrees of Freedom: {degrees_of_freedom}")
print(f"P-value: {p_value:.4f}")

A high p-value (conventionally > 0.05) means you cannot reject the null hypothesis that the model fits well. Unlike most hypothesis tests where you want to reject the null, here you’re hoping to fail to reject it.
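A small implementation note: the survival function scipy.stats.chi2.sf computes the same upper-tail probability as 1 - cdf, but without the floating-point cancellation that 1 - cdf suffers when the statistic is large and the true p-value is tiny. A quick sketch of the equivalence (the statistic and degrees of freedom here are arbitrary illustration values):

```python
from scipy import stats

# sf gives the upper-tail probability directly; it avoids the precision
# loss of computing 1 - cdf when the test statistic is large
hl_statistic = 5.2  # illustrative value, not from real data
dof = 3

p_via_cdf = 1 - stats.chi2.cdf(hl_statistic, dof)
p_via_sf = stats.chi2.sf(hl_statistic, dof)

print(f"1 - cdf: {p_via_cdf:.6f}")
print(f"sf:      {p_via_sf:.6f}")
```

Either form works for the examples in this article; sf is simply the safer habit when p-values can get very small.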

Preparing Your Data and Model

Let’s work with a realistic example. We’ll use a credit default dataset and fit a logistic regression model:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate synthetic credit data
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    random_state=42,
    flip_y=0.1  # Add some noise
)

# Create DataFrame for clarity
feature_names = [f'feature_{i}' for i in range(10)]
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit logistic regression
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Get predicted probabilities for positive class
y_pred_proba = model.predict_proba(X_test)[:, 1]

print(f"Sample size: {len(y_test)}")
print(f"Positive rate: {y_test.mean():.3f}")
print(f"Predicted probability range: [{y_pred_proba.min():.3f}, {y_pred_proba.max():.3f}]")

The predict_proba method returns probabilities for each class. We extract column 1 for the positive class probability—these are the values we’ll evaluate with the Hosmer-Lemeshow test.
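If you're ever unsure which column holds the positive class, the model's classes_ attribute gives the column ordering of predict_proba explicitly. A minimal self-contained check, using a toy dataset rather than the credit data above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny toy dataset, just to inspect the class-to-column mapping
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

# predict_proba columns follow the order of clf.classes_ (sorted labels),
# so the positive class (label 1) sits at column index 1 here
print(clf.classes_)
pos_col = list(clf.classes_).index(1)
proba_pos = clf.predict_proba(X_toy)[:, pos_col]
print(proba_pos.round(3))
```

Looking up the column via classes_ instead of hardcoding [:, 1] also protects you when labels aren't 0/1 (for example string labels).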

Implementing the Hosmer-Lemeshow Test

Python’s statsmodels library doesn’t include a built-in Hosmer-Lemeshow function, so we’ll build a robust implementation:

import pandas as pd
import numpy as np
from scipy import stats

def hosmer_lemeshow_test(y_true, y_pred_proba, n_groups=10):
    """
    Perform the Hosmer-Lemeshow goodness-of-fit test.
    
    Parameters:
    -----------
    y_true : array-like
        Binary observed outcomes (0 or 1)
    y_pred_proba : array-like
        Predicted probabilities for the positive class
    n_groups : int
        Number of groups (default 10 for deciles)
    
    Returns:
    --------
    dict with 'statistic', 'pvalue', 'dof', and 'table'
    """
    # Create DataFrame for manipulation
    data = pd.DataFrame({
        'y_true': y_true,
        'y_pred': y_pred_proba
    })
    
    # Sort by predicted probability and create groups
    data = data.sort_values('y_pred').reset_index(drop=True)
    
    # Use qcut for equal-sized groups, handling duplicates
    try:
        data['group'] = pd.qcut(data['y_pred'], q=n_groups, labels=False, duplicates='drop')
    except ValueError:
        # Fall back to cut if qcut fails
        data['group'] = pd.cut(data['y_pred'], bins=n_groups, labels=False)
    
    # Aggregate by group
    grouped = data.groupby('group').agg(
        n_obs=('y_true', 'count'),
        observed_pos=('y_true', 'sum'),
        expected_pos=('y_pred', 'sum')
    ).reset_index()
    
    grouped['observed_neg'] = grouped['n_obs'] - grouped['observed_pos']
    grouped['expected_neg'] = grouped['n_obs'] - grouped['expected_pos']
    
    # Calculate chi-square statistic
    # Avoid division by zero
    eps = 1e-10
    chi_sq_pos = ((grouped['observed_pos'] - grouped['expected_pos']) ** 2) / \
                 (grouped['expected_pos'] + eps)
    chi_sq_neg = ((grouped['observed_neg'] - grouped['expected_neg']) ** 2) / \
                 (grouped['expected_neg'] + eps)
    
    hl_statistic = (chi_sq_pos + chi_sq_neg).sum()
    
    # Degrees of freedom = number of groups - 2
    actual_groups = grouped['group'].nunique()
    dof = actual_groups - 2
    
    # P-value from chi-square distribution
    p_value = 1 - stats.chi2.cdf(hl_statistic, dof)
    
    return {
        'statistic': hl_statistic,
        'pvalue': p_value,
        'dof': dof,
        'table': grouped
    }

# Run the test
result = hosmer_lemeshow_test(y_test, y_pred_proba)

print(f"Hosmer-Lemeshow Statistic: {result['statistic']:.4f}")
print(f"Degrees of Freedom: {result['dof']}")
print(f"P-value: {result['pvalue']:.4f}")
print("\nGroup-level results:")
print(result['table'].round(3))

This implementation handles edge cases like duplicate probability values and provides a detailed breakdown table for inspection.
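To see why duplicates='drop' matters, consider heavily tied predicted probabilities, as you might get from a coarse scoring rule. A small standalone illustration of just the binning step (not the full test):

```python
import numpy as np
import pandas as pd

# Heavily tied predictions: only 3 distinct values across 30 observations,
# so 10 equal-frequency bins are impossible
tied_probs = pd.Series(np.repeat([0.2, 0.5, 0.8], 10))

# Without duplicates='drop', qcut raises a ValueError here; with it,
# pandas merges the duplicate bin edges and returns fewer groups
groups = pd.qcut(tied_probs, q=10, labels=False, duplicates='drop')
n_groups = groups.nunique()
print(f"Requested 10 groups, got {n_groups}")
```

When ties collapse many bins, the degrees of freedom (groups - 2) can shrink toward zero, so check the 'dof' value returned by the function before trusting the p-value.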

Interpreting Results and Visualization

The test output requires careful interpretation. A p-value above 0.05 suggests adequate fit, but don’t stop there. Always visualize calibration:

import matplotlib.pyplot as plt

def plot_calibration(y_true, y_pred_proba, n_bins=10, title="Calibration Plot"):
    """
    Create a calibration plot comparing predicted vs observed probabilities.
    """
    data = pd.DataFrame({
        'y_true': y_true,
        'y_pred': y_pred_proba
    })
    
    # Create bins based on predicted probability
    data['bin'] = pd.qcut(data['y_pred'], q=n_bins, duplicates='drop')
    
    # Calculate mean predicted and observed for each bin
    calibration = data.groupby('bin', observed=True).agg(
        mean_predicted=('y_pred', 'mean'),
        mean_observed=('y_true', 'mean'),
        count=('y_true', 'count')
    ).reset_index()
    
    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Calibration curve
    ax1 = axes[0]
    ax1.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    ax1.scatter(calibration['mean_predicted'], calibration['mean_observed'], 
                s=calibration['count']/5, alpha=0.7, label='Model')
    ax1.plot(calibration['mean_predicted'], calibration['mean_observed'], 
             'b-', alpha=0.5)
    ax1.set_xlabel('Mean Predicted Probability')
    ax1.set_ylabel('Observed Frequency')
    ax1.set_title(title)
    ax1.legend()
    ax1.set_xlim(0, 1)
    ax1.set_ylim(0, 1)
    
    # Bar chart of observed vs expected
    ax2 = axes[1]
    x = range(len(calibration))
    width = 0.35
    ax2.bar([i - width/2 for i in x], calibration['mean_predicted'], 
            width, label='Predicted', alpha=0.7)
    ax2.bar([i + width/2 for i in x], calibration['mean_observed'], 
            width, label='Observed', alpha=0.7)
    ax2.set_xlabel('Decile Group')
    ax2.set_ylabel('Probability')
    ax2.set_title('Predicted vs Observed by Group')
    ax2.legend()
    ax2.set_xticks(x)
    ax2.set_xticklabels([f'D{i+1}' for i in x])
    
    plt.tight_layout()
    plt.savefig('calibration_plot.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    return calibration

calibration_data = plot_calibration(y_test, y_pred_proba)
print("\nCalibration by decile:")
print(calibration_data.round(3))

Well-calibrated models show points clustering along the diagonal. Systematic deviations—curves above or below the line—indicate over- or under-prediction in certain probability ranges.

Limitations and Alternatives

The Hosmer-Lemeshow test has well-documented weaknesses. It’s sensitive to sample size: with large samples (n > 25,000), even trivial miscalibration produces significant p-values. With small samples, it lacks power to detect genuine problems.

The choice of groups matters too. Ten groups is conventional, but different groupings can yield different conclusions. Some researchers recommend 8-12 groups, but there’s no universally optimal choice.
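You can observe this sensitivity directly by rerunning the test with different group counts on the same predictions. The sketch below uses a compact inline version of the statistic (the same formula as above) on simulated, well-calibrated data; the statistic and p-value shift with the number of groups even though the model is unchanged:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate a well-calibrated model: outcomes drawn from the predicted probs
probs = rng.uniform(0.05, 0.95, size=1000)
y = rng.binomial(1, probs)

def hl_stat(y, p, g):
    """Compact H-L statistic: equal-frequency groups on sorted predictions."""
    order = np.argsort(p)
    y_sorted, p_sorted = y[order], p[order]
    h = 0.0
    for idx in np.array_split(np.arange(len(p)), g):
        o, e, n = y_sorted[idx].sum(), p_sorted[idx].sum(), len(idx)
        h += (o - e) ** 2 / (e * (1 - e / n))
    return h, stats.chi2.sf(h, g - 2)

for g in (8, 10, 12):
    h, p_val = hl_stat(y, probs, g)
    print(f"g={g:2d}  H={h:6.2f}  p={p_val:.3f}")
```

If a p-value hovers near your threshold for one choice of g but not another, treat the result as inconclusive and lean on the calibration plot instead.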

Consider these alternatives:

from sklearn.metrics import brier_score_loss
from sklearn.calibration import calibration_curve

# Brier Score: lower is better, ranges 0-1
brier = brier_score_loss(y_test, y_pred_proba)
print(f"Brier Score: {brier:.4f}")

# Scikit-learn's calibration curve
prob_true, prob_pred = calibration_curve(y_test, y_pred_proba, n_bins=10)
print("\nCalibration curve points:")
for pt, pp in zip(prob_true, prob_pred):
    print(f"  Predicted: {pp:.3f}, Observed: {pt:.3f}")

The Brier score provides a single summary metric. Calibration curves from scikit-learn offer visual assessment. For formal testing with large samples, consider the le Cessie-van Houwelingen test, which is less sensitive to sample size.

Conclusion

The Hosmer-Lemeshow test remains a useful tool for evaluating logistic regression calibration, particularly for moderate sample sizes (500-5,000 observations). Apply it when probability estimates drive decisions, not just classifications.

For robust model validation, combine the Hosmer-Lemeshow test with calibration plots and the Brier score. No single metric tells the complete story. When the test suggests poor fit, examine the group-level table to identify where predictions diverge from observations—this guides model improvement.

Remember: a passing Hosmer-Lemeshow test doesn’t guarantee your model is good, only that its probability estimates roughly match observed frequencies. Always validate with held-out data and consider the full suite of diagnostic tools available.
