How to Perform the Hosmer-Lemeshow Test in R

Key Insights

  • The Hosmer-Lemeshow test divides observations into groups based on predicted probabilities and compares observed versus expected outcomes using a chi-square statistic—a non-significant p-value (> 0.05) suggests adequate model fit.
  • The test is sensitive to the number of groups chosen and sample size, making it unreliable for small datasets or when arbitrary grouping decisions substantially change results.
  • Modern practice increasingly favors calibration plots over the Hosmer-Lemeshow test because they provide visual insight into where your model succeeds or fails across the probability range.

Introduction to the Hosmer-Lemeshow Test

When you build a logistic regression model, you need to know whether it actually fits your data well. The Hosmer-Lemeshow test is a classic goodness-of-fit test designed specifically for this purpose. It answers a straightforward question: do the predicted probabilities from your model align with the observed outcomes in your data?

The test works by dividing your observations into groups (typically 10) based on their predicted probabilities. Within each group, it compares how many events you actually observed versus how many the model predicted. If these numbers diverge substantially across groups, the test flags poor model fit.
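The grouped chi-square is simple enough to compute by hand, which makes the mechanics transparent. Below is a minimal sketch on simulated data; all names are illustrative, and `hoslem.test()` (shown later) does this for you:

```r
set.seed(1)

# Simulate a binary outcome and fit a logistic model
x <- rnorm(300)
y <- rbinom(300, 1, plogis(-0.5 + x))
p <- fitted(glm(y ~ x, family = binomial))

# Split observations into g = 10 groups by predicted probability
g <- 10
grp <- cut(p, breaks = quantile(p, probs = seq(0, 1, 1 / g)),
           include.lowest = TRUE)

# Observed events, expected events, and group sizes
obs <- tapply(y, grp, sum)
exp_events <- tapply(p, grp, sum)
n_g <- tapply(y, grp, length)

# Hosmer-Lemeshow statistic: sum_k (O_k - E_k)^2 / (E_k * (1 - E_k / n_k)),
# referred to a chi-square distribution with g - 2 degrees of freedom
hl_stat <- sum((obs - exp_events)^2 / (exp_events * (1 - exp_events / n_g)))
hl_p <- pchisq(hl_stat, df = g - 2, lower.tail = FALSE)
c(statistic = hl_stat, p.value = hl_p)
```

The denominator is the binomial variance of the group's event count, so each group contributes a standardized squared deviation between observed and expected events.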

You should reach for this test when you’ve built a logistic regression model and want formal statistical evidence about its calibration. It’s particularly useful when you need to report model diagnostics in academic papers or regulatory submissions where reviewers expect to see established goodness-of-fit metrics.

Prerequisites and Setup

You’ll need the ResourceSelection package, which provides the most widely used implementation of the Hosmer-Lemeshow test in R. The generalhoslem package offers additional variants if you need them.

# Install packages if needed
install.packages("ResourceSelection")
install.packages("generalhoslem")

# Load libraries
library(ResourceSelection)
library(generalhoslem)

For our examples, we’ll create a simulated dataset that represents a typical binary classification problem. This approach lets us control the data characteristics and demonstrate the test clearly.

# Set seed for reproducibility
set.seed(42)

# Simulate data for a medical screening scenario
n <- 500
age <- rnorm(n, mean = 55, sd = 12)
bmi <- rnorm(n, mean = 27, sd = 5)
smoking <- rbinom(n, 1, 0.3)

# Create linear predictor and binary outcome
linear_pred <- -6 + 0.05 * age + 0.08 * bmi + 0.7 * smoking
prob <- 1 / (1 + exp(-linear_pred))
disease <- rbinom(n, 1, prob)

# Combine into data frame
health_data <- data.frame(
  disease = disease,
  age = age,
  bmi = bmi,
  smoking = smoking
)

# Check the outcome distribution
table(health_data$disease)

This simulated dataset represents 500 patients with a binary disease outcome predicted by age, BMI, and smoking status. With these coefficients, roughly a third of patients have the disease, which gives enough events in every risk group for the diagnostics we demonstrate next.

Building a Logistic Regression Model

Before testing goodness-of-fit, you need a fitted logistic regression model. The glm() function with family = binomial handles this in R.

# Fit logistic regression model
model <- glm(disease ~ age + bmi + smoking, 
             data = health_data, 
             family = binomial(link = "logit"))

# View model summary
summary(model)

The summary output shows coefficient estimates, standard errors, and significance tests for each predictor. However, these tell you about individual variable effects—not overall model fit.

# Extract fitted probabilities
fitted_probs <- fitted(model)

# Examine distribution of predicted probabilities
summary(fitted_probs)
hist(fitted_probs, breaks = 20, 
     main = "Distribution of Predicted Probabilities",
     xlab = "Predicted Probability of Disease",
     col = "steelblue")

The fitted probabilities are what the Hosmer-Lemeshow test will evaluate. A well-calibrated model should produce probabilities that, when grouped, match the actual proportion of events observed in each group.
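One property worth knowing before running any formal test: a maximum-likelihood logistic fit with an intercept is always calibrated in aggregate, because the intercept's score equation forces the fitted probabilities to sum to the observed event count. Misfit can therefore only appear within subgroups. A quick check (this snippet recreates the objects from above so it runs standalone):

```r
# Recreate health_data and fitted_probs from the earlier sections
set.seed(42)
n <- 500
age <- rnorm(n, 55, 12); bmi <- rnorm(n, 27, 5); smoking <- rbinom(n, 1, 0.3)
disease <- rbinom(n, 1, plogis(-6 + 0.05 * age + 0.08 * bmi + 0.7 * smoking))
health_data <- data.frame(disease, age, bmi, smoking)
fitted_probs <- fitted(glm(disease ~ age + bmi + smoking,
                           data = health_data, family = binomial))

# Aggregate calibration holds by construction...
sum(fitted_probs) - sum(health_data$disease)  # essentially zero

# ...so the informative comparison is within probability groups
deciles <- cut(fitted_probs,
               breaks = quantile(fitted_probs, probs = seq(0, 1, 0.1)),
               include.lowest = TRUE)
cbind(mean_predicted = tapply(fitted_probs, deciles, mean),
      observed_rate = tapply(health_data$disease, deciles, mean))
```

The decile table here is exactly the comparison the Hosmer-Lemeshow test formalizes.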

Performing the Hosmer-Lemeshow Test

The hoslem.test() function from ResourceSelection requires the observed outcomes and fitted probabilities. The g parameter specifies the number of groups—10 is the conventional default.

# Perform Hosmer-Lemeshow test
hl_test <- hoslem.test(health_data$disease, fitted_probs, g = 10)

# Display results
print(hl_test)

The output includes three key pieces of information: the chi-square statistic, degrees of freedom (g - 2), and the p-value.

# Access individual components
hl_test$statistic  # Chi-square value
hl_test$parameter  # Degrees of freedom
hl_test$p.value    # P-value

Choosing the number of groups matters more than many practitioners realize. While 10 is standard, you should have enough observations per group for stable estimates. A rough guideline is at least 10-20 observations per group. With 500 observations and 10 groups, we have approximately 50 per group—plenty for reliable results.

# Test sensitivity to group count
hl_g8 <- hoslem.test(health_data$disease, fitted_probs, g = 8)
hl_g10 <- hoslem.test(health_data$disease, fitted_probs, g = 10)
hl_g12 <- hoslem.test(health_data$disease, fitted_probs, g = 12)

# Compare p-values
cat("g=8:  p =", round(hl_g8$p.value, 4), "\n")
cat("g=10: p =", round(hl_g10$p.value, 4), "\n")
cat("g=12: p =", round(hl_g12$p.value, 4), "\n")

If your p-value changes dramatically with different group counts, treat your results with skepticism.

Interpreting the Results

The Hosmer-Lemeshow test uses a chi-square distribution to assess whether observed and expected frequencies differ significantly. Here’s how to interpret the output:

Non-significant p-value (p > 0.05): The model’s predicted probabilities are consistent with observed outcomes. This suggests adequate fit, but doesn’t prove the model is correct or optimal.

Significant p-value (p ≤ 0.05): The model’s predictions deviate significantly from observations. The model may be missing important predictors, have incorrect functional forms, or otherwise fail to capture the data’s structure.

The contingency table provides deeper insight into where misfit occurs:

# Observed counts by group (expected counts are in hl_test$expected)
hl_table <- hl_test$observed
print(hl_table)

# Create a more informative comparison
comparison <- data.frame(
  group = 1:10,
  observed_0 = hl_test$observed[, 1],
  expected_0 = hl_test$expected[, 1],
  observed_1 = hl_test$observed[, 2],
  expected_1 = hl_test$expected[, 2]
)

# Calculate differences
comparison$diff_0 <- comparison$observed_0 - comparison$expected_0
comparison$diff_1 <- comparison$observed_1 - comparison$expected_1

print(comparison)

Large differences in specific groups reveal where your model struggles. If the model consistently underpredicts events in high-risk groups, you might need additional predictors or interaction terms.
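As an illustration of that workflow, the sketch below adds a hypothetical age-by-smoking interaction and reruns the test. Our simulated data contains no true interaction, so don't expect an improvement; the point is the comparison pattern. The snippet recreates the earlier objects so it runs standalone:

```r
library(ResourceSelection)

# Recreate the data and baseline model from earlier sections
set.seed(42)
n <- 500
age <- rnorm(n, 55, 12); bmi <- rnorm(n, 27, 5); smoking <- rbinom(n, 1, 0.3)
disease <- rbinom(n, 1, plogis(-6 + 0.05 * age + 0.08 * bmi + 0.7 * smoking))
health_data <- data.frame(disease, age, bmi, smoking)
model <- glm(disease ~ age + bmi + smoking, data = health_data, family = binomial)

# Extended model with an age-by-smoking interaction
model2 <- glm(disease ~ age * smoking + bmi, data = health_data, family = binomial)

# Rerun the goodness-of-fit test and compare the models directly
hoslem.test(health_data$disease, fitted(model2), g = 10)
anova(model, model2, test = "Chisq")
```

A likelihood-ratio test between the nested models complements the Hosmer-Lemeshow result: it tells you whether the added term itself earns its place.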

# Visualize observed vs expected
par(mfrow = c(1, 2))

barplot(rbind(comparison$observed_1, comparison$expected_1), 
        beside = TRUE, 
        names.arg = 1:10,
        col = c("steelblue", "coral"),
        main = "Events (Y=1) by Decile",
        xlab = "Risk Group", 
        ylab = "Count",
        legend.text = c("Observed", "Expected"))

barplot(rbind(comparison$observed_0, comparison$expected_0), 
        beside = TRUE, 
        names.arg = 1:10,
        col = c("steelblue", "coral"),
        main = "Non-Events (Y=0) by Decile",
        xlab = "Risk Group", 
        ylab = "Count")

par(mfrow = c(1, 1))

Limitations and Alternatives

The Hosmer-Lemeshow test has well-documented weaknesses that you should understand before relying on it exclusively.

Sample size sensitivity: With large samples, the test often rejects models that are practically adequate. With small samples, it lacks power to detect genuine misfit. There’s no universally “right” sample size where it works optimally.
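The sample-size effect is easy to see by simulation. In the sketch below the model is misspecified in exactly the same mild way at every sample size (a quadratic term is deliberately omitted from the fit), yet the test's verdict typically changes purely because n grows:

```r
set.seed(7)
library(ResourceSelection)

# Hosmer-Lemeshow p-value for a linear-logit fit to mildly nonlinear data
hl_p_at_n <- function(n) {
  x <- rnorm(n)
  y <- rbinom(n, 1, plogis(-1 + x + 0.15 * x^2))  # true model has x^2
  fit <- glm(y ~ x, family = binomial)            # fitted model does not
  hoslem.test(y, fitted(fit), g = 10)$p.value
}

sapply(c(500, 5000, 50000), hl_p_at_n)
```

The degree of misfit is constant across runs; only the power to detect it changes. This is why a significant result in a very large dataset may reflect practically negligible miscalibration.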

Arbitrary grouping: The choice of group count affects results, and there’s no principled way to select it. Different groupings can yield contradictory conclusions from the same data.

Limited diagnostic value: A significant test tells you something is wrong but not what. You can’t determine whether you need different predictors, transformations, or model structures.

Calibration plots offer a superior alternative for most purposes:

# Create calibration plot
library(ggplot2)

# Bin predictions and calculate observed proportions
cal_data <- data.frame(
  predicted = fitted_probs,
  observed = health_data$disease
)

cal_data$bin <- cut(cal_data$predicted, 
                    breaks = seq(0, 1, by = 0.1), 
                    include.lowest = TRUE)

cal_summary <- aggregate(cbind(predicted, observed) ~ bin, 
                         data = cal_data, 
                         FUN = mean)

# Plot
ggplot(cal_summary, aes(x = predicted, y = observed)) +
  geom_point(size = 3, color = "steelblue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") +
  xlim(0, 1) + ylim(0, 1) +
  labs(x = "Mean Predicted Probability",
       y = "Observed Proportion",
       title = "Calibration Plot") +
  theme_minimal()

A well-calibrated model produces points that fall along the diagonal line. Deviations show you exactly where and how the model miscalibrates—information the Hosmer-Lemeshow test can’t provide.

The generalhoslem package generalizes the same grouped approach to multinomial and ordinal outcomes. Be aware that its logitgof() function is still a Hosmer-Lemeshow-type test, not the le Cessie-van Houwelingen test; the latter replaces arbitrary grouping with residual smoothing.

# Using generalhoslem package
library(generalhoslem)

# Hosmer-Lemeshow-type goodness-of-fit test; for a binary outcome
# this is the standard Hosmer-Lemeshow test
logitgof(health_data$disease, fitted_probs, g = 10, ord = FALSE)
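The le Cessie-van Houwelingen test itself is available through the rms package, which implements the closely related unweighted sum-of-squares variant via residuals() on an lrm fit. A sketch of that route, recreating the data so it runs standalone:

```r
library(rms)

# Recreate the dataset from earlier sections
set.seed(42)
n <- 500
age <- rnorm(n, 55, 12); bmi <- rnorm(n, 27, 5); smoking <- rbinom(n, 1, 0.3)
disease <- rbinom(n, 1, plogis(-6 + 0.05 * age + 0.08 * bmi + 0.7 * smoking))
health_data <- data.frame(disease, age, bmi, smoking)

# lrm() must keep the design matrix and response for residual-based tests
lrm_fit <- lrm(disease ~ age + bmi + smoking,
               data = health_data, x = TRUE, y = TRUE)

# Global goodness-of-fit test with no grouping decisions;
# reports a Z statistic and two-sided p-value
residuals(lrm_fit, type = "gof")
```

Because no group count is chosen, this test sidesteps the grouping sensitivity discussed above.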

Conclusion

The Hosmer-Lemeshow test remains a standard tool for assessing logistic regression fit, despite its limitations. Here’s the practical workflow:

  1. Fit your logistic regression model with glm()
  2. Extract fitted probabilities with fitted()
  3. Run hoslem.test() with g = 10 as your starting point
  4. Examine both the p-value and the observed/expected table
  5. Supplement with calibration plots for visual diagnostics

When reporting results, include the chi-square statistic, degrees of freedom, and p-value. State your group count explicitly. If the test is significant, investigate the contingency table to understand where misfit occurs.

Don’t treat a non-significant Hosmer-Lemeshow test as proof of a good model—it only indicates the absence of detected misfit. Always combine it with other diagnostics, discrimination metrics like AUC, and substantive evaluation of whether your model makes scientific sense.
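Discrimination is the natural companion check: a model can pass the Hosmer-Lemeshow test while ranking patients barely better than chance, and vice versa. A quick AUC computation with the pROC package, recreating the fitted objects so it runs standalone:

```r
library(pROC)

# Recreate the model from earlier sections
set.seed(42)
n <- 500
age <- rnorm(n, 55, 12); bmi <- rnorm(n, 27, 5); smoking <- rbinom(n, 1, 0.3)
disease <- rbinom(n, 1, plogis(-6 + 0.05 * age + 0.08 * bmi + 0.7 * smoking))
health_data <- data.frame(disease, age, bmi, smoking)
fitted_probs <- fitted(glm(disease ~ age + bmi + smoking,
                           data = health_data, family = binomial))

# AUC: probability the model ranks a random diseased patient
# above a random disease-free one (0.5 = chance, 1 = perfect)
roc_obj <- roc(health_data$disease, fitted_probs, quiet = TRUE)
auc(roc_obj)
```

Reporting AUC alongside the Hosmer-Lemeshow result covers both ranking ability and calibration in one line of a results table.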
