How to Perform a Score Test in R
Key Insights
- Score tests evaluate hypotheses using only the null model, making them computationally efficient when the alternative model is expensive to fit or may not converge.
- R provides built-in score test support through anova(..., test = "Rao") for GLMs, dispersiontest() for overdispersion, and survdiff() for survival analysis.
- Score tests are asymptotically equivalent to Wald and likelihood ratio tests but can produce different results in finite samples—use all three when results are borderline significant.
Introduction to Score Tests
Score tests, also called Lagrange multiplier tests, represent one of the three classical approaches to hypothesis testing in maximum likelihood estimation. While Wald tests and likelihood ratio tests get more attention in introductory statistics courses, score tests offer a distinct computational advantage: they only require fitting the model under the null hypothesis.
Consider testing whether a coefficient equals zero in a complex model. The Wald test requires fitting the full model and examining the coefficient’s standard error. The likelihood ratio test requires fitting both the null and alternative models, then comparing their likelihoods. The score test asks a different question: if we’re at the null hypothesis, how much does the likelihood function want to move toward the alternative?
This makes score tests particularly valuable when:
- The alternative model is computationally expensive or unstable
- You’re screening many potential predictors
- The full model may not converge
- You want to test constraints without re-estimating parameters
The trade-off is that score tests can be less powerful than likelihood ratio tests in small samples, though all three tests are asymptotically equivalent under standard regularity conditions.
Mathematical Foundation
The score function is the gradient of the log-likelihood with respect to the parameters:
$$U(\theta) = \frac{\partial \ell(\theta)}{\partial \theta}$$
At the maximum likelihood estimate, the score equals zero—you’re at the peak of the likelihood surface. The score test exploits this property by evaluating the score at the null hypothesis value. If the null is true, the score should be close to zero. Large score values suggest the null is wrong.
The test statistic follows a chi-square distribution:
$$S = U(\theta_0)^T I(\theta_0)^{-1} U(\theta_0) \sim \chi^2_k$$
Here, $I(\theta_0)$ is the Fisher information matrix evaluated at the null, and $k$ is the number of restrictions being tested. The information matrix acts as a scaling factor, accounting for the curvature of the likelihood surface.
Intuitively, the score test measures how steeply the likelihood is climbing away from the null hypothesis, normalized by how variable that slope estimate is. A steep slope relative to its uncertainty suggests the null is wrong.
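To make the formula concrete, here is a small hand computation for a case where the score and information have simple closed forms: testing a Poisson rate. This is an illustrative sketch (the data and hypothesized rate are made up for the example):

```r
# Score test by hand for a Poisson rate: H0: lambda = lambda0
set.seed(1)
y <- rpois(200, lambda = 2.4)   # simulated counts, true rate 2.4
lambda0 <- 2                    # hypothesized rate under the null
n <- length(y)

U <- sum(y) / lambda0 - n       # score: derivative of the log-likelihood at lambda0
info <- n / lambda0             # Fisher information at lambda0
S <- U^2 / info                 # simplifies to n * (mean(y) - lambda0)^2 / lambda0
p_value <- pchisq(S, df = 1, lower.tail = FALSE)
c(statistic = S, p_value = p_value)
```

Note that neither the null nor the alternative model had to be "fit" at all: everything is evaluated at the hypothesized value.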
Score Test for GLM Parameters
R’s anova() function supports score tests for generalized linear models through the test = "Rao" argument. This tests whether adding predictors significantly improves the model.
# Simulate data for logistic regression
set.seed(42)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n) # Noise variable
prob <- plogis(-0.5 + 0.8 * x1 + 1.2 * x2)
y <- rbinom(n, 1, prob)
data <- data.frame(y, x1, x2, x3)
# Fit null and full models
null_model <- glm(y ~ 1, family = binomial, data = data)
full_model <- glm(y ~ x1 + x2 + x3, family = binomial, data = data)
# Score test (Rao's test) for adding predictors
anova(null_model, full_model, test = "Rao")
The output shows the Rao statistic and p-value for each term added sequentially. You can also compare with other tests:
# Compare all three classical tests
anova(null_model, full_model, test = "Rao") # Score test
anova(null_model, full_model, test = "LRT") # Likelihood ratio test
anova(null_model, full_model, test = "Chisq") # Also LRT for binomial
# For individual coefficients, compare Wald (from summary) with score
summary(full_model) # Wald z-tests
# Test single predictor with score test
model_no_x3 <- glm(y ~ x1 + x2, family = binomial, data = data)
anova(model_no_x3, full_model, test = "Rao")
When testing whether x3 contributes to the model, the score test evaluates the derivative of the likelihood with respect to the x3 coefficient, computed at the model where that coefficient is constrained to zero.
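Because score tests only need the null fit, they are well suited to screening candidate predictors. add1() computes a Rao score test for each candidate added to the null model, without ever fitting the full model. A sketch, re-creating the simulated data from above so it runs standalone:

```r
# Re-create the simulated logistic data from earlier in this section
set.seed(42)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 1.2 * x2))
data <- data.frame(y, x1, x2, x3)
null_model <- glm(y ~ 1, family = binomial, data = data)

# One score test per candidate predictor, all computed from the null fit
screen <- add1(null_model, scope = ~ x1 + x2 + x3, test = "Rao")
screen
```

Each row tests one variable added on its own; x1 and x2 should show large Rao statistics here while the noise variable x3 should not.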
Score Test for Overdispersion
Poisson regression assumes the variance equals the mean. Real count data often violates this assumption, showing overdispersion where variance exceeds the mean. The AER package provides a score test specifically for this situation.
library(AER)
# Simulate overdispersed count data
set.seed(123)
n <- 300
x <- runif(n, 0, 5)
# Generate negative binomial data (overdispersed relative to Poisson)
mu <- exp(0.5 + 0.4 * x)
y_overdispersed <- rnbinom(n, size = 2, mu = mu)
# Fit Poisson model
pois_model <- glm(y_overdispersed ~ x, family = poisson)
# Score test for overdispersion
dispersiontest(pois_model)
# You can specify the alternative form of the variance function
# trafo = 1: Var(y) = mu + alpha * mu (linear; equivalent to the
# default trafo = NULL, which parameterizes it as Var(y) = (1 + alpha) * mu)
dispersiontest(pois_model, trafo = 1)
# trafo = 2: Var(y) = mu + alpha * mu^2 (negative binomial, NB2)
dispersiontest(pois_model, trafo = 2)
The test returns a z-statistic and p-value. A significant result indicates the Poisson assumption is violated, and you should consider negative binomial regression or quasi-Poisson models.
# Compare with actual dispersion
summary(pois_model)$dispersion # Assumed 1 for Poisson
# Estimate dispersion manually
pearson_resid <- residuals(pois_model, type = "pearson")
estimated_dispersion <- sum(pearson_resid^2) / pois_model$df.residual
estimated_dispersion # Values >> 1 indicate overdispersion
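If the test flags overdispersion, the two standard remedies are quasi-Poisson (same coefficients, standard errors inflated by the estimated dispersion) and negative binomial regression via glm.nb() from the MASS package. A sketch, re-simulating the overdispersed counts from above so it runs standalone:

```r
library(MASS)  # for glm.nb()

# Re-create the overdispersed counts simulated earlier in this section
set.seed(123)
n <- 300
x <- runif(n, 0, 5)
y_overdispersed <- rnbinom(n, size = 2, mu = exp(0.5 + 0.4 * x))

# Quasi-Poisson: scales standard errors by the estimated dispersion
qpois_model <- glm(y_overdispersed ~ x, family = quasipoisson)

# Negative binomial: models the extra variance with a size parameter theta
nb_model <- glm.nb(y_overdispersed ~ x)

summary(qpois_model)$dispersion  # estimated dispersion, well above 1 here
nb_model$theta                   # estimated NB size parameter
```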
Score Test in Survival Analysis
The survdiff() function in the survival package performs score tests (specifically, the log-rank test and its variants) for comparing survival curves between groups.
library(survival)
# Use built-in lung cancer dataset
data(lung)
lung$status <- lung$status - 1 # Convert to 0/1
# Score test comparing survival by sex
survdiff(Surv(time, status) ~ sex, data = lung)
The output provides the chi-square statistic and p-value for testing whether survival differs between groups. This is a score test derived from the Cox partial likelihood.
# Score test with multiple groups
# First, create age groups
lung$age_group <- cut(lung$age, breaks = c(0, 60, 70, 100),
labels = c("Under 60", "60-70", "Over 70"))
survdiff(Surv(time, status) ~ age_group, data = lung)
# Stratified test
survdiff(Surv(time, status) ~ sex + strata(age_group), data = lung)
# The rho parameter controls test weighting
# rho = 0: log-rank test (default)
# rho = 1: Peto-Peto test (weights early differences more)
survdiff(Surv(time, status) ~ sex, data = lung, rho = 1)
For Cox models, you can also extract score test results:
cox_model <- coxph(Surv(time, status) ~ sex + age + ph.ecog, data = lung)
summary(cox_model) # Includes Score (logrank) test at bottom
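That global score test can also be pulled out programmatically: summary.coxph stores it in the sctest component as a named vector of statistic, degrees of freedom, and p-value. A sketch, repeating the model fit so it runs standalone:

```r
library(survival)

# Refit the Cox model from above
data(lung)
lung$status <- lung$status - 1
cox_model <- coxph(Surv(time, status) ~ sex + age + ph.ecog, data = lung)

# Named vector: test statistic, degrees of freedom, p-value
summary(cox_model)$sctest
```

This is handy when you need the score test result inside a loop or a reporting function rather than in printed output.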
Manual Implementation
Understanding the mechanics helps you apply score tests in non-standard situations. Here’s a complete implementation for testing a binomial proportion:
score_test_proportion <- function(x, n, p0) {
  # x: number of successes
  # n: number of trials
  # p0: null hypothesis proportion

  # Score function for binomial: U = x/p - (n-x)/(1-p)
  # At p0:
  score <- x / p0 - (n - x) / (1 - p0)
  # Fisher information for binomial: I = n / (p * (1-p))
  info <- n / (p0 * (1 - p0))
  # Score test statistic
  test_stat <- score^2 / info
  # P-value from chi-square distribution
  p_value <- 1 - pchisq(test_stat, df = 1)
  list(
    score = score,
    information = info,
    statistic = test_stat,
    p_value = p_value
  )
}
# Example: Test if proportion equals 0.5
# Observed: 62 successes out of 100 trials
result <- score_test_proportion(x = 62, n = 100, p0 = 0.5)
print(result)
# Compare with prop.test (uses continuity correction by default)
prop.test(62, 100, p = 0.5, correct = FALSE)
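In fact, the uncorrected one-sample proportion test is exactly this score test, so the two statistics should agree to numerical precision. A quick check with the numbers above:

```r
# Manual score statistic for x = 62, n = 100, p0 = 0.5
score <- 62 / 0.5 - 38 / 0.5
info  <- 100 / (0.5 * 0.5)
manual_stat <- score^2 / info                 # 5.76

# prop.test without continuity correction uses the same statistic
pt <- prop.test(62, 100, p = 0.5, correct = FALSE)
all.equal(manual_stat, unname(pt$statistic))  # TRUE
```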
For a more complex example, here’s a score test for logistic regression coefficients:
score_test_logistic <- function(X, y, beta0, df = length(beta0)) {
  # X: design matrix (including intercept)
  # y: binary response
  # beta0: parameter values under the null
  # df: number of restrictions tested; defaults to all of beta0,
  #     but should be smaller when some components are nuisance
  #     parameters estimated at their null MLEs

  # Linear predictor and probabilities under null
  eta <- X %*% beta0
  p <- plogis(eta)
  # Score vector: X'(y - p)
  score <- t(X) %*% (y - p)
  # Information matrix: X' diag(p(1-p)) X
  W <- diag(as.vector(p * (1 - p)))
  info <- t(X) %*% W %*% X
  # Test statistic
  test_stat <- t(score) %*% solve(info) %*% score
  # P-value
  p_value <- 1 - pchisq(test_stat, df = df)
  list(
    score = as.vector(score),
    statistic = as.numeric(test_stat),
    df = df,
    p_value = as.numeric(p_value)
  )
}
# Test that both slopes are zero, with the intercept estimated under the null.
# Because the intercept sits at its null MLE, its score component is zero and
# only the two slope restrictions contribute: df = 2, not 3.
X <- cbind(1, x1, x2) # Using earlier simulated data
beta_null <- c(coef(null_model), 0, 0) # Null: slopes are zero
score_test_logistic(X, y, beta_null, df = 2)
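As a sanity check, the manual statistic can be compared against R's built-in Rao test from anova(). The sketch below re-simulates the data so it runs standalone; the two statistics should agree to numerical precision:

```r
# Re-create the simulated logistic data
set.seed(42)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 1.2 * x2))
dat <- data.frame(y, x1, x2)

m0 <- glm(y ~ 1, family = binomial, data = dat)
m1 <- glm(y ~ x1 + x2, family = binomial, data = dat)
builtin_rao <- anova(m0, m1, test = "Rao")$Rao[2]

# Manual version: score and information evaluated at the null fit
X <- cbind(1, x1, x2)
beta0 <- c(coef(m0), 0, 0)
p <- plogis(as.vector(X %*% beta0))
U <- t(X) %*% (y - p)
I_mat <- t(X) %*% (X * (p * (1 - p)))   # X' diag(p(1-p)) X, without forming diag()
manual_rao <- as.numeric(t(U) %*% solve(I_mat) %*% U)

all.equal(builtin_rao, manual_rao)      # should agree closely
```

Multiplying X by the weight vector row-wise avoids building the full n-by-n diagonal matrix, which matters for large n.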
Interpreting Results and Best Practices
When reading score test output, focus on the test statistic and p-value. The statistic follows a chi-square distribution under the null, with degrees of freedom equal to the number of restrictions tested.
Comparing the three classical tests:
# Fit models
null_mod <- glm(y ~ x1, family = binomial, data = data)
alt_mod <- glm(y ~ x1 + x2, family = binomial, data = data)
# Likelihood ratio test
lr_stat <- as.numeric(-2 * (logLik(null_mod) - logLik(alt_mod)))
# Wald test (from full model)
wald_stat <- (coef(alt_mod)["x2"] / sqrt(vcov(alt_mod)["x2", "x2"]))^2
# Score test
score_result <- anova(null_mod, alt_mod, test = "Rao")
score_stat <- score_result$Rao[2]
cat("LR statistic:", lr_stat, "\n")
cat("Wald statistic:", wald_stat, "\n")
cat("Score statistic:", score_stat, "\n")
The three statistics typically agree closely with large samples. Disagreement signals either small sample issues or model misspecification—investigate further before drawing conclusions.
Reporting guidelines:
- State which test you used and why
- Report the test statistic, degrees of freedom, and p-value
- For borderline results, consider reporting all three tests
- Acknowledge that score tests may have lower power in small samples
Common pitfalls:
- Ignoring model assumptions: Score tests inherit the assumptions of your likelihood specification. A significant overdispersion test doesn’t help if your link function is wrong.
- Multiple testing: When screening many predictors, adjust for multiple comparisons or use the score tests as exploratory tools only.
- Small samples: Asymptotic chi-square approximations can be poor with limited data. Consider exact tests or bootstrap procedures.
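For the small-sample case, a parametric bootstrap is a simple alternative to the chi-square approximation: simulate data under the null fit, recompute the score statistic each time, and take the p-value from the simulated distribution. A sketch for the binomial proportion test (the sample values here are made up to illustrate a small-n setting):

```r
# Score statistic for a binomial proportion, as derived earlier
score_stat <- function(x, n, p0) {
  (x / p0 - (n - x) / (1 - p0))^2 / (n / (p0 * (1 - p0)))
}

set.seed(7)
x <- 9; n <- 12; p0 <- 0.5           # small sample: asymptotics are shaky
observed <- score_stat(x, n, p0)

# Simulate the statistic's null distribution instead of trusting chi-square
boot_stats <- replicate(10000, score_stat(rbinom(1, n, p0), n, p0))
boot_p <- mean(boot_stats >= observed)

c(asymptotic_p = pchisq(observed, df = 1, lower.tail = FALSE),
  bootstrap_p = boot_p)
```

With n = 12 the two p-values differ noticeably, which is exactly the situation where the bootstrap version is worth reporting.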
Score tests deserve a place in your statistical toolkit. They’re not always the best choice, but when you need to test hypotheses efficiently without fitting complex alternative models, they’re invaluable.