How to Perform a Likelihood Ratio Test in R

Key Insights

  • The likelihood ratio test compares nested models by examining whether additional parameters significantly improve model fit, using a chi-square distributed test statistic with degrees of freedom equal to the difference in parameters between models.
  • Base R provides all the tools you need with logLik() and pchisq(), but the lmtest package’s lrtest() function streamlines the process for most regression models.
  • For mixed-effects models, you must fit with maximum likelihood (ML) rather than REML when comparing models with different fixed effects, and the standard chi-square approximation can be conservative for random effects.

Introduction to Likelihood Ratio Tests

The likelihood ratio test (LRT) answers a fundamental question in statistical modeling: does adding complexity to your model provide a meaningful improvement in fit? When you’re deciding whether to include additional predictors, interaction terms, or random effects, the LRT gives you a principled way to make that decision.

Unlike Wald tests, which rely on asymptotic approximations of standard errors, the LRT directly compares how well each model explains the observed data. This makes it more reliable in small samples, and its result doesn’t depend on how the model is parameterized. The LRT also tends to have better finite-sample properties than score tests, making it the preferred choice of many applied statisticians.

The key requirement is that your models must be nested—the simpler model must be a special case of the more complex one. You can’t use an LRT to compare a model with predictors A and B against one with predictors C and D. But you can compare a model with just A against one with A and B.

The Mathematical Foundation

The LRT statistic has an elegant form. Given two nested models, the test statistic is:

$$\Lambda = -2 \log\left(\frac{L_0}{L_1}\right) = -2(\ell_0 - \ell_1)$$

Where $L_0$ is the likelihood of the reduced model, $L_1$ is the likelihood of the full model, and $\ell$ represents log-likelihoods. Since the full model always fits at least as well as the reduced model, this statistic is always non-negative.

Under the null hypothesis that the reduced model is correct, this statistic follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between models. If your full model has 5 parameters and your reduced model has 3, you’re working with 2 degrees of freedom.

The intuition is straightforward: if the additional parameters in the full model don’t improve fit beyond what you’d expect by chance, the likelihood ratio will be close to 1, and the test statistic will be close to 0. Large values indicate the full model provides a meaningfully better fit.
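To make the arithmetic concrete, here is the formula applied to a pair of made-up log-likelihoods (the numbers are purely illustrative):

```r
# Illustrative values only: log-likelihoods of a reduced and a full model
ll_reduced <- -310.2
ll_full <- -307.1

# Test statistic: -2 * (ell_0 - ell_1)
lambda <- -2 * (ll_reduced - ll_full)
lambda  # 6.2

# P-value against chi-square with df = 2 (two extra parameters)
pchisq(lambda, df = 2, lower.tail = FALSE)  # about 0.045
```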

Performing LRT with Base R

You don’t need any packages to perform a likelihood ratio test. Base R’s logLik() function extracts log-likelihoods from fitted model objects, and pchisq() computes p-values from the chi-square distribution.

# Generate example data
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)  # This predictor has no true effect
y <- 2 + 0.5 * x1 - 0.8 * x2 + rnorm(n, sd = 1.5)

# Fit nested models
model_reduced <- glm(y ~ x1 + x2, family = gaussian)
model_full <- glm(y ~ x1 + x2 + x3, family = gaussian)

# Extract log-likelihoods
ll_reduced <- logLik(model_reduced)
ll_full <- logLik(model_full)

# Calculate test statistic
test_stat <- -2 * (as.numeric(ll_reduced) - as.numeric(ll_full))

# Degrees of freedom: difference in number of parameters
df_diff <- attr(ll_full, "df") - attr(ll_reduced, "df")

# Calculate p-value
p_value <- pchisq(test_stat, df = df_diff, lower.tail = FALSE)

cat("Test statistic:", round(test_stat, 4), "\n")
cat("Degrees of freedom:", df_diff, "\n")
cat("P-value:", round(p_value, 4), "\n")

This outputs the test statistic, degrees of freedom, and p-value. Since x3 has no true effect in our data-generating process, you’ll typically see a non-significant p-value, correctly indicating that the additional parameter doesn’t improve model fit.

The manual approach gives you complete control and transparency. You see exactly what’s being computed, which matters when you’re debugging or teaching.
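If you run this comparison often, the manual steps wrap naturally into a small helper. This is a sketch, not a base R function — `lrt_manual` is a name introduced here:

```r
# Minimal wrapper around the manual LRT steps (lrt_manual is a hypothetical name)
lrt_manual <- function(reduced, full) {
  ll0 <- logLik(reduced)
  ll1 <- logLik(full)
  stat <- -2 * (as.numeric(ll0) - as.numeric(ll1))
  df <- attr(ll1, "df") - attr(ll0, "df")
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

# Usage with a pair of nested linear models
set.seed(1)
x <- rnorm(100); z <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
lrt_manual(lm(y ~ x), lm(y ~ x + z))
```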

Using the lmtest Package

For everyday use, the lmtest package provides lrtest(), which handles the boilerplate for you. It works with most model objects that have a logLik() method.

library(lmtest)

# Same models as before
model_reduced <- lm(y ~ x1 + x2)
model_full <- lm(y ~ x1 + x2 + x3)

# One-line LRT
lrtest(model_reduced, model_full)

The output includes both models’ log-likelihoods, the test statistic, degrees of freedom, and p-value in a clean table format.
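Because lrtest() returns an anova-style data frame, you can also pull individual numbers out programmatically, which is handy in scripted analyses (the column names below are the ones lmtest prints):

```r
library(lmtest)

# Self-contained setup mirroring the earlier example
set.seed(42)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 2 + 0.5 * x1 - 0.8 * x2 + rnorm(n, sd = 1.5)

res <- lrtest(lm(y ~ x1 + x2), lm(y ~ x1 + x2 + x3))

# Row 2 holds the comparison between the two models
res$Chisq[2]          # test statistic
res$`Pr(>Chisq)`[2]   # p-value
```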

For logistic regression, the workflow is identical:

# Binary outcome data
set.seed(123)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)  # No true effect
prob <- plogis(-0.5 + 0.8 * x1 - 0.6 * x2)
y_binary <- rbinom(n, 1, prob)

# Fit logistic regression models
logit_reduced <- glm(y_binary ~ x1 + x2, family = binomial)
logit_full <- glm(y_binary ~ x1 + x2 + x3, family = binomial)

# Perform LRT
lrtest(logit_reduced, logit_full)

# You can also compare multiple nested models at once
logit_null <- glm(y_binary ~ 1, family = binomial)
logit_x1 <- glm(y_binary ~ x1, family = binomial)

lrtest(logit_null, logit_x1, logit_reduced, logit_full)

The multi-model comparison is particularly useful for building up complexity incrementally and seeing where the significant improvements occur.

LRT for Mixed-Effects Models

Mixed-effects models require special attention. The lme4 package fits these models, and the anova() function performs likelihood ratio tests when given multiple model objects.

Critical point: when comparing models with different fixed effects, you must use maximum likelihood (ML) estimation, not restricted maximum likelihood (REML). REML likelihoods aren’t comparable across models with different fixed effects.

library(lme4)

# Simulated longitudinal data
set.seed(456)
n_subjects <- 50
n_obs <- 6
subject <- rep(1:n_subjects, each = n_obs)
time <- rep(0:(n_obs - 1), n_subjects)
treatment <- rep(c(0, 1), each = n_subjects * n_obs / 2)

# Random intercepts for subjects
random_intercepts <- rnorm(n_subjects, sd = 2)
y_long <- 10 + 0.5 * time + 1.2 * treatment + 
          random_intercepts[subject] + rnorm(n_subjects * n_obs)

data_long <- data.frame(y = y_long, time, treatment, subject = factor(subject))

# Fit models with ML (not REML) for fixed effects comparison
model_time <- lmer(y ~ time + (1 | subject), data = data_long, REML = FALSE)
model_full <- lmer(y ~ time + treatment + (1 | subject), data = data_long, REML = FALSE)

# LRT via anova
anova(model_time, model_full)

For comparing random effects structures, the situation is more nuanced. The null hypothesis places the variance parameter on the boundary of the parameter space (variance can’t be negative), so the chi-square approximation is conservative. Some statisticians recommend halving the p-value for single variance component tests:

# Comparing random effects structures
model_ri <- lmer(y ~ time + treatment + (1 | subject), data = data_long, REML = TRUE)
model_rs <- lmer(y ~ time + treatment + (time | subject), data = data_long, REML = TRUE)

# Standard anova (conservative for random effects)
anova(model_ri, model_rs, refit = FALSE)

The refit = FALSE argument keeps the REML fit when comparing random effects structures, which is appropriate here since the fixed effects are identical.
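For testing a single extra variance component (say, a model with versus without a random intercept), the boundary argument implies the reference distribution is a 50:50 mixture of $\chi^2_0$ and $\chi^2_1$, which amounts to halving the naive p-value. A sketch, assuming you've read the observed chi-square statistic off the anova() output; note the random-slope comparison above adds two parameters, where the exact mixture is more involved:

```r
# Boundary-corrected p-value for ONE extra variance component.
# Naive reference: chi-square with df = 1; the 50:50 mixture with a
# point mass at zero halves the naive upper-tail probability.
boundary_p <- function(chisq_stat) {
  0.5 * pchisq(chisq_stat, df = 1, lower.tail = FALSE)
}

boundary_p(3.2)  # naive p would be about 0.074; corrected, about 0.037
```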

Interpreting Results and Assumptions

When reading LRT output, focus on three things: the test statistic magnitude, degrees of freedom, and p-value. A significant p-value (typically < 0.05) suggests the full model provides a meaningfully better fit than the reduced model.

But don’t interpret this mechanically. Consider:

Sample size matters. With large samples, even trivially small effects become statistically significant. A significant LRT doesn’t mean the additional parameters are practically important—just that they’re detectably different from zero.

Models must be truly nested. The reduced model must be obtainable by setting specific parameters in the full model to fixed values (usually zero). If you’re unsure, write out both model equations and verify.

Same data requirement. Both models must be fit to identical observations. If one model drops cases due to missing values in a variable not included in the other model, your comparison is invalid. Always check that nobs() returns the same value for both models.

Distributional assumptions. The chi-square approximation assumes correct model specification and sufficient sample size. For small samples or complex models, consider bootstrap-based alternatives.
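The same-data requirement is the easiest one to violate by accident. A small sketch of how missing values can silently change the sample, and the nobs() check that catches it:

```r
# Missing values in one predictor silently shrink the sample for one model
set.seed(7)
d <- data.frame(x = rnorm(50), z = rnorm(50))
d$z[1:5] <- NA                       # z has missing values; x and y do not
d$y <- 1 + d$x + rnorm(50)

m_small <- lm(y ~ x, data = d)       # fit on all 50 rows
m_big   <- lm(y ~ x + z, data = d)   # lm() drops the 5 incomplete rows

nobs(m_small)  # 50
nobs(m_big)    # 45 -- an LRT between these two fits would be invalid

# Fix: restrict both fits to the complete cases before comparing
d_cc <- na.omit(d)
m_small_cc <- lm(y ~ x, data = d_cc)
m_big_cc   <- lm(y ~ x + z, data = d_cc)
stopifnot(nobs(m_small_cc) == nobs(m_big_cc))
```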

Practical Example: Model Selection Workflow

Let’s walk through a realistic model selection scenario using the mtcars dataset. We want to model fuel efficiency and determine which predictors meaningfully contribute.

library(lmtest)

data(mtcars)

# Start with a baseline model
m0 <- lm(mpg ~ 1, data = mtcars)  # Intercept only

# Add predictors incrementally based on theory
m1 <- lm(mpg ~ wt, data = mtcars)  # Weight
m2 <- lm(mpg ~ wt + hp, data = mtcars)  # Add horsepower
m3 <- lm(mpg ~ wt + hp + am, data = mtcars)  # Add transmission type
m4 <- lm(mpg ~ wt + hp + am + wt:am, data = mtcars)  # Add interaction

# Sequential likelihood ratio tests
cat("=== Model Comparison Results ===\n\n")

cat("M0 vs M1 (adding weight):\n")
print(lrtest(m0, m1))

cat("\nM1 vs M2 (adding horsepower):\n")
print(lrtest(m1, m2))

cat("\nM2 vs M3 (adding transmission):\n")
print(lrtest(m2, m3))

cat("\nM3 vs M4 (adding weight × transmission interaction):\n")
print(lrtest(m3, m4))

# Summary interpretation
cat("\n=== Interpretation ===\n")
cat("Based on LRT results at alpha = 0.05:\n")
cat("- Weight significantly improves fit over intercept-only\n")
cat("- Horsepower provides additional explanatory power\n")
cat("- Check p-values for transmission and interaction terms\n")

# Final model summary
cat("\n=== Selected Model ===\n")
summary(m3)  # Adjust based on your results

This workflow demonstrates how to build models incrementally, testing each addition. In practice, you’d combine LRT results with domain knowledge, effect size considerations, and diagnostic checks before settling on a final model.

The likelihood ratio test is one tool among many for model comparison. Use it when you have nested models and want a formal hypothesis test. For non-nested comparisons, turn to information criteria like AIC or BIC. For prediction-focused applications, cross-validation often provides more relevant guidance than any hypothesis test.
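For the non-nested case, AIC() and BIC() accept multiple fitted models directly. A quick sketch with mtcars — these two models share no nesting relationship, so an LRT between them would be invalid:

```r
data(mtcars)

# Two non-nested models of fuel efficiency
m_wt   <- lm(mpg ~ wt, data = mtcars)
m_disp <- lm(mpg ~ disp, data = mtcars)

# Compare by information criteria instead of an LRT; lower is better
AIC(m_wt, m_disp)
BIC(m_wt, m_disp)
```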
