How to Perform a Likelihood Ratio Test in Python

Key Insights

The likelihood ratio test compares nested models by measuring how much the log-likelihood improves when adding parameters, with the test statistic following a chi-squared distribution under the null hypothesis.
Python’s statsmodels library provides log-likelihood values directly from fitted models, making LRT implementation straightforward with just a few lines of code.
Always verify your models are truly nested before applying LRT—the simpler model’s parameters must be a subset of the complex model’s parameters, or your results will be meaningless.

Introduction to Likelihood Ratio Tests

The likelihood ratio test (LRT) answers a fundamental question in statistical modeling: does adding complexity to my model provide a statistically significant improvement in fit? When you’re deciding whether to include additional predictors in a logistic regression, comparing Poisson models with different covariates, or evaluating any pair of nested models, LRT gives you a principled way to make that decision.

Unlike information criteria (AIC, BIC) that provide relative comparisons, LRT produces a p-value you can use for formal hypothesis testing. The null hypothesis states that the simpler model is sufficient; the alternative hypothesis claims the more complex model provides a significantly better fit.

Common applications include testing whether a group of variables improves a logistic regression model, comparing linear mixed models with different random effects structures, and validating feature selection decisions in predictive modeling pipelines.

The Mathematics Behind LRT

The test statistic has an elegant form. Given a null model with likelihood L₀ and an alternative model with likelihood L₁, the likelihood ratio test statistic is:

LRT = -2 × log(L₀/L₁) = -2 × (log(L₀) - log(L₁)) = 2 × (LL₁ - LL₀)

Here, LL represents the log-likelihood. Since the alternative model has more parameters and can fit the data at least as well, LL₁ ≥ LL₀, making the test statistic non-negative.

Under the null hypothesis (the simpler model is correct), this statistic follows a chi-squared distribution. The degrees of freedom equal the difference in the number of parameters between models. If your null model has 3 parameters and your alternative has 5, you use χ² with 2 degrees of freedom.

This chi-squared approximation relies on asymptotic theory, meaning it works well with large samples but can be unreliable with small datasets or when parameters are near boundary values.

Setting Up Your Python Environment

You need three libraries for LRT implementation:

pip install numpy scipy statsmodels

Here’s the foundation for our examples:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import logit, ols, poisson

# Set random seed for reproducibility
np.random.seed(42)

# Create sample dataset
n = 500
data = pd.DataFrame({
    'x1': np.random.normal(0, 1, n),
    'x2': np.random.normal(0, 1, n),
    'x3': np.random.normal(0, 1, n),  # Noise variable
    'x4': np.random.normal(0, 1, n),  # Noise variable
})

# Generate binary outcome with true relationship to x1 and x2 only
linear_pred = -0.5 + 0.8 * data['x1'] + 1.2 * data['x2']
prob = 1 / (1 + np.exp(-linear_pred))
data['y'] = np.random.binomial(1, prob)

print(f"Dataset shape: {data.shape}")
print(f"Outcome distribution: {data['y'].value_counts().to_dict()}")

This synthetic dataset has a binary outcome influenced by x1 and x2, while x3 and x4 are noise. We’ll use LRT to detect which variables matter.

LRT for Logistic Regression Models

Let’s implement a complete likelihood ratio test comparing two logistic regression models. The null model includes only x1, while the alternative adds x2:

# Fit the null model (simpler)
model_null = logit('y ~ x1', data=data).fit(disp=0)

# Fit the alternative model (more complex)
model_alt = logit('y ~ x1 + x2', data=data).fit(disp=0)

# Extract log-likelihoods
ll_null = model_null.llf
ll_alt = model_alt.llf

# Calculate LRT statistic
lrt_statistic = 2 * (ll_alt - ll_null)

# Degrees of freedom = difference in number of parameters
df = model_alt.df_model - model_null.df_model

# Calculate p-value from chi-squared distribution
p_value = stats.chi2.sf(lrt_statistic, df)

print(f"Null model log-likelihood: {ll_null:.4f}")
print(f"Alternative model log-likelihood: {ll_alt:.4f}")
print(f"LRT statistic: {lrt_statistic:.4f}")
print(f"Degrees of freedom: {df}")
print(f"P-value: {p_value:.6f}")

Running this produces output like:

Null model log-likelihood: -306.2847
Alternative model log-likelihood: -267.1923
LRT statistic: 78.1848
Degrees of freedom: 1
P-value: 0.000000

The tiny p-value correctly identifies that x2 significantly improves the model. Now let’s test whether adding the noise variables helps:

# Compare: model with x1, x2 vs model with x1, x2, x3, x4
model_base = logit('y ~ x1 + x2', data=data).fit(disp=0)
model_full = logit('y ~ x1 + x2 + x3 + x4', data=data).fit(disp=0)

lrt_stat = 2 * (model_full.llf - model_base.llf)
df = model_full.df_model - model_base.df_model
p_val = stats.chi2.sf(lrt_stat, df)

print(f"LRT statistic: {lrt_stat:.4f}")
print(f"Degrees of freedom: {df}")
print(f"P-value: {p_val:.4f}")

This should yield a non-significant p-value (typically > 0.05), confirming that x3 and x4 don’t improve the model—exactly what we’d expect given how we generated the data.

LRT for Other Model Types

The same logic applies to any model that reports log-likelihood. Here’s a reusable function that works across model types:

def likelihood_ratio_test(model_null, model_alt, df=None):
    """
    Perform likelihood ratio test comparing two nested models.
    
    Parameters
    ----------
    model_null : fitted statsmodels model
        The simpler (nested) model under the null hypothesis
    model_alt : fitted statsmodels model  
        The more complex model under the alternative hypothesis
    df : int, optional
        Degrees of freedom. If None, calculated from models.
        
    Returns
    -------
    dict with test statistic, degrees of freedom, and p-value
    """
    ll_null = model_null.llf
    ll_alt = model_alt.llf
    
    # Sanity check: alternative should fit at least as well
    if ll_alt < ll_null:
        raise ValueError(
            "Alternative model has lower log-likelihood than null. "
            "Models may not be properly nested."
        )
    
    lrt_statistic = 2 * (ll_alt - ll_null)
    
    if df is None:
        # Works for GLMs; may need adjustment for other model types
        df = model_alt.df_model - model_null.df_model
    
    if df <= 0:
        raise ValueError(
            f"Degrees of freedom must be positive, got {df}. "
            "Check that models are correctly specified."
        )
    
    p_value = stats.chi2.sf(lrt_statistic, df)
    
    return {
        'statistic': lrt_statistic,
        'df': df,
        'p_value': p_value,
        'll_null': ll_null,
        'll_alt': ll_alt
    }

Here’s the function applied to Poisson regression:

# Generate count data
data['count'] = np.random.poisson(
    np.exp(0.5 + 0.3 * data['x1'] + 0.5 * data['x2']), 
    n
)

# Fit Poisson models
pois_null = poisson('count ~ x1', data=data).fit(disp=0)
pois_alt = poisson('count ~ x1 + x2', data=data).fit(disp=0)

result = likelihood_ratio_test(pois_null, pois_alt)
print(f"Poisson LRT: statistic={result['statistic']:.4f}, "
      f"df={result['df']}, p={result['p_value']:.6f}")

For linear regression with OLS:

# Generate continuous outcome
data['y_cont'] = 2 + 1.5 * data['x1'] + 0.8 * data['x2'] + np.random.normal(0, 1, n)

# Fit OLS models
ols_null = ols('y_cont ~ x1', data=data).fit()
ols_alt = ols('y_cont ~ x1 + x2', data=data).fit()

result = likelihood_ratio_test(ols_null, ols_alt)
print(f"OLS LRT: statistic={result['statistic']:.4f}, "
      f"df={result['df']}, p={result['p_value']:.6f}")

Interpreting Results and Common Pitfalls

A low p-value (typically < 0.05) means you reject the null hypothesis—the additional parameters in the complex model provide a statistically significant improvement. A high p-value suggests the simpler model is adequate.

Here’s a helper function for clean interpretation:

def format_lrt_results(result, alpha=0.05, model_names=None):
    """
    Format LRT results with interpretation.
    
    Parameters
    ----------
    result : dict
        Output from likelihood_ratio_test()
    alpha : float
        Significance level for interpretation
    model_names : tuple of str, optional
        Names for (null_model, alt_model)
    """
    if model_names is None:
        model_names = ('Null model', 'Alternative model')
    
    print("=" * 50)
    print("LIKELIHOOD RATIO TEST RESULTS")
    print("=" * 50)
    print(f"\n{model_names[0]} log-likelihood: {result['ll_null']:.4f}")
    print(f"{model_names[1]} log-likelihood: {result['ll_alt']:.4f}")
    print(f"\nTest statistic (χ²): {result['statistic']:.4f}")
    print(f"Degrees of freedom: {result['df']}")
    print(f"P-value: {result['p_value']:.6f}")
    print(f"\nSignificance level: α = {alpha}")
    
    if result['p_value'] < alpha:
        print(f"\nConclusion: REJECT null hypothesis (p < {alpha})")
        print(f"The {model_names[1].lower()} provides a significantly")
        print("better fit than the null model.")
    else:
        print(f"\nConclusion: FAIL TO REJECT null hypothesis (p ≥ {alpha})")
        print(f"No significant improvement from the {model_names[1].lower()}.")
        print("The simpler model is preferred.")
    print("=" * 50)

# Usage
result = likelihood_ratio_test(model_null, model_alt)
format_lrt_results(result, model_names=('Reduced model', 'Full model'))

Critical pitfalls to avoid:

Non-nested models: LRT requires the null model to be a special case of the alternative. You cannot compare a model with x1 + x2 to one with x1 + x3—neither is nested within the other.
Small samples: The chi-squared approximation breaks down with small samples. Consider bootstrap methods or exact tests when n < 100.
Boundary parameters: When testing whether a variance component equals zero (common in mixed models), the standard chi-squared distribution is incorrect. Use mixture distributions or specialized tests.
Different data subsets: Both models must be fit on identical data. Missing values handled differently between models will invalidate the test.

Conclusion

The likelihood ratio test is a workhorse for model comparison in statistical practice. The implementation is straightforward: extract log-likelihoods from your fitted models, compute the test statistic as twice the difference, and compare against a chi-squared distribution with degrees of freedom equal to the parameter difference.

When should you prefer LRT over alternatives? Use LRT when you have reasonably large samples and want a formal hypothesis test. The Wald test (reported by default in most regression output) is computationally simpler but can be unreliable when parameters are far from zero or sample sizes are modest. The Score test doesn’t require fitting the alternative model, which matters for computationally expensive models, but LRT typically has better power.

For model selection without formal hypothesis testing, information criteria like AIC and BIC remain valuable—they don’t require nested models and naturally penalize complexity. But when you need a p-value to justify including specific variables, the likelihood ratio test delivers.