How to Perform a Score Test in Python

Key Insights

  • Score tests evaluate hypotheses using only the null model, making them computationally efficient and reliable when the alternative model is difficult to fit or may not converge.
  • Unlike Wald tests that can behave poorly near parameter boundaries, score tests maintain correct Type I error rates even in challenging scenarios like small samples or parameters near zero.
  • Python’s statsmodels provides built-in score test functionality, but understanding the manual calculation reveals why these tests work and when to trust them.

Introduction to Score Tests

The score test, also known as the Lagrange multiplier test, is one of three classical approaches to hypothesis testing in maximum likelihood estimation. While the Wald test and likelihood ratio test get more attention, the score test has a crucial advantage: it only requires fitting the null model.

Consider testing whether a new predictor improves your logistic regression model. The Wald test requires fitting the full model and examining coefficient standard errors. The likelihood ratio test needs both null and alternative models fitted. The score test asks a different question: “At the null model’s parameter estimates, how much does the likelihood want to move toward the alternative?”

This makes score tests invaluable when the alternative model is computationally expensive, fails to converge, or sits near parameter boundaries where Wald tests become unreliable.

Mathematical Foundation

The score function is the gradient of the log-likelihood with respect to parameters:

$$U(\theta) = \frac{\partial \ell(\theta)}{\partial \theta}$$

At the maximum likelihood estimate, the score equals zero—that’s the definition of a maximum. But under the null hypothesis, if we’ve constrained some parameters, the score for those parameters typically won’t be zero. Large score values suggest the constraint is wrong.
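A toy numeric check makes this concrete. The sketch below uses a normal-mean model with known unit variance (chosen purely for illustration, not part of the examples that follow): the score is exactly zero at the MLE, the sample mean, and far from zero at a constrained value:

```python
import numpy as np

# Normal mean with known variance 1: log-likelihood gradient is sum(y - theta)
rng = np.random.default_rng(3)
y = rng.normal(2.0, 1.0, size=200)

def score(theta):
    # d/dtheta of sum log N(y | theta, 1)
    return np.sum(y - theta)

mle = y.mean()
print(score(mle))   # essentially zero at the MLE
print(score(0.0))   # large: the likelihood "wants" to move toward the mean
```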

The Fisher information matrix captures how much information the data contains about parameters:

$$I(\theta) = -E\left[\frac{\partial^2 \ell(\theta)}{\partial \theta \partial \theta^T}\right]$$

The score test statistic combines these:

$$S = U(\hat{\theta}_0)^T I(\hat{\theta}_0)^{-1} U(\hat{\theta}_0)$$

Under the null hypothesis, this follows a chi-squared distribution with degrees of freedom equal to the number of constrained parameters. The intuition: we’re measuring how far the score deviates from zero, scaled by its variance.
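For a single Bernoulli parameter the pieces collapse into a familiar formula. With $H_0: p = p_0$, the score is $U = (\sum y_i - n p_0)/(p_0(1-p_0))$ and the information is $I = n/(p_0(1-p_0))$, so $S$ reduces to $n(\hat{p} - p_0)^2 / (p_0(1-p_0))$, the squared one-sample proportion z-statistic. A minimal sketch (illustrative data, not used later):

```python
import numpy as np
from scipy import stats

# Score test for H0: p = p0 in an i.i.d. Bernoulli(p) model
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.5, size=100)   # true p = 0.5

p0 = 0.4
n = y.size
U = (y.sum() - n * p0) / (p0 * (1 - p0))   # score at the null
I = n / (p0 * (1 - p0))                    # Fisher information at the null
S = U**2 / I                               # = n*(p_hat - p0)^2 / (p0*(1-p0))
p_value = stats.chi2.sf(S, df=1)

print(f"S = {S:.3f}, p = {p_value:.4f}")
```

Note that both $U$ and $I$ are evaluated at $p_0$, never at $\hat{p}$; that is what makes this a score test rather than a Wald test.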

Score Test for Logistic Regression

Let’s implement a score test to determine whether adding a predictor improves a logistic regression model. We’ll use statsmodels, which provides built-in score test functionality.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# Generate sample data
np.random.seed(42)
n = 200

X1 = np.random.normal(0, 1, n)
X2 = np.random.normal(0, 1, n)
X3 = np.random.normal(0, 1, n)  # Variable we want to test

# True model: only X1 and X2 matter
linear_pred = -0.5 + 0.8 * X1 + 0.6 * X2
prob = 1 / (1 + np.exp(-linear_pred))
y = np.random.binomial(1, prob, n)

# Create design matrices
X_null = sm.add_constant(np.column_stack([X1, X2]))
X_full = sm.add_constant(np.column_stack([X1, X2, X3]))

# Fit the null model
null_model = sm.Logit(y, X_null).fit(disp=False)

# Perform score test for adding X3
# We need to specify which parameters to test
score_result = null_model.score_test(
    exog_extra=X3.reshape(-1, 1)
)

print("Score Test Results")
print(f"Test statistic: {score_result[0]:.4f}")
print(f"P-value: {score_result[1]:.4f}")
print(f"Degrees of freedom: {score_result[2]}")

The score_test method accepts exog_extra, the additional variables you’re testing. Since X3 has no true effect in our data-generating process, we expect a non-significant p-value.

For testing multiple coefficients simultaneously:

# Test whether X2 and X3 jointly improve the model with only X1
X_reduced = sm.add_constant(X1)
reduced_model = sm.Logit(y, X_reduced).fit(disp=False)

extra_vars = np.column_stack([X2, X3])
joint_score_test = reduced_model.score_test(exog_extra=extra_vars)

print(f"\nJoint Score Test for X2 and X3")
print(f"Test statistic: {joint_score_test[0]:.4f}")
print(f"P-value: {joint_score_test[1]:.4f}")

Manual Implementation from Scratch

Understanding the mechanics requires building the test yourself. Here’s a complete implementation for logistic regression:

def manual_score_test(y, X_null, X_test):
    """
    Perform score test for adding X_test variables to a logistic regression.
    
    Parameters:
    -----------
    y : array-like, binary outcome
    X_null : array-like, null model design matrix (include intercept)
    X_test : array-like, additional variables to test
    
    Returns:
    --------
    dict with test statistic, p-value, and degrees of freedom
    """
    # Coerce inputs so a 1-D X_test also works
    y = np.asarray(y)
    X_null = np.asarray(X_null)
    X_test = np.asarray(X_test)
    if X_test.ndim == 1:
        X_test = X_test.reshape(-1, 1)

    # Fit null model
    null_model = sm.Logit(y, X_null).fit(disp=False)
    beta_null = null_model.params
    
    # Predicted probabilities under null
    linear_pred = X_null @ beta_null
    p_hat = 1 / (1 + np.exp(-linear_pred))
    
    # Residuals under null model
    residuals = y - p_hat
    
    # Score vector for test variables
    # U = X_test.T @ (y - p_hat)
    score_vector = X_test.T @ residuals
    
    # Weights for the Fisher information; keep them as a vector rather
    # than forming an n x n diagonal matrix, which wastes memory for large n
    w = p_hat * (1 - p_hat)
    
    # Full design matrix
    X_full = np.column_stack([X_null, X_test])
    
    # Fisher information matrix (full model, evaluated at null)
    I_full = X_full.T @ (w[:, None] * X_full)
    
    # Fisher information for null parameters only
    n_null = X_null.shape[1]
    I_null = I_full[:n_null, :n_null]
    
    # Cross information
    I_cross = I_full[:n_null, n_null:]
    
    # Information for test parameters
    I_test = I_full[n_null:, n_null:]
    
    # Efficient information matrix for test parameters
    # This accounts for estimation of null parameters
    I_efficient = I_test - I_cross.T @ np.linalg.solve(I_null, I_cross)
    
    # Score test statistic
    test_stat = score_vector @ np.linalg.solve(I_efficient, score_vector)
    
    # Degrees of freedom
    df = X_test.shape[1] if X_test.ndim > 1 else 1
    
    # P-value from the chi-squared survival function (more accurate
    # than 1 - cdf when p-values are very small)
    p_value = stats.chi2.sf(test_stat, df)
    
    return {
        'statistic': test_stat,
        'p_value': p_value,
        'df': df,
        'score_vector': score_vector
    }

# Test our manual implementation
result = manual_score_test(y, X_null, X3.reshape(-1, 1))
print("Manual Score Test Results")
print(f"Test statistic: {result['statistic']:.4f}")
print(f"P-value: {result['p_value']:.4f}")

The key insight is the efficient information matrix. We can’t simply use the information for the test parameters alone—we must account for the fact that null parameters are estimated, not known.
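This partitioning is a Schur complement, and a handy sanity check is the standard block-inverse identity: the test-parameter block of the inverse of the full information equals the inverse of the efficient information. A small sketch with an arbitrary positive-definite matrix standing in for the information:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
I_full = A @ A.T + 5 * np.eye(5)   # symmetric positive definite stand-in

n_null = 3
I_null = I_full[:n_null, :n_null]
I_cross = I_full[:n_null, n_null:]
I_test = I_full[n_null:, n_null:]

# Efficient information = Schur complement of the null block
I_eff = I_test - I_cross.T @ np.linalg.solve(I_null, I_cross)

# Block-inverse identity: the test block of inv(I_full) equals inv(I_eff)
block_of_inverse = np.linalg.inv(I_full)[n_null:, n_null:]
print(np.allclose(np.linalg.inv(I_eff), block_of_inverse))  # True
```

The naive statistic using `I_test` alone would use too small a variance and reject too often; the Schur complement corrects for the null parameters being estimated.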

Score Test for Poisson Regression

Score tests extend naturally to other generalized linear models. For count data, testing for overdispersion or variable significance follows the same logic:

from statsmodels.discrete.discrete_model import Poisson

# Generate Poisson regression data
np.random.seed(123)
n = 300

X1 = np.random.uniform(0, 2, n)
X2 = np.random.normal(0, 1, n)
X3 = np.random.normal(0, 1, n)  # No true effect

# True model
log_mu = 0.5 + 0.4 * X1 + 0.3 * X2
mu = np.exp(log_mu)
y_count = np.random.poisson(mu)

# Fit null model
X_null_pois = sm.add_constant(np.column_stack([X1, X2]))
null_poisson = Poisson(y_count, X_null_pois).fit(disp=False)

# Score test for X3
score_pois = null_poisson.score_test(exog_extra=X3.reshape(-1, 1))

print("Poisson Regression Score Test")
print(f"Testing addition of X3")
print(f"Test statistic: {score_pois[0]:.4f}")
print(f"P-value: {score_pois[1]:.4f}")

# Test a variable with true effect
# Remove X2 from null model and test for it
X_reduced_pois = sm.add_constant(X1)
reduced_poisson = Poisson(y_count, X_reduced_pois).fit(disp=False)

score_x2 = reduced_poisson.score_test(exog_extra=X2.reshape(-1, 1))
print(f"\nTesting addition of X2 (true effect)")
print(f"Test statistic: {score_x2[0]:.4f}")
print(f"P-value: {score_x2[1]:.4f}")

Comparing Score, Wald, and Likelihood Ratio Tests

The three classical tests are asymptotically equivalent but can diverge substantially in finite samples. Here’s a practical comparison:

def compare_tests(y, X_null, X_full, test_indices):
    """Compare Score, Wald, and LR tests for specified coefficients."""
    
    # Fit both models
    null_model = sm.Logit(y, X_null).fit(disp=False)
    full_model = sm.Logit(y, X_full).fit(disp=False)
    
    n_test = len(test_indices)
    
    # Likelihood Ratio Test
    lr_stat = 2 * (full_model.llf - null_model.llf)
    lr_pval = stats.chi2.sf(lr_stat, n_test)
    
    # Wald Test (from full model)
    R = np.zeros((n_test, len(full_model.params)))
    for i, idx in enumerate(test_indices):
        R[i, idx] = 1
    
    # scalar=True returns plain floats and avoids a FutureWarning
    wald_result = full_model.wald_test(R, scalar=True)
    wald_stat = wald_result.statistic
    wald_pval = wald_result.pvalue
    
    # Score Test
    extra_cols = X_full[:, test_indices]
    score_result = null_model.score_test(exog_extra=extra_cols)
    score_stat = score_result[0]
    score_pval = score_result[1]
    
    return {
        'LR': {'stat': lr_stat, 'pval': lr_pval},
        'Wald': {'stat': float(wald_stat), 'pval': float(wald_pval)},
        'Score': {'stat': score_stat, 'pval': score_pval}
    }

# Small sample comparison where tests diverge
np.random.seed(999)
n_small = 50

X1_small = np.random.normal(0, 1, n_small)
X2_small = np.random.normal(0, 1, n_small)

# Weak effect that's hard to detect
linear_small = -0.3 + 0.3 * X1_small + 0.5 * X2_small
prob_small = 1 / (1 + np.exp(-linear_small))
y_small = np.random.binomial(1, prob_small, n_small)

X_null_small = sm.add_constant(X1_small)
X_full_small = sm.add_constant(np.column_stack([X1_small, X2_small]))

results = compare_tests(y_small, X_null_small, X_full_small, [2])

print("Test Comparison (n=50, testing X2)")
print("-" * 40)
for test_name, values in results.items():
    print(f"{test_name:6s}: stat = {values['stat']:6.3f}, p = {values['pval']:.4f}")

Run this multiple times with different seeds. You’ll observe that while the tests usually agree, small samples and weak effects produce noticeable divergence.
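The divergence is easiest to quantify in the simplest possible setting: a one-sample proportion test of H0: p = p0. The score test puts the null variance p0(1 - p0) in the denominator, while the Wald test plugs in the estimate. A vectorized Monte Carlo (illustrative, not tied to the logistic example above) shows the two can disagree markedly on Type I error at small n:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
p0, n, reps, alpha = 0.3, 20, 20_000, 0.05

# Number of successes in each replication, simulated under H0
successes = rng.binomial(n, p0, size=reps)
p_hat = successes / n

# Score statistic: null variance in the denominator
S = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))

# Wald statistic: estimated variance; undefined when p_hat is 0 or 1,
# where we treat the statistic as infinite (certain rejection)
with np.errstate(divide="ignore", invalid="ignore"):
    W = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))
W = np.where(np.isfinite(W), W, np.inf)

score_rate = np.mean(stats.chi2.sf(S, df=1) < alpha)
wald_rate = np.mean(stats.chi2.sf(W, df=1) < alpha)
print(f"Score Type I error: {score_rate:.3f}")
print(f"Wald  Type I error: {wald_rate:.3f}")
```

Both rates converge to the nominal 0.05 as n grows, but at n = 20 the discreteness of the data and the choice of variance estimate leave a visible gap between the two tests.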

Best Practices and Common Pitfalls

When to prefer score tests:

  • The alternative model is computationally expensive or fails to converge
  • Parameters under test might be near boundaries (e.g., variance components near zero)
  • You’re performing many tests and want consistent computational cost
  • Model diagnostics suggest the full model fit is unreliable

Common mistakes to avoid:

  1. Ignoring the efficient information. The naive approach of using only the test variables’ information ignores uncertainty in null parameter estimates. Always use the full information matrix partitioning.

  2. Misinterpreting non-significant results. A non-significant score test means the likelihood isn’t “pulling” toward the alternative at the null. It doesn’t mean the alternative is wrong—just that you lack evidence against the null.

  3. Applying to misspecified null models. Score tests assume the null model is correctly specified. If your null model has omitted variable bias or wrong functional form, the score test inherits these problems.

  4. Forgetting about multiple testing. When testing many variables, adjust p-values appropriately. The score test offers no protection against multiple comparison problems.

Computational considerations: Score tests require computing second derivatives (or their expectations) of the log-likelihood. For complex models, this can be expensive. However, you only do it once at the null, whereas likelihood ratio tests require full optimization of the alternative model.

For routine hypothesis testing in well-behaved models with moderate samples, all three classical tests perform similarly. Reserve score tests for situations where their unique advantages—null-only computation and boundary robustness—matter most.
