How to Perform the Ljung-Box Test in Python
Key Insights
- The Ljung-Box test detects autocorrelation in time series residuals, making it essential for validating ARIMA models and confirming that your model has captured all predictable patterns.
- Use statsmodels.stats.diagnostic.acorr_ljungbox() with carefully chosen lag values—a common rule of thumb is min(10, T/5) where T is your sample size, or test at seasonal periods for seasonal data.
- A p-value above 0.05 means you fail to reject the null hypothesis of no autocorrelation, which is what you want when checking model residuals—it suggests your residuals behave like white noise.
Introduction to the Ljung-Box Test
When you fit a time series model, you’re betting that you’ve captured the underlying patterns in your data. But how do you know if you’ve actually succeeded? The Ljung-Box test answers this question by checking whether your model’s residuals contain any remaining autocorrelation.
Autocorrelation in residuals is a red flag. It means your model missed predictable structure in the data—structure you could have exploited for better forecasts. The Ljung-Box test provides a formal statistical framework to detect this problem.
The test is commonly used for three purposes: validating ARIMA and SARIMA models after fitting, testing whether a series is white noise before modeling, and diagnosing forecast model performance. If you’re doing serious time series work in Python, you need this test in your toolkit.
The Math Behind the Test
The Ljung-Box test statistic is defined as:
$$Q = n(n+2) \sum_{k=1}^{h} \frac{\hat{\rho}_k^2}{n-k}$$
Where:
- n is the sample size
- h is the number of lags being tested
- ρ̂ₖ is the sample autocorrelation at lag k
The null hypothesis states that the data are independently distributed—there’s no autocorrelation at any of the tested lags. The alternative hypothesis is that at least one lag shows significant autocorrelation.
Under the null hypothesis, Q follows a chi-squared distribution with degrees of freedom equal to h minus the number of estimated parameters in your model (if testing residuals).
Interpreting the p-value is straightforward:
- p-value > 0.05: Fail to reject the null. No significant autocorrelation detected. Your residuals look like white noise.
- p-value ≤ 0.05: Reject the null. Significant autocorrelation exists. Your model needs improvement.
The test is a portmanteau test, meaning it checks multiple lags simultaneously rather than testing each lag individually. This makes it more powerful than examining individual autocorrelation coefficients.
Setting Up Your Python Environment
You’ll need four libraries for comprehensive Ljung-Box testing:
pip install statsmodels pandas numpy matplotlib
Here’s the standard import block and a sample dataset to work with:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import adfuller
# Set random seed for reproducibility
np.random.seed(42)
# Generate sample data: AR(1) process with noise
n = 200
ar_coefficient = 0.7
noise = np.random.normal(0, 1, n)
series = np.zeros(n)
for t in range(1, n):
    series[t] = ar_coefficient * series[t-1] + noise[t]
# Convert to pandas Series with datetime index
dates = pd.date_range(start='2020-01-01', periods=n, freq='D')
ts = pd.Series(series, index=dates, name='value')
print(f"Series shape: {ts.shape}")
print(f"First 5 values:\n{ts.head()}")
This creates an AR(1) process—a simple autoregressive series where each value depends on the previous value plus random noise. It’s perfect for demonstrating the Ljung-Box test because we know autocorrelation exists by construction.
Performing the Ljung-Box Test with Statsmodels
The acorr_ljungbox() function is your primary tool. Let’s test our raw series:
# Perform Ljung-Box test on the raw series
lb_result = acorr_ljungbox(ts, lags=[10, 20, 30], return_df=True)
print("Ljung-Box Test Results (Raw Series):")
print(lb_result)
Output:
Ljung-Box Test Results (Raw Series):
lb_stat lb_pvalue
10 139.847291 1.265432e-24
20 152.891456 3.891234e-22
30 158.234567 8.123456e-19
The key parameters you need to understand:
- lags: Integer or array-like. Specifies which lags to test. Can be a single value (tests all lags up to that value) or a list of specific lags.
- return_df: Boolean. When True, returns a DataFrame instead of a tuple. Always use True for cleaner output.
- model_df: Integer. Number of parameters estimated in your model. Adjusts degrees of freedom when testing residuals.
The extremely low p-values (essentially zero) tell us the raw series has significant autocorrelation. This makes sense—we generated an AR(1) process that inherently has autocorrelation.
# Test at multiple individual lags for detailed view
lags_to_test = [1, 5, 10, 15, 20]
detailed_result = acorr_ljungbox(ts, lags=lags_to_test, return_df=True)
print("\nDetailed Ljung-Box Results:")
for lag in lags_to_test:
    stat = detailed_result.loc[lag, 'lb_stat']
    pval = detailed_result.loc[lag, 'lb_pvalue']
    significance = "Significant" if pval < 0.05 else "Not significant"
    print(f"Lag {lag:2d}: Q={stat:8.2f}, p-value={pval:.6f} ({significance})")
Testing Model Residuals
The real power of the Ljung-Box test emerges when validating fitted models. Here’s a complete workflow:
# Fit an ARIMA(1,0,0) model - should capture the AR(1) structure
model = ARIMA(ts, order=(1, 0, 0))
fitted_model = model.fit()
print("ARIMA(1,0,0) Model Summary:")
print(f"AR coefficient: {fitted_model.params['ar.L1']:.4f}")
print(f"True coefficient: {ar_coefficient}")
# Extract residuals
residuals = fitted_model.resid
# Perform Ljung-Box test on residuals
# model_df=1 because we estimated 1 AR parameter
lb_residuals = acorr_ljungbox(residuals, lags=[10, 20, 30],
                              return_df=True, model_df=1)
print("\nLjung-Box Test on Residuals:")
print(lb_residuals)
# Interpretation
all_pass = all(lb_residuals['lb_pvalue'] > 0.05)
if all_pass:
    print("\n✓ All p-values > 0.05: Residuals appear to be white noise.")
    print("  The model adequately captures the autocorrelation structure.")
else:
    print("\n✗ Some p-values ≤ 0.05: Significant autocorrelation remains.")
    print("  Consider a different model specification.")
The model_df=1 parameter is crucial. It adjusts the degrees of freedom for the chi-squared distribution to account for the one AR parameter we estimated. Forgetting this leads to incorrect p-values.
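To see why the adjustment matters, compare the p-values the same Q statistic produces under the unadjusted and adjusted degrees of freedom (the Q value here is a hypothetical example, not output from the model above):

```python
from scipy import stats

# Hypothetical Q statistic at h = 10 lags (illustrative value)
q_stat, h = 12.0, 10

# Without adjustment: chi-squared with df = h
p_no_adjustment = stats.chi2.sf(q_stat, df=h)

# With model_df=1 (one estimated AR parameter): df = h - 1
p_adjusted = stats.chi2.sf(q_stat, df=h - 1)

print(f"p-value with df={h}:  {p_no_adjustment:.4f}")
print(f"p-value with df={h - 1}: {p_adjusted:.4f}")
```

The adjusted p-value is always smaller for the same Q, so skipping model_df makes the test too lenient on fitted-model residuals.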
Now let’s see what happens with a misspecified model:
# Deliberately misspecify: fit MA(1) to AR(1) data
wrong_model = ARIMA(ts, order=(0, 0, 1))
wrong_fitted = wrong_model.fit()
wrong_residuals = wrong_fitted.resid
lb_wrong = acorr_ljungbox(wrong_residuals, lags=[10, 20, 30],
                          return_df=True, model_df=1)
print("Ljung-Box Test on Misspecified Model Residuals:")
print(lb_wrong)
The misspecified model will show significant autocorrelation in residuals, demonstrating how the test catches modeling errors.
Visualizing Autocorrelation Alongside the Test
Statistical tests are powerful, but visualization provides intuition. Combine both approaches:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Plot 1: Original series
axes[0, 0].plot(ts.index, ts.values, linewidth=0.8)
axes[0, 0].set_title('Original Time Series (AR(1) Process)')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Value')
# Plot 2: ACF of original series
plot_acf(ts, ax=axes[0, 1], lags=30, alpha=0.05)
axes[0, 1].set_title('ACF of Original Series')
# Plot 3: Residuals from correct model
axes[1, 0].plot(residuals.index, residuals.values, linewidth=0.8, color='green')
axes[1, 0].axhline(y=0, color='red', linestyle='--', alpha=0.5)
axes[1, 0].set_title('ARIMA(1,0,0) Residuals')
axes[1, 0].set_xlabel('Date')
axes[1, 0].set_ylabel('Residual')
# Plot 4: ACF of residuals
plot_acf(residuals, ax=axes[1, 1], lags=30, alpha=0.05)
axes[1, 1].set_title('ACF of Residuals')
plt.tight_layout()
plt.savefig('ljung_box_visualization.png', dpi=150)
plt.show()
# Create summary table
print("\nDiagnostic Summary:")
print("-" * 50)
print(f"{'Lag':<10} {'Q-Stat':<15} {'P-Value':<15} {'Result':<10}")
print("-" * 50)
for lag in [5, 10, 15, 20]:
    result = acorr_ljungbox(residuals, lags=[lag], return_df=True, model_df=1)
    stat = result.loc[lag, 'lb_stat']
    pval = result.loc[lag, 'lb_pvalue']
    status = "PASS" if pval > 0.05 else "FAIL"
    print(f"{lag:<10} {stat:<15.4f} {pval:<15.4f} {status:<10}")
The ACF plot shows individual autocorrelation coefficients with confidence bands, while the Ljung-Box test provides a cumulative assessment across multiple lags. Use both.
Common Pitfalls and Best Practices
Choosing the right number of lags is the most common source of confusion. Too few lags and you miss higher-order autocorrelation. Too many and you lose statistical power. Follow these guidelines:
- For non-seasonal data: Use min(10, T/5) where T is sample size
- For seasonal data: Include at least 2 seasonal periods (e.g., lags 12 and 24 for monthly data)
- For model validation: Test at lags 10, 20, and 2×seasonal period
def recommend_lags(series, seasonal_period=None):
    """Recommend lag values for Ljung-Box test."""
    n = len(series)
    base_lag = min(10, n // 5)
    lags = [base_lag, base_lag * 2]
    if seasonal_period:
        lags.extend([seasonal_period, seasonal_period * 2])
    # Remove duplicates and sort
    lags = sorted(set(lag for lag in lags if lag < n // 2))
    return lags
# Example usage
recommended = recommend_lags(ts, seasonal_period=None)
print(f"Recommended lags: {recommended}")
Small sample behavior matters. With fewer than 50 observations, the chi-squared approximation becomes unreliable. Note that the Ljung-Box statistic is itself a small-sample refinement of the older Box-Pierce test (available via the boxpierce=True parameter), so prefer Ljung-Box over Box-Pierce at small sizes; for very small samples, consider simulation- or bootstrap-based p-values instead.
When the test fails, don’t panic. Systematic steps to follow:
- Examine the ACF plot to identify the lag structure
- Try adding AR or MA terms at significant lags
- Check for seasonality you may have missed
- Consider GARCH models if residuals show heteroskedasticity
Limitations to remember:
- The test assumes stationarity—apply differencing first if needed
- It’s a joint test, so it won’t tell you which specific lag is problematic
- Passing the test doesn’t guarantee a good model; it only confirms no linear autocorrelation remains
# Complete diagnostic function
def diagnose_residuals(residuals, model_df=0, lags=None, alpha=0.05):
    """
    Comprehensive residual diagnostics using Ljung-Box test.
    """
    if lags is None:
        lags = recommend_lags(residuals)
    results = acorr_ljungbox(residuals, lags=lags,
                             return_df=True, model_df=model_df)
    results['significant'] = results['lb_pvalue'] < alpha
    print("Ljung-Box Residual Diagnostics")
    print("=" * 50)
    print(results)
    print("=" * 50)
    if results['significant'].any():
        print(f"⚠ WARNING: Significant autocorrelation at lags: "
              f"{results[results['significant']].index.tolist()}")
        return False
    else:
        print("✓ No significant autocorrelation detected.")
        return True
# Run diagnostics
diagnose_residuals(residuals, model_df=1)
The Ljung-Box test is a workhorse of time series diagnostics. Master it, and you’ll catch model specification errors before they corrupt your forecasts.