How to Perform a Granger Causality Test in Python
Key Insights
- Granger causality tests whether one time series helps predict another—it’s about predictive power, not true causation, and both series must be stationary before testing.
- Always test in both directions; just because X Granger-causes Y doesn’t mean Y doesn’t also Granger-cause X, and ignoring this leads to incomplete conclusions.
- Lag selection matters enormously—too few lags miss important relationships, too many inflate false positives and reduce statistical power.
Introduction to Granger Causality
Granger causality is one of the most misunderstood concepts in time series analysis. Despite its name, it doesn’t prove causation. Instead, it answers a specific question: does knowing the past values of series X improve our ability to predict series Y, beyond what we could predict using Y’s past values alone?
Clive Granger developed this test in 1969, and it’s become a workhorse in fields where understanding lead-lag relationships matters. Economists use it to study whether money supply changes predict GDP growth. Neuroscientists apply it to determine if activity in one brain region precedes and predicts activity in another. Quantitative traders test whether one asset’s price movements provide predictive information about another.
The key insight is probabilistic: if X Granger-causes Y, then past values of X contain information that helps predict Y’s future values, even after accounting for Y’s own history. This is useful, but it’s not the same as saying X causes Y in any philosophical or mechanistic sense.
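Formally, the test compares two regressions of Y on its own past: a restricted model using only Y's history, and an unrestricted model that adds lags of X. A standard formulation with p lags (a sketch of the usual setup, not taken verbatim from the original) looks like:

```latex
% Restricted model: Y explained only by its own history
y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i \, y_{t-i} + \varepsilon_t

% Unrestricted model: add p lags of X
y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i \, y_{t-i}
      + \sum_{j=1}^{p} \beta_j \, x_{t-j} + \varepsilon_t

% Null hypothesis: the lags of X add no predictive information
H_0 : \beta_1 = \beta_2 = \dots = \beta_p = 0
```

An F-test on the residual sums of squares of the two models decides whether the beta coefficients are jointly zero; rejecting the null is precisely what "X Granger-causes Y" means.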
Statistical Prerequisites
Before running a Granger causality test, you need to satisfy one critical requirement: both time series must be stationary. A stationary series has constant mean, variance, and autocorrelation structure over time. Most real-world data—stock prices, GDP, temperatures—are non-stationary.
Why does stationarity matter? The Granger test relies on regression, and regressing one non-stationary series on another produces spurious results. You’ll find “significant” relationships that don’t exist.
The Augmented Dickey-Fuller (ADF) test is the standard tool for checking stationarity. Here’s how to use it:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
def check_stationarity(series, name="Series"):
    """
    Perform ADF test and return stationarity assessment.
    """
    result = adfuller(series.dropna(), autolag='AIC')
    print(f"ADF Test for {name}")
    print(f"  Test Statistic: {result[0]:.4f}")
    print(f"  p-value: {result[1]:.4f}")
    print(f"  Lags Used: {result[2]}")
    print(f"  Critical Values:")
    for key, value in result[4].items():
        print(f"    {key}: {value:.4f}")
    is_stationary = result[1] < 0.05
    print(f"  Conclusion: {'Stationary' if is_stationary else 'Non-stationary'}")
    return is_stationary
The null hypothesis of the ADF test is that the series has a unit root (non-stationary). A p-value below 0.05 means you reject the null and conclude stationarity.
Lag selection is the second prerequisite concept. The Granger test asks whether past values of X predict Y, but how far back should we look? Information criteria like AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) help determine optimal lag length by balancing model fit against complexity.
Preparing Your Time Series Data
Let’s work with a practical example. We’ll examine whether industrial production Granger-causes retail sales, using simulated series that mimic the kind of monthly indicators you would pull from Federal Reserve Economic Data (FRED).
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller
# Create sample data simulating economic indicators
np.random.seed(42)
n_periods = 200
# Simulate industrial production (leading indicator)
industrial_prod = np.zeros(n_periods)
industrial_prod[0] = 100
for t in range(1, n_periods):
    industrial_prod[t] = industrial_prod[t-1] + np.random.normal(0.1, 1)
# Simulate retail sales (lagged response to industrial production)
retail_sales = np.zeros(n_periods)
retail_sales[0] = 50
for t in range(1, n_periods):
    # Retail sales responds to industrial production with a lag
    lag_effect = 0.3 * (industrial_prod[t-1] - industrial_prod[t-2]) if t > 1 else 0
    retail_sales[t] = retail_sales[t-1] + lag_effect + np.random.normal(0.05, 0.8)
# Create DataFrame
dates = pd.date_range(start='2010-01-01', periods=n_periods, freq='M')
df = pd.DataFrame({
    'industrial_production': industrial_prod,
    'retail_sales': retail_sales
}, index=dates)
print(df.head(10))
Now check stationarity for both series:
# Check stationarity of raw series
print("=" * 50)
ip_stationary = check_stationarity(df['industrial_production'], "Industrial Production")
print()
rs_stationary = check_stationarity(df['retail_sales'], "Retail Sales")
If either series is non-stationary (and they likely will be), apply differencing:
def make_stationary(series, name="Series", max_diffs=2):
    """
    Apply differencing until series is stationary.
    Returns differenced series and number of differences applied.
    """
    diff_series = series.copy()
    n_diffs = 0
    for i in range(max_diffs + 1):
        if check_stationarity(diff_series, f"{name} (d={i})"):
            return diff_series, n_diffs
        if i < max_diffs:
            diff_series = diff_series.diff().dropna()
            n_diffs += 1
    print(f"Warning: {name} not stationary after {max_diffs} differences")
    return diff_series, n_diffs
# Make both series stationary
df_stationary = pd.DataFrame()
df_stationary['industrial_production'], ip_diffs = make_stationary(
    df['industrial_production'], "Industrial Production"
)
df_stationary['retail_sales'], rs_diffs = make_stationary(
    df['retail_sales'], "Retail Sales"
)
# Align the series (differencing creates NaN values)
df_stationary = df_stationary.dropna()
print(f"\nFinal dataset shape: {df_stationary.shape}")
Performing the Granger Causality Test
With stationary data in hand, we can run the Granger causality test using statsmodels:
from statsmodels.tsa.stattools import grangercausalitytests
# Prepare data: Granger test expects a 2D array with [effect, cause] column order
# Testing: Does industrial production Granger-cause retail sales?
data_for_test = df_stationary[['retail_sales', 'industrial_production']].values
print("Testing: Industrial Production -> Retail Sales")
print("=" * 60)
# Run Granger causality test with multiple lag values
max_lag = 8
gc_results = grangercausalitytests(data_for_test, maxlag=max_lag, verbose=True)
The grangercausalitytests function takes two key parameters:
- maxlag: tests all lags from 1 to this value
- verbose: when True, prints detailed results for each lag (note that verbose is deprecated in recent statsmodels releases)
The column order matters critically. The function tests whether the second column Granger-causes the first column. Getting this backwards inverts your conclusions.
Interpreting Results
The output includes four test statistics for each lag: F-test, chi-squared (ssr_chi2test), likelihood ratio, and parameter F-test. In practice, focus on the F-test and its p-value.
Here’s how to parse results programmatically:
def granger_causality_analysis(data, cause_col, effect_col, max_lag=8):
    """
    Perform Granger causality test and return structured results.
    """
    # Column order: [effect, cause]
    test_data = data[[effect_col, cause_col]].values
    # Run test (verbose=False to suppress output)
    results = grangercausalitytests(test_data, maxlag=max_lag, verbose=False)
    # Extract results into a clean DataFrame
    summary = []
    for lag in range(1, max_lag + 1):
        f_stat = results[lag][0]['ssr_ftest'][0]
        f_pvalue = results[lag][0]['ssr_ftest'][1]
        chi2_stat = results[lag][0]['ssr_chi2test'][0]
        chi2_pvalue = results[lag][0]['ssr_chi2test'][1]
        summary.append({
            'lag': lag,
            'f_statistic': f_stat,
            'f_pvalue': f_pvalue,
            'chi2_statistic': chi2_stat,
            'chi2_pvalue': chi2_pvalue,
            'significant_at_05': f_pvalue < 0.05
        })
    summary_df = pd.DataFrame(summary)
    # Find optimal lag (lowest p-value among significant results)
    significant = summary_df[summary_df['significant_at_05']]
    if len(significant) > 0:
        optimal_lag = significant.loc[significant['f_pvalue'].idxmin(), 'lag']
        conclusion = f"{cause_col} Granger-causes {effect_col} (optimal lag: {int(optimal_lag)})"
    else:
        conclusion = f"No evidence that {cause_col} Granger-causes {effect_col}"
    return summary_df, conclusion
# Run analysis
results_df, conclusion = granger_causality_analysis(
df_stationary,
cause_col='industrial_production',
effect_col='retail_sales',
max_lag=8
)
print("\nGranger Causality Test Results")
print("=" * 60)
print(results_df.to_string(index=False))
print(f"\nConclusion: {conclusion}")
The null hypothesis states that X does not Granger-cause Y. A p-value below 0.05 means you reject this null—X does provide predictive information about Y. But be careful with multiple testing: checking many lags inflates your false positive rate.
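One simple (if conservative) guard is a Bonferroni correction: divide alpha by the number of lags you examined. The sketch below uses made-up p-values purely for illustration.

```python
import numpy as np

# Hypothetical F-test p-values from testing lags 1 through 8
pvalues = np.array([0.041, 0.012, 0.038, 0.20, 0.07, 0.55, 0.31, 0.44])

alpha = 0.05
bonferroni_alpha = alpha / len(pvalues)  # 0.05 / 8 = 0.00625

naive_hits = (pvalues < alpha).sum()
corrected_hits = (pvalues < bonferroni_alpha).sum()
print(f"Significant at raw alpha:     {naive_hits}")      # 3
print(f"Significant after correction: {corrected_hits}")  # 0
```

Three lags look significant at the raw threshold, but none survive the correction; a single strongly significant lag is far more convincing than several marginal ones.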
Common Pitfalls and Best Practices
The biggest mistake is testing only one direction. Granger causality can be bidirectional, unidirectional, or absent in both directions. Always test both:
def bidirectional_granger_test(data, col1, col2, max_lag=8, alpha=0.05):
    """
    Test Granger causality in both directions and summarize findings.
    """
    results = {}
    # Test col1 -> col2
    df1, conclusion1 = granger_causality_analysis(data, col1, col2, max_lag)
    results[f'{col1} -> {col2}'] = {
        'results': df1,
        'significant_lags': df1[df1['f_pvalue'] < alpha]['lag'].tolist(),
        'min_pvalue': df1['f_pvalue'].min()
    }
    # Test col2 -> col1
    df2, conclusion2 = granger_causality_analysis(data, col2, col1, max_lag)
    results[f'{col2} -> {col1}'] = {
        'results': df2,
        'significant_lags': df2[df2['f_pvalue'] < alpha]['lag'].tolist(),
        'min_pvalue': df2['f_pvalue'].min()
    }
    # Determine relationship type
    sig1 = len(results[f'{col1} -> {col2}']['significant_lags']) > 0
    sig2 = len(results[f'{col2} -> {col1}']['significant_lags']) > 0
    if sig1 and sig2:
        relationship = "Bidirectional (feedback)"
    elif sig1:
        relationship = f"Unidirectional: {col1} -> {col2}"
    elif sig2:
        relationship = f"Unidirectional: {col2} -> {col1}"
    else:
        relationship = "No Granger causality detected"
    print("\nBidirectional Granger Causality Analysis")
    print("=" * 60)
    # Use a name other than `data` here to avoid shadowing the parameter
    for direction, res in results.items():
        sig_lags = res['significant_lags']
        status = f"Significant at lags {sig_lags}" if sig_lags else "Not significant"
        print(f"{direction}: {status} (min p={res['min_pvalue']:.4f})")
    print(f"\nRelationship: {relationship}")
    return results, relationship
# Run bidirectional test
bidir_results, relationship = bidirectional_granger_test(
df_stationary,
'industrial_production',
'retail_sales',
max_lag=8
)
Other critical pitfalls to avoid:
Confounding variables: If a third variable drives both X and Y, you’ll find spurious Granger causality. The test can’t detect this—you need domain knowledge.
Sample size: With fewer than 50-100 observations per series, results become unreliable. More lags require more data.
Structural breaks: If the relationship between series changes over time (regime shifts), pooled tests give misleading results. Consider rolling window analysis.
Conclusion
The Granger causality workflow is straightforward: verify stationarity, difference if needed, run the test in both directions, and interpret p-values carefully. Remember that statistical significance doesn’t equal practical significance, and Granger causality doesn’t equal true causation.
For deeper analysis, consider Vector Autoregression (VAR) models, which generalize Granger causality to systems with multiple interacting time series. The statsmodels.tsa.api.VAR class provides this functionality and includes built-in Granger causality tests as part of model diagnostics.
Use Granger causality as an exploratory tool to identify potential predictive relationships, but validate findings with domain expertise and out-of-sample testing before acting on them.