How to Perform the ADF Test for Stationarity in Python

Key Insights

  • The ADF test checks for stationarity by testing whether a time series has a unit root—reject the null hypothesis (p-value < 0.05) to confirm your data is stationary and suitable for models like ARIMA
  • Always visualize your data before and after transformations; differencing is the most common fix for non-stationarity, but log transformations work better for series with increasing variance
  • The number of lags in the ADF test affects results significantly—use automatic lag selection (AIC/BIC) unless you have domain-specific reasons to override it

Introduction to Stationarity

Stationarity is a fundamental assumption for most time series forecasting models. A stationary time series has statistical properties that don’t change over time: constant mean, constant variance, and autocorrelation that depends only on the time lag between observations, not on the actual time.

Why does this matter? Models like ARIMA, VAR, and many machine learning approaches assume stationarity because they learn patterns from historical data. If your data’s mean trends upward over time or variance explodes in recent periods, patterns from 2020 won’t help predict 2024. You’re essentially trying to hit a moving target.

Non-stationary data leads to spurious regressions, unreliable forecasts, and models that perform well in backtesting but fail catastrophically in production. The Augmented Dickey-Fuller test gives you a statistical framework to detect non-stationarity before you waste time building models on unsuitable data.

Understanding the Augmented Dickey-Fuller (ADF) Test

The ADF test evaluates whether a time series has a unit root—a statistical property that makes the series non-stationary. Here’s the hypothesis structure:

  • Null Hypothesis (H₀): The time series has a unit root (non-stationary)
  • Alternative Hypothesis (H₁): The time series is stationary

This formulation is crucial. You’re looking to reject the null hypothesis. A p-value below your significance level (typically 0.05) means you reject H₀ and conclude the series is stationary.

The test statistic itself is compared against critical values at different significance levels (1%, 5%, 10%). If your test statistic is more negative than the critical value, you reject the null hypothesis. Most practitioners focus on the p-value since it’s more intuitive.

Let’s visualize the difference between stationary and non-stationary series:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Stationary: white noise with constant mean
stationary = np.random.normal(loc=10, scale=2, size=200)

# Non-stationary: random walk with drift
non_stationary = np.cumsum(np.random.normal(loc=0.5, scale=1, size=200))

fig, axes = plt.subplots(2, 1, figsize=(12, 6))

axes[0].plot(stationary, color='blue')
axes[0].set_title('Stationary Series (White Noise)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Value')
axes[0].axhline(y=stationary.mean(), color='red', linestyle='--', label='Mean')
axes[0].legend()

axes[1].plot(non_stationary, color='red')
axes[1].set_title('Non-Stationary Series (Random Walk)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Value')
axes[1].set_xlabel('Time')

plt.tight_layout()
plt.show()

The stationary series fluctuates around a constant mean. The non-stationary series trends upward with no tendency to revert to a fixed level.

Implementing the ADF Test with statsmodels

The statsmodels library provides a straightforward implementation. With the default autolag setting, the adfuller() function returns a tuple with six elements:

  1. ADF test statistic: More negative = more evidence against a unit root
  2. p-value: Probability of a test statistic at least this extreme if the null hypothesis is true
  3. Number of lags used: Determined automatically or by your specification
  4. Number of observations: Sample size after accounting for lags
  5. Critical values: Dictionary with 1%, 5%, and 10% thresholds
  6. icbest: The maximized information criterion (omitted when autolag=None)

Here’s a practical example using stock price data:

import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Simulate stock prices (geometric random walk - typically non-stationary)
np.random.seed(42)
returns = np.random.normal(0.001, 0.02, 252)
prices = 100 * np.exp(np.cumsum(returns))

# Perform ADF test
result = adfuller(prices, autolag='AIC')

print('ADF Test Results:')
print(f'ADF Statistic: {result[0]:.6f}')
print(f'p-value: {result[1]:.6f}')
print(f'Lags used: {result[2]}')
print(f'Number of observations: {result[3]}')
print('Critical Values:')
for key, value in result[4].items():
    print(f'  {key}: {value:.3f}')

# Interpretation
if result[1] <= 0.05:
    print("\nResult: Reject null hypothesis - Data is stationary")
else:
    print("\nResult: Fail to reject null hypothesis - Data is non-stationary")

Output typically shows a p-value well above 0.05, confirming that stock prices are non-stationary. The test statistic will be less negative than the critical values, providing additional evidence.

Handling Non-Stationary Data

When the ADF test confirms non-stationarity, you have several transformation options:

Differencing: Subtract the previous value from the current value. This removes trends and is the most common approach.

Log transformation: Apply log(x) to stabilize variance that increases with the level of the series.

Detrending: Remove a fitted trend line from the data.

Differencing is your first tool. Here’s how to apply it and verify the transformation:

from statsmodels.tsa.stattools import adfuller
import pandas as pd
import matplotlib.pyplot as plt

# Using our previous price data
prices_series = pd.Series(prices)

# First-order differencing
prices_diff = prices_series.diff().dropna()

# Test original series
result_original = adfuller(prices_series, autolag='AIC')
print("Original Series:")
print(f"ADF Statistic: {result_original[0]:.6f}")
print(f"p-value: {result_original[1]:.6f}")

# Test differenced series
result_diff = adfuller(prices_diff, autolag='AIC')
print("\nDifferenced Series:")
print(f"ADF Statistic: {result_diff[0]:.6f}")
print(f"p-value: {result_diff[1]:.6f}")

# Visualization
fig, axes = plt.subplots(2, 1, figsize=(12, 6))

axes[0].plot(prices_series, color='blue')
axes[0].set_title(f'Original Series (p-value: {result_original[1]:.4f})')
axes[0].set_ylabel('Price')

axes[1].plot(prices_diff, color='green')
axes[1].set_title(f'Differenced Series (p-value: {result_diff[1]:.4f})')
axes[1].set_ylabel('Price Change')
axes[1].set_xlabel('Time')

plt.tight_layout()
plt.show()

After differencing, the p-value typically drops below 0.05, confirming stationarity. The differenced series represents returns rather than prices—a stationary representation suitable for modeling.

Practical Example: End-to-End Pipeline

Let’s build a complete workflow that you can adapt to your projects:

import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller
import matplotlib.pyplot as plt

def adf_test(series, name=''):
    """
    Perform ADF test and print formatted results
    """
    result = adfuller(series.dropna(), autolag='AIC')
    
    print(f'\n{"="*50}')
    print(f'ADF Test Results for {name}')
    print(f'{"="*50}')
    print(f'ADF Statistic: {result[0]:.6f}')
    print(f'p-value: {result[1]:.6f}')
    print(f'Lags used: {result[2]}')
    print(f'Critical Values:')
    for key, value in result[4].items():
        print(f'  {key}: {value:.3f}')
    
    if result[1] <= 0.05:
        print(f'\nConclusion: {name} is STATIONARY (reject H0)')
    else:
        print(f'\nConclusion: {name} is NON-STATIONARY (fail to reject H0)')
    
    return result[1]  # Return p-value

# Simulate daily temperature data with trend and seasonality
np.random.seed(42)
days = 365 * 3
trend = np.linspace(15, 17, days)  # Warming trend
seasonal = 10 * np.sin(np.arange(days) * 2 * np.pi / 365)  # Yearly cycle
noise = np.random.normal(0, 2, days)
temperature = trend + seasonal + noise

# Create DataFrame
df = pd.DataFrame({
    'temperature': temperature
}, index=pd.date_range('2021-01-01', periods=days, freq='D'))

# Test original series
p_original = adf_test(df['temperature'], 'Original Temperature')

# Apply first-order differencing
df['temp_diff1'] = df['temperature'].diff()
p_diff1 = adf_test(df['temp_diff1'], 'First-Order Difference')

# If still non-stationary, try second-order differencing
if p_diff1 > 0.05:
    df['temp_diff2'] = df['temp_diff1'].diff()
    p_diff2 = adf_test(df['temp_diff2'], 'Second-Order Difference')

# Visualize transformation pipeline (second difference only if it was computed)
n_plots = 3 if 'temp_diff2' in df.columns else 2
fig, axes = plt.subplots(n_plots, 1, figsize=(14, 8))

df['temperature'].plot(ax=axes[0], title='Original Series', color='blue')
axes[0].set_ylabel('Temperature (°C)')

df['temp_diff1'].plot(ax=axes[1], title='First Difference', color='green')
axes[1].set_ylabel('Change in Temp')

if 'temp_diff2' in df.columns:
    df['temp_diff2'].plot(ax=axes[2], title='Second Difference', color='red')
    axes[2].set_ylabel('Change in Change')

axes[-1].set_xlabel('Date')

plt.tight_layout()
plt.show()

This pipeline tests each transformation step and automatically proceeds to higher-order differencing if needed. The helper function standardizes output formatting and returns p-values for programmatic decision-making.

Common Pitfalls and Best Practices

Lag selection matters. The ADF test includes lagged difference terms to account for autocorrelation. Too few lags and you might miss important dynamics; too many and you lose power. Always use autolag='AIC' or autolag='BIC' unless you have specific domain knowledge.

# Demonstrate impact of lag selection
from statsmodels.tsa.stattools import adfuller

# autolag=None fixes the lag at maxlag; 'AIC'/'BIC' search up to maxlag
for lag_method in [None, 'AIC', 'BIC']:
    result = adfuller(prices, autolag=lag_method, maxlag=20)
    print(f"\nLag method: {lag_method}")
    print(f"Lags used: {result[2]}")
    print(f"p-value: {result[1]:.6f}")

Sample size requirements: The ADF test needs sufficient observations. With fewer than roughly 50 data points, results become unreliable. Aim for 100+ observations when possible.

Seasonal vs. trend stationarity: A series can be trend-stationary (stationary around a deterministic trend) yet fail the default ADF test, which includes only a constant term; pass regression='ct' to adfuller() to test against a trend, or detrend first. If you have strong seasonality, consider seasonal differencing (lag-12 for monthly data, lag-7 for daily data with weekly patterns) before applying the ADF test.

Structural breaks: The ADF test assumes consistent behavior throughout the series. Major regime changes (like COVID-19’s impact on economic data) can cause false negatives. Split your data at known break points and test each segment separately.

Don’t over-difference: Differencing too many times introduces unnecessary complexity and can make your series harder to model. If first-order differencing achieves stationarity (p-value < 0.05), stop there.

The ADF test is your first line of defense against non-stationary data, but combine it with visual inspection and domain knowledge. A p-value of 0.051 doesn’t mean your data is fundamentally different from data with p-value 0.049—use the test as a guide, not an absolute rule.
