Time Series Autocorrelation Explained
Key Insights
- Autocorrelation measures how a time series correlates with lagged versions of itself, revealing hidden patterns like seasonality and trends that aren’t obvious from raw data visualization alone
- ACF plots show total correlation at each lag while PACF plots isolate direct relationships, making them essential tools for identifying the right ARIMA model parameters
- A slowly decaying ACF indicates non-stationarity requiring differencing, while sharp cutoffs in ACF or PACF plots directly inform the MA and AR order selection
Understanding Autocorrelation
Autocorrelation is the correlation between a time series and a lagged version of itself. While simple correlation measures the relationship between two different variables, autocorrelation examines how past values in a single series relate to current values. This concept is fundamental to time series analysis because temporal data has memory—what happened yesterday often influences what happens today.
Understanding autocorrelation matters for three critical reasons: it reveals hidden patterns in your data, helps you select appropriate forecasting models, and allows you to validate model performance by checking residuals. Without examining autocorrelation, you’re essentially flying blind when building time series models.
Let’s start with a simple example showing sales data with clear patterns:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate synthetic sales data with trend and seasonality
np.random.seed(42)
time = np.arange(0, 365)
trend = 0.1 * time
seasonality = 10 * np.sin(2 * np.pi * time / 30)  # 30-day cycle
noise = np.random.normal(0, 2, len(time))
sales = 100 + trend + seasonality + noise

df = pd.DataFrame({'date': pd.date_range('2023-01-01', periods=365),
                   'sales': sales})

plt.figure(figsize=(12, 4))
plt.plot(df['date'], df['sales'])
plt.title('Daily Sales with Trend and Seasonality')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.tight_layout()
plt.show()
```
This data exhibits both an upward trend and monthly seasonality. Autocorrelation will help us quantify these patterns mathematically.
The Mathematics Behind Autocorrelation
The autocorrelation function (ACF) at lag k measures the correlation between observations that are k time periods apart. The formula is:
ρ(k) = Cov(y_t, y_{t-k}) / Var(y_t)
Where ρ(k) is the autocorrelation coefficient at lag k, ranging from -1 to +1. A value near +1 indicates strong positive correlation (values tend to move together), near -1 indicates strong negative correlation (values move in opposite directions), and near 0 indicates no linear relationship.
Let’s manually calculate the lag-1 autocorrelation to understand the mechanics:
```python
def calculate_autocorr_manual(series, lag=1):
    """Manually calculate autocorrelation at the specified lag."""
    n = len(series)
    mean = series.mean()
    # Variance of the full series
    variance = np.sum((series - mean) ** 2) / n
    # Autocovariance at lag k: pair each value with the one k steps earlier
    y_t = series[lag:]
    y_t_lag = series[:-lag]
    autocovariance = np.sum((y_t - mean) * (y_t_lag - mean)) / n
    # Autocorrelation = autocovariance / variance
    return autocovariance / variance

# Calculate lag-1 autocorrelation manually
lag1_manual = calculate_autocorr_manual(df['sales'].values, lag=1)
print(f"Manual lag-1 autocorrelation: {lag1_manual:.4f}")

# Verify with pandas' built-in method
lag1_pandas = df['sales'].autocorr(lag=1)
print(f"Pandas lag-1 autocorrelation: {lag1_pandas:.4f}")
```
For our seasonal sales data, you’ll see a high positive lag-1 autocorrelation (typically above 0.9), meaning today’s sales strongly predict tomorrow’s sales.
Reading ACF and PACF Plots
While individual autocorrelation values are useful, plotting the entire autocorrelation function reveals patterns across all lags. The ACF plot shows correlation at each lag, while the partial autocorrelation function (PACF) shows the correlation at lag k after removing the effects of shorter lags.
Think of PACF as the “direct” correlation between observations k periods apart, controlling for intermediate observations. This distinction becomes crucial for model selection.
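To make the "direct correlation" idea concrete, here is a minimal numpy-only sketch that computes the lag-k partial autocorrelation as the last coefficient of an AR(k) least-squares regression. The `pacf_via_regression` helper is hypothetical, written for intuition only; in practice, use statsmodels' `pacf`:

```python
import numpy as np

def pacf_via_regression(series, lag):
    """Lag-k PACF as the last coefficient of an AR(lag) least-squares fit.

    A sketch for intuition; it mirrors the regression-based approach to the
    PACF but is not a replacement for the statsmodels routine.
    """
    y = np.asarray(series, dtype=float)
    y = y - y.mean()
    n = len(y)
    # Columns are y_{t-1}, y_{t-2}, ..., y_{t-lag}, aligned with target y_t
    X = np.column_stack([y[lag - j - 1:n - j - 1] for j in range(lag)])
    coeffs, *_ = np.linalg.lstsq(X, y[lag:], rcond=None)
    return coeffs[-1]  # coefficient on y_{t-lag}, controlling for shorter lags

# Demo: an AR(1) series has a large lag-1 PACF and a near-zero lag-2 PACF
rng = np.random.default_rng(0)
eps = rng.normal(size=1000)
ar1 = np.zeros(1000)
for t in range(1, 1000):
    ar1[t] = 0.8 * ar1[t - 1] + eps[t]

print(f"lag-1 PACF: {pacf_via_regression(ar1, 1):.3f}")  # near 0.8
print(f"lag-2 PACF: {pacf_via_regression(ar1, 2):.3f}")  # near 0
```

Regressing y_t on all lags up to k and keeping only the last coefficient is exactly what "controlling for intermediate observations" means: the shorter lags soak up the indirect correlation, leaving the direct lag-k effect.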
```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf

fig, axes = plt.subplots(3, 2, figsize=(12, 10))

# Generate different time series patterns
np.random.seed(42)
n = 200

# 1. White noise (random)
white_noise = np.random.normal(0, 1, n)
# 2. Random walk (non-stationary)
random_walk = np.cumsum(np.random.normal(0, 1, n))
# 3. Seasonal pattern
seasonal = 10 * np.sin(2 * np.pi * np.arange(n) / 12) + np.random.normal(0, 1, n)

# Plot ACF and PACF for each pattern
patterns = [
    (white_noise, 'White Noise'),
    (random_walk, 'Random Walk'),
    (seasonal, 'Seasonal Pattern')
]
for idx, (data, title) in enumerate(patterns):
    plot_acf(data, lags=40, ax=axes[idx, 0], alpha=0.05)
    axes[idx, 0].set_title(f'ACF: {title}')
    plot_pacf(data, lags=40, ax=axes[idx, 1], alpha=0.05)
    axes[idx, 1].set_title(f'PACF: {title}')
plt.tight_layout()
plt.show()
```
Key patterns to recognize:
- White noise: ACF and PACF values all fall within confidence bands (no significant correlation)
- Random walk: ACF decays very slowly, indicating non-stationarity
- Seasonal data: ACF shows peaks at the seasonal lags (here, every 12 lags)
The confidence bands (typically at 95%) help determine statistical significance. Values outside these bands indicate significant autocorrelation.
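The band width itself is easy to derive: under the white-noise null hypothesis, sample autocorrelations are approximately normal with variance 1/n, so the 95% band sits at roughly ±1.96/√n. A quick numpy sketch (the `sample_acf` helper is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)  # white noise

# Under the white-noise null, sample autocorrelations are roughly
# N(0, 1/n), so the approximate 95% band is +/- 1.96/sqrt(n).
band = 1.96 / np.sqrt(n)

def sample_acf(series, lag):
    s = series - series.mean()
    return np.sum(s[lag:] * s[:-lag]) / np.sum(s * s)

# For genuine white noise, roughly 95% of lags should land inside the band
inside = sum(abs(sample_acf(x, k)) <= band for k in range(1, 21))
print(f"Band: +/-{band:.3f}; {inside}/20 lags inside the band")
```

This is why longer series have tighter bands: with more data, even small autocorrelations become statistically distinguishable from zero.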
Identifying Time Series Patterns
Autocorrelation reveals three main patterns: trends, seasonality, and cycles. Each has a distinct ACF signature.
Positive autocorrelation at low lags indicates persistence—values tend to stay similar over short periods. This is common in financial data and temperature readings. Negative autocorrelation is rarer but appears in mean-reverting processes.
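As a quick illustration of the rarer negative case, an AR(1) process with a negative coefficient mean-reverts by flipping sign from step to step, so its lag-1 autocorrelation is negative. A sketch (the coefficient value is arbitrary):

```python
import numpy as np

# Hypothetical mean-reverting process: AR(1) with a negative coefficient.
# For AR(1), the theoretical lag-1 autocorrelation equals the coefficient.
rng = np.random.default_rng(42)
n = 1000
phi = -0.7
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

s = x - x.mean()
lag1 = np.sum(s[1:] * s[:-1]) / np.sum(s * s)
print(f"Lag-1 autocorrelation: {lag1:.3f}")  # close to phi = -0.7
```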
Seasonality manifests as periodic spikes in the ACF at regular intervals. Annual seasonality in monthly data shows peaks at lags 12, 24, 36, and so on; daily seasonality in hourly data shows peaks at lags 24, 48, and so on.
```python
# Compare different patterns
np.random.seed(42)
n = 300

# Random walk: strong persistence
random_walk = np.cumsum(np.random.normal(0, 1, n))
# Strong seasonality (period=12)
seasonal = 5 * np.sin(2 * np.pi * np.arange(n) / 12) + np.random.normal(0, 0.5, n)
# White noise: no pattern
white_noise = np.random.normal(0, 1, n)

fig, axes = plt.subplots(3, 1, figsize=(12, 9))
datasets = [
    (random_walk, 'Random Walk (Non-Stationary)'),
    (seasonal, 'Seasonal Pattern (Period=12)'),
    (white_noise, 'White Noise (No Correlation)')
]
for idx, (data, title) in enumerate(datasets):
    plot_acf(data, lags=50, ax=axes[idx], alpha=0.05)
    axes[idx].set_title(title)
plt.tight_layout()
plt.show()
```
The random walk shows slow, linear decay—a telltale sign you need differencing. The seasonal pattern shows clear peaks every 12 lags. White noise shows no significant correlations.
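Differencing is the standard fix for that slow decay: taking first differences of a random walk recovers its independent increments, and the autocorrelation collapses accordingly. A small numpy sketch (the `lag1_autocorr` helper is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
walk = np.cumsum(rng.normal(size=300))  # non-stationary random walk
diffed = np.diff(walk)                  # first difference: back to the increments

def lag1_autocorr(series):
    s = series - series.mean()
    return np.sum(s[1:] * s[:-1]) / np.sum(s * s)

print(f"Lag-1 ACF of the walk:        {lag1_autocorr(walk):.3f}")   # near 1
print(f"Lag-1 ACF after differencing: {lag1_autocorr(diffed):.3f}")  # near 0
```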
Using Autocorrelation for Model Selection
ACF and PACF plots are your primary tools for selecting ARIMA(p,d,q) parameters:
- d (differencing order): If ACF decays slowly, difference the series until it becomes stationary
- p (AR order): Look at PACF—a sharp cutoff after lag p suggests AR(p)
- q (MA order): Look at ACF—a sharp cutoff after lag q suggests MA(q)
Here’s a practical workflow:
```python
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.stattools import adfuller

# Generate an AR(2) process for demonstration
np.random.seed(42)
n = 500
ar_params = np.array([0.6, 0.3])
ma_params = np.array([])
ar = np.r_[1, -ar_params]  # ArmaProcess expects lag-polynomial coefficients
ma = np.r_[1, ma_params]
ar_process = ArmaProcess(ar, ma)
y = ar_process.generate_sample(n)

# Plot ACF and PACF
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(y, lags=20, ax=axes[0])
plot_pacf(y, lags=20, ax=axes[1])
axes[0].set_title('ACF: AR(2) Process')
axes[1].set_title('PACF: AR(2) Process')
plt.tight_layout()
plt.show()

# The PACF shows a cutoff after lag 2, suggesting AR(2)
# Fit an ARIMA(2, 0, 0) model
model = ARIMA(y, order=(2, 0, 0))
results = model.fit()
print(results.summary())

# Check residuals - they should look like white noise
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(results.resid, lags=20, ax=axes[0])
axes[0].set_title('Residual ACF')
plot_pacf(results.resid, lags=20, ax=axes[1])
axes[1].set_title('Residual PACF')
plt.tight_layout()
plt.show()
```
If your residuals show no significant autocorrelation (all values within confidence bands), your model has captured the temporal structure adequately.
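Beyond eyeballing the residual ACF, the Ljung-Box test aggregates autocorrelations across many lags into a single statistic (statsmodels ships this as `acorr_ljungbox`; below is a numpy-only sketch of the statistic itself, for intuition):

```python
import numpy as np

def ljung_box_q(residuals, max_lag=10):
    """Ljung-Box Q statistic: under white-noise residuals it is approximately
    chi-squared distributed with max_lag degrees of freedom."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    n = len(r)
    denom = np.sum(r * r)
    q = 0.0
    for k in range(1, max_lag + 1):
        rho_k = np.sum(r[k:] * r[:-k]) / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q

rng = np.random.default_rng(42)
white = rng.normal(size=500)
# The chi-squared(10) 5% critical value is about 18.3; white-noise
# residuals should fall well below it, correlated residuals far above.
print(f"Q for white noise:   {ljung_box_q(white):.2f}")
print(f"Q for a random walk: {ljung_box_q(np.cumsum(white)):.2f}")
```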
Complete Practical Workflow
Let’s apply everything to a real-world scenario. We’ll analyze a time series, identify patterns, select a model, and validate results:
```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Generate realistic daily temperature data
np.random.seed(42)
days = 365 * 3  # 3 years
time = np.arange(days)

# Annual seasonality + trend + noise
annual_pattern = 15 * np.sin(2 * np.pi * time / 365)
trend = 0.005 * time  # slight warming trend
noise = np.random.normal(0, 3, days)
temperature = 20 + annual_pattern + trend + noise

df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=days),
    'temp': temperature
})

# Step 1: Visualize the data
plt.figure(figsize=(12, 4))
plt.plot(df['date'], df['temp'])
plt.title('Daily Temperature Over 3 Years')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.show()

# Step 2: Check stationarity with the augmented Dickey-Fuller test
adf_result = adfuller(df['temp'])
print(f"ADF Statistic: {adf_result[0]:.4f}")
print(f"p-value: {adf_result[1]:.4f}")

# Step 3: Examine ACF/PACF
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(df['temp'], lags=100, ax=axes[0])
plot_pacf(df['temp'], lags=100, ax=axes[1])
plt.tight_layout()
plt.show()

# Step 4: Difference if needed (remove trend)
df['temp_diff'] = df['temp'].diff()
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(df['temp_diff'].dropna(), lags=100, ax=axes[0])
plot_pacf(df['temp_diff'].dropna(), lags=100, ax=axes[1])
axes[0].set_title('ACF: Differenced Series')
axes[1].set_title('PACF: Differenced Series')
plt.tight_layout()
plt.show()

# Step 5: Identify the seasonal pattern and fit a SARIMA model
# The ACF peaks around lag 365, suggesting annual seasonality.
# Note: a seasonal period of 365 makes SARIMAX fitting very slow; in
# practice, Fourier terms or a coarser seasonal approximation are common.
model = SARIMAX(df['temp'],
                order=(1, 1, 1),
                seasonal_order=(1, 0, 1, 365))
results = model.fit(disp=False)

# Step 6: Validate with residual diagnostics
residuals = results.resid
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Residuals over time
axes[0, 0].plot(residuals)
axes[0, 0].set_title('Residuals Over Time')
# Histogram
axes[0, 1].hist(residuals, bins=30)
axes[0, 1].set_title('Residual Distribution')
# ACF of residuals
plot_acf(residuals, lags=40, ax=axes[1, 0])
axes[1, 0].set_title('Residual ACF')
# PACF of residuals
plot_pacf(residuals, lags=40, ax=axes[1, 1])
axes[1, 1].set_title('Residual PACF')
plt.tight_layout()
plt.show()

print(f"\nResidual mean: {residuals.mean():.4f}")
print(f"Residual std: {residuals.std():.4f}")
```
Good residuals should resemble white noise: mean near zero, constant variance, and no significant autocorrelation. If residuals still show patterns, your model hasn’t captured all the temporal structure.
Autocorrelation analysis is not optional in time series work—it’s the foundation for understanding temporal dependencies and building effective forecasting models. Master ACF and PACF interpretation, and you’ll make better modeling decisions every time.