How to Evaluate Time Series Models in Python
Key Insights
- Time series evaluation requires temporal train-test splits to prevent data leakage—never shuffle your data or use future information to predict the past
- Standard metrics like RMSE and MAE tell only part of the story; residual analysis reveals whether your model captures all temporal patterns or leaves systematic errors
- Cross-validation for time series demands specialized approaches like TimeSeriesSplit or walk-forward validation that respect the sequential nature of your data
Introduction to Time Series Model Evaluation
Evaluating time series models isn’t just standard machine learning with dates attached. The temporal dependencies in your data fundamentally change how you measure model quality. Use the wrong evaluation approach, and you’ll get overly optimistic metrics that collapse in production.
The core challenge is data leakage. In traditional ML, you can randomly split data because observations are independent. Time series data violates this assumption—today’s value depends on yesterday’s. If you accidentally train on future data or test on past data, your metrics will lie to you.
Beyond preventing leakage, you need metrics that capture forecast-specific concerns: Are errors symmetric? Does the model consistently over- or under-predict? Do residuals show patterns that indicate missed signal? This article walks through the complete evaluation toolkit for time series models in Python.
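To make the stakes concrete, here is a minimal sketch (synthetic data and a scikit-learn decision tree, not part of the original article) comparing a shuffled split against a chronological one on a trending series with a lag-1 feature. The shuffled split lets the model interpolate between points it has effectively already seen, so its error looks deceptively small.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
t = np.arange(500)
y = 0.1 * t + rng.normal(scale=1.0, size=500)  # upward trend + noise

X = y[:-1].reshape(-1, 1)  # lag-1 feature
target = y[1:]

# Shuffled split: test points fall inside the range the model trained on
X_tr, X_te, y_tr, y_te = train_test_split(
    X, target, test_size=0.2, shuffle=True, random_state=0)
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
rmse_shuffled = np.sqrt(mean_squared_error(y_te, tree.predict(X_te)))

# Chronological split: test points lie beyond anything the model has seen
split = int(len(X) * 0.8)
tree = DecisionTreeRegressor(random_state=0).fit(X[:split], target[:split])
rmse_temporal = np.sqrt(mean_squared_error(target[split:], tree.predict(X[split:])))

print(f"Shuffled-split RMSE:      {rmse_shuffled:.2f}")  # looks great
print(f"Chronological-split RMSE: {rmse_temporal:.2f}")  # noticeably worse
The gap between the two numbers is exactly the optimism you would be shipping to production if you evaluated with a shuffled split.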
Train-Test Splitting for Time Series
The golden rule: always split chronologically. Your test set must come after your training set in time.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Generate sample time series data
dates = pd.date_range('2020-01-01', periods=365, freq='D')
values = 100 + np.cumsum(np.random.randn(365)) + 10 * np.sin(np.arange(365) * 2 * np.pi / 365)
ts_data = pd.DataFrame({'date': dates, 'value': values}).set_index('date')
# Simple train-test split (80/20)
train_size = int(len(ts_data) * 0.8)
train, test = ts_data[:train_size], ts_data[train_size:]
print(f"Train: {train.index[0]} to {train.index[-1]}")
print(f"Test: {test.index[0]} to {test.index[-1]}")
# Visualize the split
plt.figure(figsize=(12, 4))
plt.plot(train.index, train['value'], label='Train', color='blue')
plt.plot(test.index, test['value'], label='Test', color='orange')
plt.axvline(x=train.index[-1], color='red', linestyle='--', label='Split Point')
plt.legend()
plt.title('Time Series Train-Test Split')
plt.tight_layout()
For more robust evaluation, use rolling or expanding window splits:
def create_rolling_windows(data, train_size=200, test_size=30, step=30):
    """Create multiple train-test splits with rolling windows"""
    splits = []
    for i in range(0, len(data) - train_size - test_size, step):
        train_end = i + train_size
        test_end = train_end + test_size
        splits.append({
            'train': data[i:train_end],
            'test': data[train_end:test_end]
        })
    return splits
# Generate rolling windows
windows = create_rolling_windows(ts_data, train_size=250, test_size=30, step=30)
print(f"Created {len(windows)} rolling windows for evaluation")
Core Evaluation Metrics
Time series forecasting relies heavily on regression metrics, but interpretation differs from standard ML tasks.
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
def evaluate_forecast(y_true, y_pred):
    """Calculate comprehensive forecast metrics"""
    # Mean Absolute Error - average absolute deviation
    mae = mean_absolute_error(y_true, y_pred)
    # Root Mean Squared Error - penalizes large errors more
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    # Mean Absolute Percentage Error - scale-independent
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    # Symmetric MAPE - handles zero values better
    smape = np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred))) * 100
    # Mean Bias Error - detects systematic over/under prediction
    mbe = np.mean(y_pred - y_true)
    return {
        'MAE': mae,
        'RMSE': rmse,
        'MAPE': mape,
        'sMAPE': smape,
        'MBE': mbe
    }
# Example usage with simple forecast
y_true = test['value'].values
y_pred = y_true + np.random.randn(len(y_true)) * 5 # Simulated predictions
metrics = evaluate_forecast(y_true, y_pred)
for metric, value in metrics.items():
    print(f"{metric}: {value:.4f}")
Choose metrics based on your use case. RMSE heavily penalizes outliers, which suits risk-sensitive applications. MAE treats all errors equally. MAPE works across different scales but breaks when actual values are zero; sMAPE mitigates that problem (it only fails when actual and predicted values are both zero). MBE reveals directional bias that the other metrics miss.
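The zero-value problem is easy to see on a toy example; the numbers below are made up purely for illustration:
# Toy illustration (made-up numbers): MAPE blows up when an actual value
# is zero, while sMAPE stays finite as long as actual and predicted
# values aren't both zero.
y_true_demo = np.array([0.0, 2.0, 4.0])
y_pred_demo = np.array([1.0, 2.5, 3.0])

with np.errstate(divide='ignore'):
    mape_demo = np.mean(np.abs((y_true_demo - y_pred_demo) / y_true_demo)) * 100
smape_demo = np.mean(2 * np.abs(y_pred_demo - y_true_demo)
                     / (np.abs(y_true_demo) + np.abs(y_pred_demo))) * 100

print(f"MAPE:  {mape_demo}")       # inf -- the zero actual blows it up
print(f"sMAPE: {smape_demo:.2f}")  # finite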
Advanced Evaluation Techniques
Metrics alone don’t tell you if your model captured all the signal. Residual analysis reveals what your model missed.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from scipy import stats
def analyze_residuals(y_true, y_pred):
    """Comprehensive residual diagnostics"""
    residuals = y_true - y_pred
    # Test for autocorrelation (residuals should be white noise)
    lb_test = acorr_ljungbox(residuals, lags=[10], return_df=True)
    # Test for normality
    _, normality_p = stats.normaltest(residuals)
    # Create diagnostic plots
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    # Residual plot
    axes[0, 0].scatter(range(len(residuals)), residuals, alpha=0.5)
    axes[0, 0].axhline(y=0, color='r', linestyle='--')
    axes[0, 0].set_title('Residual Plot')
    axes[0, 0].set_xlabel('Time')
    axes[0, 0].set_ylabel('Residuals')
    # Histogram
    axes[0, 1].hist(residuals, bins=30, edgecolor='black')
    axes[0, 1].set_title('Residual Distribution')
    axes[0, 1].set_xlabel('Residual Value')
    # ACF plot
    plot_acf(residuals, lags=20, ax=axes[1, 0])
    axes[1, 0].set_title('Autocorrelation Function')
    # Q-Q plot
    stats.probplot(residuals, dist="norm", plot=axes[1, 1])
    axes[1, 1].set_title('Q-Q Plot')
    plt.tight_layout()
    print(f"Ljung-Box Test p-value: {lb_test['lb_pvalue'].values[0]:.4f}")
    print(f"Normality Test p-value: {normality_p:.4f}")
    print("Good model: p-values > 0.05 (residuals are white noise)")
    return residuals
# Analyze residuals from our forecast
residuals = analyze_residuals(y_true, y_pred)
If the Ljung-Box test shows significant autocorrelation (p < 0.05), your model missed temporal patterns. Non-normal residuals suggest outliers or model misspecification.
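For intuition about how the test behaves, here is a small synthetic sketch (not from the original analysis): white-noise residuals should typically produce a large Ljung-Box p-value, while residuals that still carry AR(1) structure produce a very small one.
# Synthetic residuals: pure noise vs. leftover AR(1) structure.
rng = np.random.default_rng(42)

white_noise = rng.normal(size=300)

ar1 = np.zeros(300)
for t in range(1, 300):
    ar1[t] = 0.7 * ar1[t - 1] + rng.normal()  # autocorrelated "residuals"

for name, resid in [('white noise', white_noise), ('AR(1) leftover', ar1)]:
    p = acorr_ljungbox(resid, lags=[10], return_df=True)['lb_pvalue'].iloc[0]
    print(f"{name}: Ljung-Box p = {p:.4f}")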
Cross-Validation for Time Series
Single train-test splits are fragile. Cross-validation provides robust performance estimates, but you must respect temporal ordering.
from sklearn.model_selection import TimeSeriesSplit
def time_series_cv_score(data, model_func, n_splits=5):
    """Perform time series cross-validation"""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    scores = []
    for fold, (train_idx, test_idx) in enumerate(tscv.split(data)):
        train_data = data.iloc[train_idx]
        test_data = data.iloc[test_idx]
        # Fit model and predict (model_func should return predictions)
        predictions = model_func(train_data, test_data)
        # Calculate metrics
        fold_metrics = evaluate_forecast(
            test_data['value'].values,
            predictions
        )
        scores.append(fold_metrics)
        print(f"Fold {fold + 1}: RMSE = {fold_metrics['RMSE']:.4f}")
    # Average across folds
    avg_scores = {metric: np.mean([s[metric] for s in scores])
                  for metric in scores[0].keys()}
    return avg_scores, scores
# Example: Simple naive forecast for demonstration
def naive_forecast(train, test):
    """Predict last training value for all test points"""
    return np.full(len(test), train['value'].iloc[-1])
avg_metrics, fold_metrics = time_series_cv_score(ts_data, naive_forecast, n_splits=5)
print(f"\nAverage RMSE across folds: {avg_metrics['RMSE']:.4f}")
TimeSeriesSplit creates successively larger training sets, testing on the next chunk of data each time. This mimics production, where you retrain on all historical data before forecasting the next period.
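To see that behavior directly, you can print the fold boundaries. TimeSeriesSplit also accepts `gap` (to leave a buffer between train and test) and `max_train_size` (to cap the training window) if your setup needs them; the sketch below assumes the `ts_data` frame defined earlier.
# Inspect the expanding-window folds; gap=7 leaves a one-week buffer
# between each training set and its test block.
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5, gap=7).split(ts_data)):
    print(f"Fold {fold + 1}: train rows 0-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}")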
Comparing Multiple Models
Evaluation shines when comparing models. Build a comparison framework that’s easy to extend.
def compare_models(data, models_dict, train_size=0.8):
    """Compare multiple forecasting models"""
    split_idx = int(len(data) * train_size)
    train, test = data[:split_idx], data[split_idx:]
    results = []
    predictions = {}
    for name, model_func in models_dict.items():
        # Get predictions
        y_pred = model_func(train, test)
        predictions[name] = y_pred
        # Calculate metrics
        metrics = evaluate_forecast(test['value'].values, y_pred)
        metrics['Model'] = name
        results.append(metrics)
    # Create comparison DataFrame
    comparison_df = pd.DataFrame(results).set_index('Model')
    # Visualize forecasts
    plt.figure(figsize=(14, 6))
    plt.plot(train.index, train['value'], label='Train', color='gray', alpha=0.5)
    plt.plot(test.index, test['value'], label='Actual', color='black', linewidth=2)
    colors = ['red', 'blue', 'green', 'orange']
    for (name, pred), color in zip(predictions.items(), colors):
        plt.plot(test.index, pred, label=name, color=color, linestyle='--')
    plt.legend()
    plt.title('Model Comparison: Forecasts vs Actual')
    plt.tight_layout()
    return comparison_df
# Define models to compare
def moving_average_forecast(train, test, window=7):
    last_values = train['value'].tail(window).values
    return np.full(len(test), last_values.mean())

models = {
    'Naive': naive_forecast,
    'MA(7)': lambda tr, te: moving_average_forecast(tr, te, 7),
    'MA(30)': lambda tr, te: moving_average_forecast(tr, te, 30)
}
comparison = compare_models(ts_data, models)
print(comparison.round(4))
Practical Example: End-to-End Evaluation Pipeline
Here’s a complete, reusable evaluation pipeline:
class TimeSeriesEvaluator:
    """Complete evaluation pipeline for time series models"""

    def __init__(self, data, target_col='value'):
        self.data = data
        self.target_col = target_col
        self.results = {}

    def evaluate_model(self, model_func, model_name, train_size=0.8):
        """Evaluate a single model"""
        split_idx = int(len(self.data) * train_size)
        train = self.data[:split_idx]
        test = self.data[split_idx:]
        # Generate predictions
        y_pred = model_func(train, test)
        y_true = test[self.target_col].values
        # Calculate metrics
        metrics = evaluate_forecast(y_true, y_pred)
        # Analyze residuals
        residuals = y_true - y_pred
        lb_test = acorr_ljungbox(residuals, lags=[10], return_df=True)
        metrics['Ljung_Box_p'] = lb_test['lb_pvalue'].values[0]
        # Store results
        self.results[model_name] = {
            'metrics': metrics,
            'predictions': y_pred,
            'actuals': y_true,
            'test_index': test.index
        }
        return metrics

    def get_comparison_table(self):
        """Generate comparison table across all models"""
        rows = []
        for name, data in self.results.items():
            row = data['metrics'].copy()
            row['Model'] = name
            rows.append(row)
        return pd.DataFrame(rows).set_index('Model')

    def plot_forecasts(self):
        """Visualize all model forecasts"""
        plt.figure(figsize=(14, 6))
        for name, data in self.results.items():
            plt.plot(data['test_index'], data['predictions'],
                     label=f"{name} (RMSE: {data['metrics']['RMSE']:.2f})",
                     linestyle='--', linewidth=2)
        # Plot actual values
        first_result = list(self.results.values())[0]
        plt.plot(first_result['test_index'], first_result['actuals'],
                 label='Actual', color='black', linewidth=2)
        plt.legend()
        plt.title('Model Forecast Comparison')
        plt.xlabel('Date')
        plt.ylabel('Value')
        plt.tight_layout()
# Use the evaluator
evaluator = TimeSeriesEvaluator(ts_data)
evaluator.evaluate_model(naive_forecast, 'Naive')
evaluator.evaluate_model(lambda tr, te: moving_average_forecast(tr, te, 7), 'MA(7)')
print(evaluator.get_comparison_table())
evaluator.plot_forecasts()
This pipeline handles the complete workflow: splitting data, generating predictions, calculating metrics, analyzing residuals, and visualizing results. Extend it by adding your own models or metrics.
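For example, adding a model only requires another `(train, test) -> predictions` function. The sketch below is a hypothetical seasonal naive forecaster (repeating values from seven days earlier), not part of the original pipeline:
# Hypothetical extension: a seasonal naive model. Each prediction repeats
# the value from `season` steps earlier, so the test horizon cycles
# through the last observed week of training data.
def seasonal_naive_forecast(train, test, season=7):
    history = list(train['value'].values)
    preds = []
    for _ in range(len(test)):
        preds.append(history[-season])  # value from one season ago
        history.append(preds[-1])       # extend history with the forecast
    return np.array(preds)

evaluator.evaluate_model(seasonal_naive_forecast, 'SeasonalNaive(7)')
print(evaluator.get_comparison_table())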
Time series evaluation is about building confidence that your model will perform in production. Use temporal splits, check multiple metrics, analyze residuals, and always validate across different time periods. The code here gives you a foundation to evaluate any forecasting model rigorously.