How to Evaluate Time Series Models in Python

Key Insights

  • Time series evaluation requires temporal train-test splits to prevent data leakage—never shuffle your data or use future information to predict the past
  • Standard metrics like RMSE and MAE tell only part of the story; residual analysis reveals whether your model captures all temporal patterns or leaves systematic errors
  • Cross-validation for time series demands specialized approaches like TimeSeriesSplit or walk-forward validation that respect the sequential nature of your data

Introduction to Time Series Model Evaluation

Evaluating time series models isn’t just standard machine learning with dates attached. The temporal dependencies in your data fundamentally change how you measure model quality. Use the wrong evaluation approach, and you’ll get overly optimistic metrics that collapse in production.

The core challenge is data leakage. In traditional ML, you can randomly split data because observations are independent. Time series data violates this assumption—today’s value depends on yesterday’s. If you accidentally train on future data or test on past data, your metrics will lie to you.

Beyond preventing leakage, you need metrics that capture forecast-specific concerns: Are errors symmetric? Does the model consistently over- or under-predict? Do residuals show patterns that indicate missed signal? This article walks through the complete evaluation toolkit for time series models in Python.
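To see the leakage problem concretely, here is a self-contained sketch using a synthetic random walk and a nearest-neighbor regressor (both chosen purely for illustration, not taken from the examples below). A shuffled split surrounds each test point with leaked training neighbors, so the model merely interpolates; a chronological split forces it to extrapolate into a genuinely unseen future.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
t = np.arange(365).reshape(-1, 1)              # time index as the only feature
y = 100 + np.cumsum(rng.standard_normal(365))  # random walk

model = KNeighborsRegressor(n_neighbors=3)

# Shuffled split: each test point has training neighbors on both sides in time
X_tr, X_te, y_tr, y_te = train_test_split(
    t, y, test_size=0.2, shuffle=True, random_state=0)
mae_shuffled = mean_absolute_error(y_te, model.fit(X_tr, y_tr).predict(X_te))

# Chronological split: the model must predict an unseen future
split = int(len(t) * 0.8)
mae_temporal = mean_absolute_error(
    y[split:], model.fit(t[:split], y[:split]).predict(t[split:]))

print(f"Shuffled MAE:      {mae_shuffled:.2f}")  # deceptively small
print(f"Chronological MAE: {mae_temporal:.2f}")  # the honest number
```

The shuffled score looks great and means nothing; the chronological score is the one production will deliver.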

Train-Test Splitting for Time Series

The golden rule: always split chronologically. Your test set must come after your training set in time.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate sample time series data
dates = pd.date_range('2020-01-01', periods=365, freq='D')
values = 100 + np.cumsum(np.random.randn(365)) + 10 * np.sin(np.arange(365) * 2 * np.pi / 365)
ts_data = pd.DataFrame({'date': dates, 'value': values}).set_index('date')

# Simple train-test split (80/20)
train_size = int(len(ts_data) * 0.8)
train, test = ts_data[:train_size], ts_data[train_size:]

print(f"Train: {train.index[0]} to {train.index[-1]}")
print(f"Test: {test.index[0]} to {test.index[-1]}")

# Visualize the split
plt.figure(figsize=(12, 4))
plt.plot(train.index, train['value'], label='Train', color='blue')
plt.plot(test.index, test['value'], label='Test', color='orange')
plt.axvline(x=train.index[-1], color='red', linestyle='--', label='Split Point')
plt.legend()
plt.title('Time Series Train-Test Split')
plt.tight_layout()

For more robust evaluation, use rolling or expanding window splits:

def create_rolling_windows(data, train_size=200, test_size=30, step=30):
    """Create multiple train-test splits with rolling windows"""
    splits = []
    for i in range(0, len(data) - train_size - test_size, step):
        train_end = i + train_size
        test_end = train_end + test_size
        splits.append({
            'train': data[i:train_end],
            'test': data[train_end:test_end]
        })
    return splits

# Generate rolling windows
windows = create_rolling_windows(ts_data, train_size=250, test_size=30, step=30)
print(f"Created {len(windows)} rolling windows for evaluation")

Core Evaluation Metrics

Time series forecasting relies heavily on regression metrics, but interpretation differs from standard ML tasks.

from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

def evaluate_forecast(y_true, y_pred):
    """Calculate comprehensive forecast metrics"""
    
    # Mean Absolute Error - average absolute deviation
    mae = mean_absolute_error(y_true, y_pred)
    
    # Root Mean Squared Error - penalizes large errors more
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    
    # Mean Absolute Percentage Error - scale-independent
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    
    # Symmetric MAPE - handles zero values better
    smape = np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred))) * 100
    
    # Mean Bias Error - detects systematic over/under prediction
    mbe = np.mean(y_pred - y_true)
    
    return {
        'MAE': mae,
        'RMSE': rmse,
        'MAPE': mape,
        'sMAPE': smape,
        'MBE': mbe
    }

# Example usage with simple forecast
y_true = test['value'].values
y_pred = y_true + np.random.randn(len(y_true)) * 5  # Simulated predictions

metrics = evaluate_forecast(y_true, y_pred)
for metric, value in metrics.items():
    print(f"{metric}: {value:.4f}")

Choose metrics based on your use case. RMSE heavily penalizes outliers—good for risk-sensitive applications. MAE treats all errors equally. MAPE works across different scales but breaks when actual values hit zero. sMAPE mitigates that problem, though it is still undefined when actual and predicted values are both zero. MBE reveals directional bias that other metrics miss.
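A minimal check with made-up numbers makes the zero-value problem concrete: a single zero in `y_true` is enough to blow up MAPE, while sMAPE stays finite.

```python
import numpy as np

y_true = np.array([0.0, 10.0, 20.0])
y_pred = np.array([1.0, 11.0, 19.0])

with np.errstate(divide='ignore', invalid='ignore'):
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # hits 1/0 -> inf
smape = np.mean(2 * np.abs(y_pred - y_true)
                / (np.abs(y_true) + np.abs(y_pred))) * 100

print(f"MAPE:  {mape}")       # inf
print(f"sMAPE: {smape:.2f}")  # finite
```

If your series contains zeros (intermittent demand, sparse counts), drop MAPE from your metric set rather than filtering the zeros away.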

Advanced Evaluation Techniques

Metrics alone don’t tell you if your model captured all the signal. Residual analysis reveals what your model missed.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from scipy import stats

def analyze_residuals(y_true, y_pred):
    """Comprehensive residual diagnostics"""
    residuals = y_true - y_pred
    
    # Test for autocorrelation (residuals should be white noise)
    lb_test = acorr_ljungbox(residuals, lags=[10], return_df=True)
    
    # Test for normality
    _, normality_p = stats.normaltest(residuals)
    
    # Create diagnostic plots
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    
    # Residual plot
    axes[0, 0].scatter(range(len(residuals)), residuals, alpha=0.5)
    axes[0, 0].axhline(y=0, color='r', linestyle='--')
    axes[0, 0].set_title('Residual Plot')
    axes[0, 0].set_xlabel('Time')
    axes[0, 0].set_ylabel('Residuals')
    
    # Histogram
    axes[0, 1].hist(residuals, bins=30, edgecolor='black')
    axes[0, 1].set_title('Residual Distribution')
    axes[0, 1].set_xlabel('Residual Value')
    
    # ACF plot
    plot_acf(residuals, lags=20, ax=axes[1, 0])
    axes[1, 0].set_title('Autocorrelation Function')
    
    # Q-Q plot
    stats.probplot(residuals, dist="norm", plot=axes[1, 1])
    axes[1, 1].set_title('Q-Q Plot')
    
    plt.tight_layout()
    
    print(f"Ljung-Box Test p-value: {lb_test['lb_pvalue'].values[0]:.4f}")
    print(f"Normality Test p-value: {normality_p:.4f}")
    print("Good model: p-values > 0.05 (residuals are white noise)")
    
    return residuals

# Analyze residuals from our forecast
residuals = analyze_residuals(y_true, y_pred)

If the Ljung-Box test shows significant autocorrelation (p < 0.05), your model missed temporal patterns. Non-normal residuals suggest outliers or model misspecification.

Cross-Validation for Time Series

Single train-test splits are fragile. Cross-validation provides robust performance estimates, but you must respect temporal ordering.

from sklearn.model_selection import TimeSeriesSplit

def time_series_cv_score(data, model_func, n_splits=5):
    """Perform time series cross-validation"""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    scores = []
    
    for fold, (train_idx, test_idx) in enumerate(tscv.split(data)):
        train_data = data.iloc[train_idx]
        test_data = data.iloc[test_idx]
        
        # Fit model and predict (model_func should return predictions)
        predictions = model_func(train_data, test_data)
        
        # Calculate metrics
        fold_metrics = evaluate_forecast(
            test_data['value'].values, 
            predictions
        )
        scores.append(fold_metrics)
        
        print(f"Fold {fold + 1}: RMSE = {fold_metrics['RMSE']:.4f}")
    
    # Average across folds
    avg_scores = {metric: np.mean([s[metric] for s in scores]) 
                  for metric in scores[0].keys()}
    
    return avg_scores, scores

# Example: Simple naive forecast for demonstration
def naive_forecast(train, test):
    """Predict last training value for all test points"""
    return np.full(len(test), train['value'].iloc[-1])

avg_metrics, fold_metrics = time_series_cv_score(ts_data, naive_forecast, n_splits=5)
print(f"\nAverage RMSE across folds: {avg_metrics['RMSE']:.4f}")

TimeSeriesSplit uses expanding windows: each fold trains on all data up to a cutoff, then tests on the next chunk. This mimics production, where you retrain on the full history before each forecast.
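A quick look at the fold boundaries makes the expanding-window behavior visible (365 observations here, matching the sample data above):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(np.arange(365)), start=1):
    print(f"Fold {fold}: train on indices 0-{train_idx[-1]}, "
          f"test on {test_idx[0]}-{test_idx[-1]}")
```

Each fold's training set ends where the previous test set began, and every test chunk is the same size, so later folds benefit from more history—just as later retrains do in production.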

Comparing Multiple Models

Evaluation shines when comparing models. Build a comparison framework that’s easy to extend.

def compare_models(data, models_dict, train_size=0.8):
    """Compare multiple forecasting models"""
    split_idx = int(len(data) * train_size)
    train, test = data[:split_idx], data[split_idx:]
    
    results = []
    predictions = {}
    
    for name, model_func in models_dict.items():
        # Get predictions
        y_pred = model_func(train, test)
        predictions[name] = y_pred
        
        # Calculate metrics
        metrics = evaluate_forecast(test['value'].values, y_pred)
        metrics['Model'] = name
        results.append(metrics)
    
    # Create comparison DataFrame
    comparison_df = pd.DataFrame(results).set_index('Model')
    
    # Visualize forecasts
    plt.figure(figsize=(14, 6))
    plt.plot(train.index, train['value'], label='Train', color='gray', alpha=0.5)
    plt.plot(test.index, test['value'], label='Actual', color='black', linewidth=2)
    
    colors = ['red', 'blue', 'green', 'orange']
    for (name, pred), color in zip(predictions.items(), colors):
        plt.plot(test.index, pred, label=name, color=color, linestyle='--')
    
    plt.legend()
    plt.title('Model Comparison: Forecasts vs Actual')
    plt.tight_layout()
    
    return comparison_df

# Define models to compare
def moving_average_forecast(train, test, window=7):
    last_values = train['value'].tail(window).values
    return np.full(len(test), last_values.mean())

models = {
    'Naive': naive_forecast,
    'MA(7)': lambda tr, te: moving_average_forecast(tr, te, 7),
    'MA(30)': lambda tr, te: moving_average_forecast(tr, te, 30)
}

comparison = compare_models(ts_data, models)
print(comparison.round(4))

Practical Example: End-to-End Evaluation Pipeline

Here’s a complete, reusable evaluation pipeline:

class TimeSeriesEvaluator:
    """Complete evaluation pipeline for time series models"""
    
    def __init__(self, data, target_col='value'):
        self.data = data
        self.target_col = target_col
        self.results = {}
    
    def evaluate_model(self, model_func, model_name, train_size=0.8):
        """Evaluate a single model"""
        split_idx = int(len(self.data) * train_size)
        train = self.data[:split_idx]
        test = self.data[split_idx:]
        
        # Generate predictions
        y_pred = model_func(train, test)
        y_true = test[self.target_col].values
        
        # Calculate metrics
        metrics = evaluate_forecast(y_true, y_pred)
        
        # Analyze residuals
        residuals = y_true - y_pred
        lb_test = acorr_ljungbox(residuals, lags=[10], return_df=True)
        metrics['Ljung_Box_p'] = lb_test['lb_pvalue'].values[0]
        
        # Store results
        self.results[model_name] = {
            'metrics': metrics,
            'predictions': y_pred,
            'actuals': y_true,
            'test_index': test.index
        }
        
        return metrics
    
    def get_comparison_table(self):
        """Generate comparison table across all models"""
        rows = []
        for name, data in self.results.items():
            row = data['metrics'].copy()
            row['Model'] = name
            rows.append(row)
        return pd.DataFrame(rows).set_index('Model')
    
    def plot_forecasts(self):
        """Visualize all model forecasts"""
        plt.figure(figsize=(14, 6))
        
        for name, data in self.results.items():
            plt.plot(data['test_index'], data['predictions'], 
                    label=f"{name} (RMSE: {data['metrics']['RMSE']:.2f})", 
                    linestyle='--', linewidth=2)
        
        # Plot actual values
        first_result = list(self.results.values())[0]
        plt.plot(first_result['test_index'], first_result['actuals'], 
                label='Actual', color='black', linewidth=2)
        
        plt.legend()
        plt.title('Model Forecast Comparison')
        plt.xlabel('Date')
        plt.ylabel('Value')
        plt.tight_layout()

# Use the evaluator
evaluator = TimeSeriesEvaluator(ts_data)
evaluator.evaluate_model(naive_forecast, 'Naive')
evaluator.evaluate_model(lambda tr, te: moving_average_forecast(tr, te, 7), 'MA(7)')

print(evaluator.get_comparison_table())
evaluator.plot_forecasts()

This pipeline handles the complete workflow: splitting data, generating predictions, calculating metrics, analyzing residuals, and visualizing results. Extend it by adding your own models or metrics.

Time series evaluation is about building confidence that your model will perform in production. Use temporal splits, check multiple metrics, analyze residuals, and always validate across different time periods. The code here gives you a foundation to evaluate any forecasting model rigorously.
