How to Calculate MAE for Time Series in Python

Key Insights

  • MAE measures forecast accuracy in the same units as your data, making it immediately interpretable—a MAE of 5°C means your temperature predictions are off by an average of 5 degrees
  • Always respect temporal order when splitting time series data for evaluation; random train/test splits will leak future information and give misleadingly optimistic MAE scores
  • Calculate MAE across multiple forecast horizons to understand how your model’s accuracy degrades over time, as most time series models perform worse for longer-range predictions

Introduction to MAE in Time Series Forecasting

Mean Absolute Error (MAE) is one of the most straightforward and interpretable metrics for evaluating time series forecasts. Unlike RMSE (Root Mean Squared Error), which penalizes large errors more heavily, MAE treats all errors equally. This makes it robust to outliers and easier to explain to non-technical stakeholders.

The key advantage of MAE for time series work is its interpretability. If you’re forecasting daily sales and get a MAE of 150, you know your predictions are off by an average of 150 units. There’s no mathematical transformation obscuring the result—the error is in the same units as your original data.

Use MAE when you want a balanced view of forecast accuracy and when outliers shouldn’t dominate your error metric. Choose RMSE when large errors are particularly costly in your application. Avoid MAPE (Mean Absolute Percentage Error) when your time series contains zeros or very small values, as it will explode to infinity.
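To make the tradeoff concrete, here's a small sketch on made-up data with one large miss. Because RMSE squares each error before averaging, a single outlier inflates it far more than it inflates MAE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([100.0, 102.0, 98.0, 101.0, 100.0])
predicted = np.array([101.0, 103.0, 99.0, 100.0, 130.0])  # last point is a big miss

mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))

print(f"MAE:  {mae:.2f}")   # the outlier contributes linearly: 6.80
print(f"RMSE: {rmse:.2f}")  # the outlier is squared, so RMSE is roughly double MAE
```

Four errors of 1 plus one error of 30 give a MAE of 6.8, while RMSE jumps to about 13.4. If that last point were a data glitch rather than a real failure, MAE gives the fairer summary.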

Understanding the MAE Formula

The MAE formula is refreshingly simple:

MAE = (1/n) * Σ|actual - predicted|

Breaking this down:

  • n is the number of observations
  • actual represents your true values
  • predicted represents your model’s forecasts
  • |actual - predicted| is the absolute difference between them
  • Σ means we sum all these absolute differences
  • We divide by n to get the average

The absolute value ensures that overestimates and underestimates don’t cancel each other out. Without it, a prediction of +10 and -10 would appear perfect when averaged, even though both are wrong.

Here’s a manual calculation using NumPy to illustrate the concept:

import numpy as np

# Actual values from our time series
actual = np.array([100, 105, 98, 110, 115])

# Predictions from our model
predicted = np.array([102, 103, 100, 108, 118])

# Calculate absolute errors
absolute_errors = np.abs(actual - predicted)
print(f"Absolute errors: {absolute_errors}")
# Output: [2 2 2 2 3]

# Calculate MAE
mae = np.mean(absolute_errors)
print(f"MAE: {mae}")
# Output: MAE: 2.2

This tells us our model’s predictions are off by an average of 2.2 units. Simple, interpretable, actionable.

Calculating MAE with Scikit-learn

In production code, you’ll want to use scikit-learn’s built-in implementation rather than rolling your own. It’s well-tested, handles edge cases, and integrates seamlessly with other sklearn tools.

Here’s a practical example using real-world-style time series data:

import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

# Create sample time series data (e.g., daily temperature)
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=365, freq='D')
temperature = 15 + 10 * np.sin(np.arange(365) * 2 * np.pi / 365) + np.random.normal(0, 2, 365)

df = pd.DataFrame({'date': dates, 'temperature': temperature})
df['day_of_year'] = df['date'].dt.dayofyear

# Split data temporally (first 300 days for training, last 65 for testing)
train = df.iloc[:300]
test = df.iloc[300:]

# Simple linear model for demonstration
model = LinearRegression()
model.fit(train[['day_of_year']], train['temperature'])

# Generate predictions
predictions = model.predict(test[['day_of_year']])

# Calculate MAE
mae = mean_absolute_error(test['temperature'], predictions)
print(f"Test MAE: {mae:.2f}°C")

The sklearn implementation validates input shapes, supports sample weights and multi-output targets, and is well-tested, making it more robust than a manual calculation. Note that it does not handle missing values—mean_absolute_error raises an error if either array contains NaN, so clean or impute your data first.
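If your series contains NaN values—common in real sensor or sales data—mask them out before scoring, since mean_absolute_error raises a ValueError on NaN. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

actual = np.array([10.0, np.nan, 12.0, 11.0])
predicted = np.array([9.5, 10.0, 13.0, np.nan])

# Keep only positions where both arrays have valid values
mask = ~(np.isnan(actual) | np.isnan(predicted))
mae = mean_absolute_error(actual[mask], predicted[mask])
print(f"MAE on {mask.sum()} valid points: {mae:.2f}")
```

Dropping NaN pairs is the simplest option; for time series with many gaps, imputation (e.g., forward fill) may be more appropriate, but be aware that imputed points then inflate your evaluation set with values the model never had to predict.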

Calculating MAE for Multiple Time Series

Real-world scenarios often require evaluating forecasts across multiple time series or using rolling window validation. Here’s how to handle both:

import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Simulate multiple time series (e.g., sales across different stores)
np.random.seed(42)
n_series = 5
n_points = 100

mae_scores = []

for i in range(n_series):
    # Generate synthetic time series
    actual = np.cumsum(np.random.randn(n_points)) + 100
    # Add some forecast error
    predicted = actual + np.random.normal(0, 5, n_points)
    
    mae = mean_absolute_error(actual, predicted)
    mae_scores.append(mae)
    print(f"Series {i+1} MAE: {mae:.2f}")

# Overall MAE across all series
print(f"\nMean MAE across all series: {np.mean(mae_scores):.2f}")
print(f"Std MAE across all series: {np.std(mae_scores):.2f}")

For rolling window validation, which is crucial for time series:

def rolling_window_mae(data, forecast_fn, window_size=30, forecast_horizon=7):
    """Calculate MAE using rolling window validation.

    forecast_fn should accept (train_window, horizon) and return
    an array of `horizon` forecasts.
    """
    mae_scores = []

    for i in range(window_size, len(data) - forecast_horizon):
        # Train on the most recent window_size observations
        train_window = data[i - window_size:i]

        # Score the next forecast_horizon points
        actual = data[i:i + forecast_horizon]
        predicted = forecast_fn(train_window, forecast_horizon)

        mae_scores.append(mean_absolute_error(actual, predicted))

    return np.array(mae_scores)

# Example: a naive forecast that repeats the last observed value
def naive_forecast(train_window, horizon):
    return np.full(horizon, train_window[-1])

scores = rolling_window_mae(np.cumsum(np.random.randn(100)) + 100, naive_forecast)

# This shows how MAE varies across different time periods
# Useful for detecting when your model performs poorly
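The Key Insights above also recommend tracking MAE per forecast horizon, since accuracy usually degrades the further out you predict. Here's a self-contained sketch using a naive persistence forecast on a synthetic random walk—the persistence model is an assumption made purely for illustration; swap in your own forecaster:

```python
import numpy as np

np.random.seed(42)
data = np.cumsum(np.random.randn(200)) + 100  # synthetic random walk
horizon = 7

# Collect absolute errors separately for each step of the horizon
errors_by_step = [[] for _ in range(horizon)]
for t in range(50, len(data) - horizon):
    # Persistence forecast: repeat the last observed value for every step
    forecast = np.full(horizon, data[t - 1])
    for h in range(horizon):
        errors_by_step[h].append(abs(data[t + h] - forecast[h]))

for h, errs in enumerate(errors_by_step, start=1):
    print(f"Horizon {h}: MAE = {np.mean(errs):.2f}")
```

For a random walk, the per-horizon MAE should grow roughly with the square root of the horizon; a flat curve would suggest your series is dominated by noise the model cannot reduce anyway.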

Visualizing MAE and Prediction Errors

Numbers alone don’t tell the whole story. Visualization helps you understand where and why your model fails:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_absolute_error

# Using our earlier temperature example
np.random.seed(42)
n_points = 100
actual = 15 + 10 * np.sin(np.arange(n_points) * 2 * np.pi / 365) + np.random.normal(0, 2, n_points)
predicted = actual + np.random.normal(0, 3, n_points)

# Plot actual vs predicted
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Actual vs Predicted time series
axes[0, 0].plot(actual, label='Actual', alpha=0.7)
axes[0, 0].plot(predicted, label='Predicted', alpha=0.7)
axes[0, 0].set_title('Actual vs Predicted Values')
axes[0, 0].legend()
axes[0, 0].set_xlabel('Time')
axes[0, 0].set_ylabel('Value')

# Scatter plot
axes[0, 1].scatter(actual, predicted, alpha=0.5)
axes[0, 1].plot([actual.min(), actual.max()], 
                [actual.min(), actual.max()], 
                'r--', label='Perfect prediction')
axes[0, 1].set_title('Prediction Scatter Plot')
axes[0, 1].set_xlabel('Actual')
axes[0, 1].set_ylabel('Predicted')
axes[0, 1].legend()

# Error distribution
errors = actual - predicted
axes[1, 0].hist(errors, bins=30, edgecolor='black')
axes[1, 0].axvline(0, color='r', linestyle='--', label='Zero error')
axes[1, 0].set_title(f'Error Distribution (MAE: {mean_absolute_error(actual, predicted):.2f})')
axes[1, 0].set_xlabel('Prediction Error')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()

# Absolute errors over time
abs_errors = np.abs(errors)
axes[1, 1].plot(abs_errors)
axes[1, 1].axhline(mean_absolute_error(actual, predicted), 
                   color='r', linestyle='--', label='MAE')
axes[1, 1].set_title('Absolute Errors Over Time')
axes[1, 1].set_xlabel('Time')
axes[1, 1].set_ylabel('Absolute Error')
axes[1, 1].legend()

plt.tight_layout()
plt.savefig('mae_analysis.png', dpi=300, bbox_inches='tight')

The scatter plot quickly reveals systematic bias (points consistently above or below the diagonal), while the error-over-time plot shows if accuracy degrades in specific periods.

Best Practices and Common Pitfalls

The most critical mistake in time series evaluation is improper data splitting. Never use random train/test splits:

# WRONG - Random split leaks future information
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# CORRECT - Temporal split respects time ordering
split_point = int(len(data) * 0.8)
train = data[:split_point]
test = data[split_point:]

When comparing MAE across different time series with different scales, consider normalizing:

# Compare models across different-scaled time series
def normalized_mae(actual, predicted):
    """MAE normalized by the range of actual values."""
    mae = mean_absolute_error(actual, predicted)
    data_range = np.max(actual) - np.min(actual)
    return mae / data_range if data_range > 0 else mae

# Or use percentage terms
def mae_percentage(actual, predicted):
    """MAE as percentage of mean actual value."""
    mae = mean_absolute_error(actual, predicted)
    return (mae / np.mean(np.abs(actual))) * 100
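To see why raw MAE misleads across scales, here's a synthetic illustration: two series built with roughly the same relative accuracy (about 2% noise), whose raw MAEs nonetheless differ by a factor of about a thousand:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

np.random.seed(0)
# Same relative accuracy by construction, very different scales
small_actual = np.full(50, 10.0)
large_actual = np.full(50, 10_000.0)
small_pred = small_actual + np.random.normal(0, 0.2, 50)
large_pred = large_actual + np.random.normal(0, 200.0, 50)

mae_small = mean_absolute_error(small_actual, small_pred)
mae_large = mean_absolute_error(large_actual, large_pred)

# Raw MAEs differ by ~1000x; percentages are nearly identical
print(f"Raw MAE:       {mae_small:.2f} vs {mae_large:.2f}")
print(f"MAE % of mean: {mae_small / 10 * 100:.1f}% vs {mae_large / 10_000 * 100:.1f}%")
```

Ranking models by raw MAE here would declare the small-scale series "a thousand times better," which is meaningless; the normalized versions correctly show the two forecasts are equally good.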

For seasonal data, calculate MAE separately for different seasons to identify where your model struggles:

# Seasonal MAE analysis
def seasonal_mae(dates, actual, predicted, freq='M'):
    """Calculate MAE by season/month."""
    df = pd.DataFrame({
        'date': dates,
        'actual': actual,
        'predicted': predicted
    })
    df['period'] = df['date'].dt.to_period(freq)
    
    return df.groupby('period').apply(
        lambda x: mean_absolute_error(x['actual'], x['predicted'])
    )

Always calculate MAE on a held-out test set that the model has never seen. Cross-validation for time series requires special techniques like TimeSeriesSplit to maintain temporal ordering.
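Here's a sketch of that approach with sklearn's TimeSeriesSplit, using a simple linear trend model as a stand-in for a real forecaster. Each fold trains only on observations that precede its test window, so no future information leaks:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

np.random.seed(42)
X = np.arange(200).reshape(-1, 1).astype(float)
y = 0.5 * X.ravel() + np.random.normal(0, 5, 200)  # noisy linear trend

tscv = TimeSeriesSplit(n_splits=5)
fold_maes = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    # Train indices always precede test indices in each fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
    fold_maes.append(mae)
    print(f"Fold {fold}: train ends at index {train_idx[-1]}, MAE = {mae:.2f}")

print(f"Mean CV MAE: {np.mean(fold_maes):.2f}")
```

Reporting the per-fold spread alongside the mean is worthwhile: a model whose MAE is stable across folds is more trustworthy than one with the same average but wildly varying fold scores.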

Conclusion

MAE is your go-to metric for time series evaluation when you need interpretability and robustness. Its simplicity is its strength—stakeholders immediately understand what a MAE of 5 units means without needing a statistics degree.

Remember to respect temporal ordering in your train/test splits, visualize your errors to catch systematic issues, and calculate MAE across different forecast horizons to understand how accuracy degrades over time. When comparing models on different-scaled time series, normalize your MAE values to make fair comparisons.

Start with MAE as your baseline metric, then consider RMSE if your application particularly penalizes large errors, or custom metrics if you have asymmetric costs for over- vs. under-prediction.
