How to Implement ARIMA in Python

Key Insights

  • ARIMA models require stationary data—always test for stationarity with the Augmented Dickey-Fuller test and difference your data if needed before fitting
  • The three parameters (p, d, q) represent autoregressive order, differencing degree, and moving average order—use ACF/PACF plots or auto_arima to identify optimal values
  • Split your data into train/test sets and validate model performance with RMSE and MAE metrics rather than relying solely on in-sample fit statistics

Understanding ARIMA and When to Use It

ARIMA (AutoRegressive Integrated Moving Average) is a statistical model designed for univariate time series forecasting. It works best with data that exhibits temporal dependencies but no strong seasonal patterns. Think stock prices, website traffic without weekly cycles, or temperature trends.

The model combines three components:

  • AR (p): Autoregressive terms that use past values to predict future ones
  • I (d): Differencing operations to make the series stationary
  • MA (q): Moving average terms that model the relationship between observations and past forecast errors

ARIMA excels with short to medium-term forecasts when you have at least 50-100 observations and your data shows trends or momentum. It struggles with seasonal data (use SARIMA instead) and long-term predictions where uncertainty compounds rapidly.

Setting Up Your Environment

Install the necessary libraries before starting. You’ll need statsmodels for ARIMA implementation, pandas for data manipulation, and matplotlib for visualization.

pip install pandas numpy statsmodels matplotlib pmdarima

Here’s a complete setup with a real-world dataset—airline passenger numbers, a classic time series example:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

# Load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv'
df = pd.read_csv(url, parse_dates=['Month'], index_col='Month')
df.columns = ['Passengers']

# Display first few rows
print(df.head())
print(f"Dataset shape: {df.shape}")

This dataset contains monthly passenger totals from 1949 to 1960. Its clear upward trend makes it a handy ARIMA demonstration, though note that it also has strong seasonality, which plain ARIMA ignores; SARIMA (mentioned later as an alternative) would model that component explicitly.

Testing for Stationarity

ARIMA requires stationary data, meaning the statistical properties (mean, variance) remain constant over time. Non-stationary data produces unreliable forecasts. The Augmented Dickey-Fuller (ADF) test determines stationarity statistically.

First, visualize your data:

plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Passengers'])
plt.title('Airline Passengers Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.grid(True)
plt.show()

Now run the ADF test:

def adf_test(series, title=''):
    """
    Perform Augmented Dickey-Fuller test
    """
    result = adfuller(series.dropna())
    print(f'ADF Test: {title}')
    print(f'ADF Statistic: {result[0]:.6f}')
    print(f'p-value: {result[1]:.6f}')
    print(f'Critical Values:')
    for key, value in result[4].items():
        print(f'   {key}: {value:.3f}')
    
    if result[1] <= 0.05:
        print("Result: Series is stationary\n")
    else:
        print("Result: Series is non-stationary\n")
    
    return result[1]

# Test original series
adf_test(df['Passengers'], 'Original Series')

If the p-value exceeds 0.05, your data is non-stationary. Apply differencing to remove trends:

# First-order differencing
df['Passengers_diff'] = df['Passengers'].diff()

# Test differenced series
adf_test(df['Passengers_diff'].dropna(), 'First Differenced Series')

# Visualize differenced data
plt.figure(figsize=(12, 6))
plt.plot(df.index[1:], df['Passengers_diff'].dropna())
plt.title('First Differenced Series')
plt.xlabel('Date')
plt.ylabel('Differenced Passengers')
plt.grid(True)
plt.show()

Most series become stationary after first or second differencing. This differencing order becomes your 'd' parameter.

Identifying ARIMA Parameters

The hardest part of ARIMA is selecting p, d, and q values. You have two approaches: manual analysis with ACF/PACF plots or automated selection with auto_arima.

Manual Parameter Selection

ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots reveal the correlation structure:

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# ACF plot
plot_acf(df['Passengers_diff'].dropna(), lags=40, ax=axes[0])
axes[0].set_title('Autocorrelation Function')

# PACF plot
plot_pacf(df['Passengers_diff'].dropna(), lags=40, ax=axes[1])
axes[1].set_title('Partial Autocorrelation Function')

plt.tight_layout()
plt.show()

Interpreting these plots:

  • PACF: Significant lags indicate AR order (p)
  • ACF: Significant lags indicate MA order (q)

This requires experience. For beginners, use auto_arima:

from pmdarima import auto_arima

# Find optimal parameters
auto_model = auto_arima(
    df['Passengers'],
    start_p=0, start_q=0,
    max_p=5, max_q=5,
    d=None,  # Let it determine d
    seasonal=False,
    stepwise=True,
    suppress_warnings=True,
    error_action='ignore',
    trace=True
)

print(auto_model.summary())

The trace=True parameter shows the model selection process. auto_arima tests multiple combinations and selects the one with the lowest AIC (Akaike Information Criterion).

Building and Fitting the Model

Once you have parameters, create and fit the ARIMA model. Split your data first:

# Split data: 80% train, 20% test
train_size = int(len(df) * 0.8)
train, test = df['Passengers'][:train_size], df['Passengers'][train_size:]

print(f"Training set size: {len(train)}")
print(f"Test set size: {len(test)}")

# Build ARIMA model
# Using parameters from auto_arima (example: ARIMA(2,1,2))
model = ARIMA(train, order=(2, 1, 2))
fitted_model = model.fit()

# Display model summary
print(fitted_model.summary())

The summary shows coefficient estimates, standard errors, and statistical tests. Check the coefficients’ p-values—significant terms (p < 0.05) contribute meaningfully to predictions.

Making Predictions and Evaluating Performance

Generate forecasts for the test period and evaluate accuracy:

# Forecast over the test period
forecast_steps = len(test)
forecast = fitted_model.forecast(steps=forecast_steps)

# Align the forecast with the test index in case the series
# frequency could not be inferred from the data
forecast = pd.Series(np.asarray(forecast), index=test.index)

# Create forecast dataframe
forecast_df = pd.DataFrame({
    'Actual': test.values,
    'Predicted': forecast.values
}, index=test.index)

# Calculate error metrics
rmse = np.sqrt(mean_squared_error(test, forecast))
mae = mean_absolute_error(test, forecast)
mape = np.mean(np.abs((test - forecast) / test)) * 100

print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Mean Absolute Percentage Error: {mape:.2f}%")

# Visualize results
plt.figure(figsize=(14, 7))
plt.plot(train.index, train, label='Training Data', color='blue')
plt.plot(test.index, test, label='Actual Test Data', color='green')
plt.plot(test.index, forecast, label='Forecast', color='red', linestyle='--')
conf_int = fitted_model.get_forecast(steps=forecast_steps).conf_int()
plt.fill_between(test.index,
                 conf_int.iloc[:, 0],
                 conf_int.iloc[:, 1],
                 alpha=0.2, color='red')
plt.title('ARIMA Forecast vs Actual')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.legend()
plt.grid(True)
plt.show()

The shaded region represents a 95% confidence interval. Wider intervals indicate greater uncertainty.

Diagnostic Checking

Always validate model assumptions by examining residuals:

# Plot residual diagnostics
fitted_model.plot_diagnostics(figsize=(14, 8))
plt.show()

Good residuals should:

  • Resemble white noise (no patterns in the residual plot)
  • Follow a normal distribution (Q-Q plot aligns with the diagonal)
  • Show no autocorrelation (ACF plot stays within confidence bounds)

If diagnostics fail, reconsider your parameter choices or try alternative models.

Best Practices and Common Pitfalls

Always validate stationarity first. Non-stationary data leads to spurious relationships and unreliable forecasts. Don’t skip the ADF test.

Use appropriate train/test splits. Time series data requires chronological splits—never shuffle. Reserve at least 20% for testing to validate out-of-sample performance.

Don’t overfit. More parameters don’t guarantee better forecasts. Use AIC/BIC for model selection and prefer simpler models when performance is comparable.

Consider alternatives when ARIMA fails:

  • SARIMA: For seasonal data (weekly, monthly, or yearly patterns)
  • Prophet: For multiple seasonality and holiday effects
  • LSTM/GRU: For complex non-linear patterns with sufficient data

Monitor forecast horizon. ARIMA accuracy degrades rapidly beyond 10-20 steps ahead. For long-term forecasts, retrain frequently with new data or use different approaches.

ARIMA remains relevant because it’s interpretable, requires minimal data, and provides statistical guarantees. Master these fundamentals before exploring deep learning alternatives—you’ll find ARIMA solves many real-world forecasting problems with far less complexity.
