How to Calculate AIC and BIC in Python
Key Insights
- AIC and BIC both measure model quality by balancing goodness-of-fit against complexity, but BIC applies a stronger penalty that grows with sample size, making it more conservative for large datasets.
- Statsmodels provides built-in `.aic` and `.bic` attributes on fitted models, eliminating the need for manual calculation in most regression and time series workflows.
- Lower values indicate better models, but differences less than 2 are generally considered negligible—don’t chase marginal improvements at the expense of interpretability.
Introduction to Model Selection Criteria
Model selection is one of the most consequential decisions in statistical modeling. Add too few predictors and you underfit, missing important patterns. Add too many and you overfit, capturing noise that won’t generalize. AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) provide principled ways to navigate this tradeoff.
Both criteria measure how well a model fits the data while penalizing complexity. The core idea is simple: a model that explains the data well but uses fewer parameters is preferable to one that explains it slightly better but requires many more parameters. This penalty for complexity is what separates information criteria from raw measures like R-squared, which will always improve (or stay the same) as you add variables.
In practice, you’ll use these criteria to compare candidate models and select the one that best balances fit and parsimony. They’re particularly valuable when you can’t rely on nested model comparisons or when you need an automated selection procedure.
Mathematical Foundations
Understanding the formulas helps you reason about when each criterion is appropriate.
AIC is defined as:
AIC = 2k - 2ln(L)
Where k is the number of estimated parameters and L is the maximized likelihood of the model.
BIC (also called the Schwarz criterion) is defined as:
BIC = k·ln(n) - 2ln(L)
Where n is the sample size, in addition to k and L from above.
The critical difference lies in the penalty term. AIC uses a fixed penalty of 2 per parameter, while BIC’s penalty scales with the logarithm of the sample size. For any dataset with more than 7 observations (since ln(8) ≈ 2.08), BIC penalizes additional parameters more heavily than AIC.
This has practical implications: BIC tends to select simpler models, especially as sample size grows. AIC is more permissive with complexity. Neither is universally “better”—they answer slightly different questions. AIC focuses on predictive accuracy, while BIC aims to identify the true model (assuming it exists in your candidate set).
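The penalty gap is easy to see numerically. This short sketch (an illustration, not from the original derivation) prints the per-parameter penalty each criterion charges at several sample sizes:

```python
import numpy as np

# Per-parameter penalty: AIC always charges 2, BIC charges ln(n)
for n in [10, 100, 1000, 100000]:
    print(f"n = {n:>6}: AIC penalty = 2.00, BIC penalty = {np.log(n):.2f}")
```

At n = 100 the BIC penalty per parameter (ln(100) ≈ 4.61) is already more than double AIC's, and the gap only widens as n grows.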
Calculating AIC/BIC with Statsmodels
For most regression work in Python, statsmodels handles the calculation automatically. Here’s how to fit models and access their information criteria:
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Generate sample data
np.random.seed(42)
n = 200
x1 = np.random.normal(0, 1, n)
x2 = np.random.normal(0, 1, n)
x3 = np.random.normal(0, 1, n)  # Noise variable
y = 3 + 2*x1 - 1.5*x2 + np.random.normal(0, 1, n)

# Create a DataFrame holding all candidate predictors
df = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'x3': x3})

# Model 1: Only x1
X1 = sm.add_constant(df[['x1']])
model1 = sm.OLS(df['y'], X1).fit()

# Model 2: x1 and x2 (true model)
X2 = sm.add_constant(df[['x1', 'x2']])
model2 = sm.OLS(df['y'], X2).fit()

# Model 3: All variables including noise
X3 = sm.add_constant(df[['x1', 'x2', 'x3']])
model3 = sm.OLS(df['y'], X3).fit()

# Compare information criteria
print("Model 1 (x1 only):")
print(f"  AIC: {model1.aic:.2f}, BIC: {model1.bic:.2f}")
print("\nModel 2 (x1, x2):")
print(f"  AIC: {model2.aic:.2f}, BIC: {model2.bic:.2f}")
print("\nModel 3 (x1, x2, x3):")
print(f"  AIC: {model3.aic:.2f}, BIC: {model3.bic:.2f}")
```
Running this produces output showing that Model 2 (the true data-generating process) has the lowest AIC and BIC. The noise variable in Model 3 increases both criteria despite slightly improving the raw fit.
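Once the candidates are fitted, picking the winner programmatically is a one-liner. A minimal sketch, with hypothetical AIC values standing in for `model1.aic` through `model3.aic`:

```python
# Hypothetical AIC values standing in for model1.aic, model2.aic, model3.aic
aics = {
    'Model 1 (x1 only)': 712.4,
    'Model 2 (x1, x2)': 574.8,
    'Model 3 (x1, x2, x3)': 576.1,
}

best = min(aics, key=aics.get)
print(best)  # name of the lowest-AIC model
```

The same pattern works for BIC by swapping in the `.bic` values.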
AIC/BIC for Time Series Models
Time series model selection is where information criteria really shine. Choosing ARIMA orders manually is tedious; AIC and BIC automate the process.
```python
import warnings
from statsmodels.tsa.arima.model import ARIMA

# Generate sample time series data
np.random.seed(123)
n_obs = 300
errors = np.random.normal(0, 1, n_obs)
y_ts = np.zeros(n_obs)

# AR(2) process
for t in range(2, n_obs):
    y_ts[t] = 0.5*y_ts[t-1] + 0.3*y_ts[t-2] + errors[t]

# Compare different ARIMA specifications
orders = [(1, 0, 0), (2, 0, 0), (1, 0, 1), (2, 0, 1), (2, 0, 2)]
results = []
warnings.filterwarnings('ignore')

for order in orders:
    try:
        model = ARIMA(y_ts, order=order).fit()
        results.append({
            'Order': f'ARIMA{order}',
            'AIC': model.aic,
            'BIC': model.bic,
            'Log-Likelihood': model.llf
        })
    except Exception as e:
        print(f"Failed to fit ARIMA{order}: {e}")

results_df = pd.DataFrame(results)
results_df = results_df.sort_values('BIC').reset_index(drop=True)
print(results_df.to_string(index=False))
```
The true model is AR(2), so we expect ARIMA(2,0,0) to perform well. Both AIC and BIC should favor this specification, though more complex models might achieve marginally better AIC values.
For automated selection, consider pmdarima:
```python
from pmdarima import auto_arima

# Automatic model selection
auto_model = auto_arima(
    y_ts,
    start_p=0, max_p=3,
    start_q=0, max_q=3,
    d=0,
    information_criterion='bic',
    trace=True,
    suppress_warnings=True
)

print(f"\nBest model: {auto_model.order}")
print(f"AIC: {auto_model.aic():.2f}, BIC: {auto_model.bic():.2f}")
```
Manual Calculation from Scratch
Sometimes you need to compute these criteria yourself—perhaps for a custom model or to verify library output. Here’s a straightforward implementation:
```python
def calculate_aic(log_likelihood, k):
    """
    Calculate the Akaike Information Criterion.

    Parameters
    ----------
    log_likelihood : float
        Maximized log-likelihood of the model
    k : int
        Number of estimated parameters

    Returns
    -------
    float : AIC value
    """
    return 2 * k - 2 * log_likelihood


def calculate_bic(log_likelihood, k, n):
    """
    Calculate the Bayesian Information Criterion.

    Parameters
    ----------
    log_likelihood : float
        Maximized log-likelihood of the model
    k : int
        Number of estimated parameters
    n : int
        Number of observations

    Returns
    -------
    float : BIC value
    """
    return k * np.log(n) - 2 * log_likelihood


def calculate_aic_corrected(log_likelihood, k, n):
    """
    Calculate the corrected AIC (AICc) for small samples.
    Use when n/k < 40.
    """
    aic = calculate_aic(log_likelihood, k)
    correction = (2 * k * (k + 1)) / (n - k - 1)
    return aic + correction


# Verify against statsmodels
# Note: for OLS, statsmodels counts only the regression coefficients
# (including the intercept) — it does not add one for the error variance
k = len(model2.params)
n = model2.nobs
ll = model2.llf

manual_aic = calculate_aic(ll, k)
manual_bic = calculate_bic(ll, k, n)

print(f"Manual AIC: {manual_aic:.2f}, Statsmodels AIC: {model2.aic:.2f}")
print(f"Manual BIC: {manual_bic:.2f}, Statsmodels BIC: {model2.bic:.2f}")
```
A note on parameter counting: different libraries count parameters differently. Some include the error variance, others don’t. Always verify your manual calculations against library output to understand the convention being used.
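One way to check a library's convention is to invert the AIC formula: since AIC = 2k − 2ln(L), the implied parameter count is k = (AIC + 2·ln(L)) / 2. A small helper (hypothetical, not part of any library) that works with any results object exposing its AIC and log-likelihood:

```python
def implied_param_count(aic, llf):
    """Back out the parameter count a library used: AIC = 2k - 2*llf  =>  k = (AIC + 2*llf) / 2."""
    return (aic + 2 * llf) / 2

# Example with made-up numbers: log-likelihood -250, library reports AIC = 508
print(implied_param_count(508.0, -250.0))  # 4.0 -> this library counted 4 parameters
```

Comparing the result against the length of the coefficient vector tells you immediately whether the error variance was included in the count.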
Comparing Multiple Models
In practice, you’ll often compare many candidate models. Here’s a reusable pattern for building comparison tables:
```python
def compare_models(models_dict, n_obs):
    """
    Compare multiple fitted models using information criteria.

    Parameters
    ----------
    models_dict : dict
        Dictionary mapping model names to fitted statsmodels results
    n_obs : int
        Number of observations (for verification)

    Returns
    -------
    DataFrame with model comparison metrics
    """
    comparison = []
    for name, model in models_dict.items():
        comparison.append({
            'Model': name,
            'Num_Params': len(model.params),
            'Log_Likelihood': model.llf,
            'AIC': model.aic,
            'BIC': model.bic,
            'R_squared': model.rsquared
        })
    df = pd.DataFrame(comparison)

    # Calculate delta values (difference from best)
    df['Delta_AIC'] = df['AIC'] - df['AIC'].min()
    df['Delta_BIC'] = df['BIC'] - df['BIC'].min()

    # Sort by BIC (or AIC, depending on preference)
    df = df.sort_values('BIC').reset_index(drop=True)
    return df


# Build comparison
models = {
    'Simple (x1)': model1,
    'True (x1, x2)': model2,
    'Overfit (x1, x2, x3)': model3
}
comparison_df = compare_models(models, n)
print(comparison_df.to_string(index=False))
```
Interpreting delta values follows established rules of thumb:
- Delta < 2: Models are essentially equivalent; no strong preference
- Delta 2-6: Moderate evidence against the higher-scoring model
- Delta > 10: Strong evidence; the higher-scoring model has essentially no support
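Delta values can also be converted into Akaike weights, which read as the relative probability that each candidate is the best model in the set. A minimal numpy sketch (the `akaike_weights` helper is illustrative, not a library function):

```python
import numpy as np

def akaike_weights(aic_values):
    """Convert raw AIC values to Akaike weights: w_i = exp(-delta_i/2) / sum_j exp(-delta_j/2)."""
    delta = np.asarray(aic_values, dtype=float)
    delta = delta - delta.min()
    w = np.exp(-0.5 * delta)
    return w / w.sum()

print(akaike_weights([500.1, 500.5, 510.0]).round(3))
```

Here the first two models share nearly all the weight (delta 0.4 is negligible), while the third, at delta 9.9, receives essentially none — the same verdict the thresholds above give.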
Best Practices and Limitations
When to prefer AIC: Use AIC when your goal is prediction and you’re willing to accept slightly more complex models for marginal accuracy gains. AIC is also preferred when the true model isn’t necessarily in your candidate set—it aims for the best approximation rather than truth recovery.
When to prefer BIC: Use BIC when you want to identify the true underlying model (assuming it exists in your candidates) or when you have large samples and want to avoid overfitting. BIC’s consistency property means it will select the true model with probability approaching 1 as sample size increases.
Common pitfalls to avoid:
- Comparing models on different data. AIC and BIC values are only comparable when models are fit to identical datasets. Different subsets, different transformations, or different handling of missing values invalidates comparisons.
- Ignoring the likelihood function. You can only compare models using the same distributional assumptions. Comparing AIC from a linear regression to AIC from a Poisson regression is meaningless.
- Chasing small differences. A model with AIC of 500.1 is not meaningfully better than one with AIC of 500.5. Use the delta thresholds and exercise judgment.
- Forgetting domain knowledge. Information criteria are tools, not oracles. If a simpler model makes theoretical sense and has only marginally worse AIC, prefer interpretability.
- Using raw AIC/BIC for small samples. When your sample size is small relative to the number of parameters (n/k < 40), use the corrected AICc instead of standard AIC.
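To see why the small-sample correction matters, compare AIC and AICc at a modest sample size (the log-likelihood below is a made-up value for illustration):

```python
def aicc(log_likelihood, k, n):
    """AICc = AIC + 2k(k+1)/(n - k - 1)."""
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# Hypothetical: 6 parameters, 25 observations, log-likelihood -40
ll, k, n = -40.0, 6, 25
print(2 * k - 2 * ll)  # plain AIC: 92.0
print(aicc(ll, k, n))  # AICc: 92 + 84/18, roughly 96.67
```

With n/k barely over 4, the correction adds nearly 5 points per model — easily enough to flip a ranking that plain AIC would get wrong.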
Information criteria are powerful tools for model selection, but they work best as part of a broader workflow that includes cross-validation, residual diagnostics, and domain expertise. Use them to narrow your candidates, then apply judgment to make the final call.