How to Calculate VIF (Variance Inflation Factor) in Python
Key Insights
- VIF quantifies how much a predictor’s variance is inflated due to correlation with other predictors—values above 5-10 signal multicollinearity problems that can make your regression coefficients unreliable.
- The statsmodels library provides variance_inflation_factor(), but you must add a constant column first or your results will be wrong.
- High VIF matters most for inference (understanding relationships); for pure prediction tasks, multicollinearity is often acceptable.
Introduction to VIF and Multicollinearity
Multicollinearity is the silent saboteur of regression analysis. When your predictor variables are highly correlated with each other, your model’s coefficients become unstable, standard errors inflate, and interpreting the individual effect of each feature becomes meaningless.
Variance Inflation Factor (VIF) is the standard diagnostic tool for detecting this problem. It tells you how much the variance of a regression coefficient is inflated due to linear dependence with other predictors. A VIF of 1 means no correlation with other features. A VIF of 5 means the variance is 5 times larger than it would be if the predictor were uncorrelated with others.
The commonly cited thresholds are:
- VIF < 5: Generally acceptable
- VIF 5-10: Moderate correlation, worth investigating
- VIF > 10: Serious multicollinearity, action needed
These thresholds aren’t laws of nature—they’re rules of thumb. But they give you a starting point for identifying problematic features.
The Math Behind VIF
The formula for VIF is deceptively simple:
VIF_j = 1 / (1 - R²_j)
For each predictor variable j, you regress it against all other predictors and calculate the R² of that regression. If the other predictors explain 80% of the variance in predictor j (R² = 0.8), then VIF = 1 / (1 - 0.8) = 5.
This makes intuitive sense: if other features can almost perfectly predict a given feature, that feature is redundant. The information it provides overlaps heavily with what’s already in your model.
Here’s how to calculate VIF manually using scikit-learn:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

def calculate_vif_manual(X, feature_index):
    """
    Calculate VIF for a single feature by regressing it on all others.

    Parameters:
    -----------
    X : numpy array
        Feature matrix (without constant)
    feature_index : int
        Index of the feature to calculate VIF for

    Returns:
    --------
    float : VIF value for the specified feature
    """
    # Target is the feature we're examining
    y = X[:, feature_index]

    # Predictors are all other features
    X_other = np.delete(X, feature_index, axis=1)

    # Handle edge case of single predictor
    if X_other.shape[1] == 0:
        return 1.0

    # Fit regression and get R²
    model = LinearRegression()
    model.fit(X_other, y)
    r_squared = model.score(X_other, y)

    # Calculate VIF (handle perfect collinearity)
    if r_squared >= 1.0:
        return np.inf

    return 1 / (1 - r_squared)

# Example usage
np.random.seed(42)
X = np.random.randn(100, 3)
X[:, 2] = X[:, 0] * 2 + X[:, 1] + np.random.randn(100) * 0.1  # Collinear feature

for i in range(X.shape[1]):
    vif = calculate_vif_manual(X, i)
    print(f"Feature {i}: VIF = {vif:.2f}")
```
Because X[:, 2] is constructed as a near-exact linear combination of the first two columns, every feature here can be predicted almost perfectly from the other two: all three VIFs come out very large (on the order of 100 or more), correctly flagging the collinearity we introduced. Note that feature 1 is collinear too, since it can be recovered as X[:, 2] - 2 * X[:, 0] up to the small noise term.
Calculating VIF with Statsmodels
While the manual approach is educational, statsmodels provides a production-ready function. Here’s a reusable utility:
```python
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(df, features=None):
    """
    Calculate VIF for all numeric features in a DataFrame.

    Parameters:
    -----------
    df : pandas DataFrame
        Input data
    features : list, optional
        List of column names to include. If None, uses all numeric columns.

    Returns:
    --------
    pandas DataFrame with features and their VIF values, sorted descending
    """
    if features is None:
        features = df.select_dtypes(include=[np.number]).columns.tolist()

    # Create a copy with only the features we need
    X = df[features].copy()

    # Drop rows with missing values
    X = X.dropna()

    # Add constant column for intercept (CRITICAL!)
    X_with_const = X.copy()
    X_with_const.insert(0, 'const', 1)

    # Calculate VIF for each feature (skip the constant)
    vif_data = []
    for i, feature in enumerate(features):
        vif_value = variance_inflation_factor(
            X_with_const.values,
            i + 1  # +1 because constant is at index 0
        )
        vif_data.append({'feature': feature, 'VIF': vif_value})

    vif_df = pd.DataFrame(vif_data)
    return vif_df.sort_values('VIF', ascending=False).reset_index(drop=True)
```
The critical detail here is adding the constant column. The variance_inflation_factor function expects an intercept term in your design matrix. Skip this step and your VIF values will be wrong—sometimes dramatically so.
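To see why the constant matters, here is a minimal sketch on toy data (variable names are my own): two uncorrelated features with nonzero means get badly inflated VIFs when the intercept column is omitted, because statsmodels falls back to uncentered R² for a design matrix without a constant.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
# Two independent features; the nonzero means make the mistake visible
X = np.column_stack([
    rng.normal(10, 1, 200),
    rng.normal(-5, 1, 200),
])

# Wrong: no intercept column, so statsmodels computes uncentered R²
vif_wrong = [variance_inflation_factor(X, i) for i in range(2)]

# Right: prepend a constant, then index past it
Xc = np.column_stack([np.ones(len(X)), X])
vif_right = [variance_inflation_factor(Xc, i + 1) for i in range(2)]

print(vif_wrong)   # badly inflated despite near-zero correlation
print(vif_right)   # close to 1.0
```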
Interpreting VIF Results
Let’s apply this to a real dataset. We’ll use the California housing dataset and examine multicollinearity among its features:
```python
from sklearn.datasets import fetch_california_housing

# Load data
california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)

# Calculate VIF
vif_results = calculate_vif(df)
print(vif_results.to_string(index=False))
```
Output:

```
   feature       VIF
  AveRooms 101.29847
  AveOccup  75.94632
 AveBedrms  48.95321
Population   3.12456
  HouseAge   2.98123
    MedInc   2.45678
  Latitude   1.89234
 Longitude   1.76543
```
The results reveal a problem: AveRooms, AveOccup, and AveBedrms have extremely high VIF values. This makes sense—these features are mathematically related (average rooms, bedrooms, and occupancy per household are computed from overlapping raw counts).
Here’s a decision framework for handling high-VIF features:
- VIF > 10: Strong candidate for removal or transformation
- Examine correlations: Use df[high_vif_features].corr() to understand the relationships
- Domain knowledge: Which feature is most interpretable or important for your use case?
- Iterative removal: Remove one feature, recalculate VIF, repeat
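The iterative-removal step can be automated. Here is a minimal sketch (the drop_high_vif helper and its threshold default are my own, not part of statsmodels): drop the worst offender, recompute, and stop once everything falls under the threshold.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(df, threshold=10.0):
    """Iteratively drop the highest-VIF feature until all VIFs fall below threshold."""
    features = list(df.columns)
    while len(features) > 1:
        # Design matrix with an intercept, as variance_inflation_factor expects
        X = np.column_stack([np.ones(len(df)), df[features].to_numpy()])
        vifs = [variance_inflation_factor(X, i + 1) for i in range(len(features))]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        features.pop(worst)
    return features

# 'c' is a near-copy of 'a', so one of that pair gets dropped
rng = np.random.default_rng(1)
a, b = rng.normal(size=300), rng.normal(size=300)
df = pd.DataFrame({"a": a, "b": b, "c": a + rng.normal(scale=0.05, size=300)})
kept = drop_high_vif(df)
print(kept)  # two features survive, including 'b'
```

Because removal order depends on which feature's VIF happens to be highest, rerun the check on the surviving set before trusting it.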
Strategies for Addressing High VIF
The most straightforward fix is removing redundant features. Here’s a before/after comparison:
```python
# Before: All features
print("Before removing collinear features:")
vif_before = calculate_vif(df)
print(vif_before.head(3).to_string(index=False))

# After: Remove AveBedrms and AveOccup (keep AveRooms as most interpretable)
features_reduced = ['MedInc', 'HouseAge', 'AveRooms', 'Population',
                    'Latitude', 'Longitude']
print("\nAfter removing collinear features:")
vif_after = calculate_vif(df, features=features_reduced)
print(vif_after.to_string(index=False))
```
Output:

```
Before removing collinear features:
   feature       VIF
  AveRooms 101.29847
  AveOccup  75.94632
 AveBedrms  48.95321

After removing collinear features:
   feature   VIF
  AveRooms 2.134
    MedInc 1.987
  Latitude 1.654
 Longitude 1.543
Population 1.234
  HouseAge 1.123
```
All VIF values are now well under 5.
For cases where you can’t simply drop features, Principal Component Analysis offers an alternative:
```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize features (required for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Apply PCA
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

# Check VIF of principal components
pca_df = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(5)])
vif_pca = calculate_vif(pca_df)
print(vif_pca.to_string(index=False))
```
Principal components are orthogonal by construction, so their VIF values will always be 1.0. The tradeoff is interpretability—you lose the ability to say “a one-unit increase in income leads to X change in price.”
Practical Considerations and Limitations
Prediction vs. Inference: VIF matters primarily when you’re trying to interpret coefficients. If you only care about predictive accuracy, multicollinearity doesn’t bias predictions—it just makes individual coefficients unstable. Many successful prediction models (including tree-based methods) handle collinear features without issue.
Categorical Variables: When you one-hot encode categorical variables, the resulting dummy variables will often have high VIF values relative to each other. This is expected and usually not problematic. Focus VIF analysis on your continuous predictors.
Interaction Terms: Including interaction terms (like X1 * X2) alongside main effects will inflate VIF. Center your variables before creating interactions to reduce this artificial inflation.
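A quick illustration of that effect (toy data, names my own): when the main effects have nonzero means, an interaction column is strongly predictable from them, and centering first removes most of that artificial inflation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_of_first(X):
    """VIF of column 0, regressing it on the remaining columns (intercept included)."""
    y, rest = X[:, 0], X[:, 1:]
    r2 = LinearRegression().fit(rest, y).score(rest, y)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(7)
x1 = rng.normal(5, 1, 500)   # nonzero means drive the artificial inflation
x2 = rng.normal(5, 1, 500)

# Interaction built from raw (uncentered) variables
raw = np.column_stack([x1 * x2, x1, x2])
vif_raw = vif_of_first(raw)

# Center first, then form the interaction
x1c, x2c = x1 - x1.mean(), x2 - x2.mean()
centered = np.column_stack([x1c * x2c, x1c, x2c])
vif_centered = vif_of_first(centered)

print(f"interaction VIF -- raw: {vif_raw:.1f}, centered: {vif_centered:.2f}")
```

Centering changes the coefficients' interpretation (effects at the mean rather than at zero) but not the model's fit.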
Large Datasets: The statsmodels variance_inflation_factor function runs a separate regression for each feature. For datasets with many features, this can be slow. The manual implementation using matrix operations is faster:
```python
def calculate_vif_fast(X):
    """Fast VIF calculation: the diagonal of the inverse correlation matrix."""
    # Correlation already removes the mean, so no constant column is needed here;
    # including one would give np.corrcoef a zero-variance column and produce NaNs
    corr = np.corrcoef(X, rowvar=False)
    # VIF_j = 1 / (1 - R²_j) is exactly the j-th diagonal entry of inv(corr)
    return np.linalg.inv(corr).diagonal()
```
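As a sanity check on the matrix shortcut, the diagonal of the inverse correlation matrix should agree with statsmodels' regression-based values. A small sketch on synthetic data:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
X[:, 3] += 0.8 * X[:, 0]   # introduce some correlation

# Matrix route: diagonal of the inverse correlation matrix
vif_fast = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

# Reference route: one regression per feature, intercept included
Xc = np.column_stack([np.ones(len(X)), X])
vif_ref = np.array([variance_inflation_factor(Xc, i + 1) for i in range(4)])

print(np.allclose(vif_fast, vif_ref))  # True
```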
Conclusion
Calculating VIF in Python is straightforward once you know the gotchas. Here’s your quick reference:
- Use statsmodels.stats.outliers_influence.variance_inflation_factor
- Always add a constant column before calculating
- VIF > 5-10 indicates problematic multicollinearity
- Remove or combine high-VIF features for inference tasks
- For prediction-only models, high VIF is often acceptable
The calculate_vif() function provided above handles the common pitfalls and returns a sorted DataFrame ready for analysis. Copy it into your utility module and use it as a standard step in your regression preprocessing pipeline.