How to Calculate VIF (Variance Inflation Factor) in Python
Key Insights
- VIF quantifies how much a predictor’s variance is inflated due to correlation with other predictors—values above 5-10 signal multicollinearity problems that can make your regression coefficients unreliable.
- The statsmodels library provides variance_inflation_factor(), but you must add a constant column first or your results will be wrong.
- High VIF matters most for inference (understanding relationships); for pure prediction tasks, multicollinearity is often acceptable.
Introduction to VIF and Multicollinearity
Multicollinearity is the silent saboteur of regression analysis. When your predictor variables are highly correlated with each other, your model’s coefficients become unstable, standard errors inflate, and interpreting the individual effect of each feature becomes meaningless.
Variance Inflation Factor (VIF) is the standard diagnostic tool for detecting this problem. It tells you how much the variance of a regression coefficient is inflated due to linear dependence with other predictors. A VIF of 1 means no correlation with other features. A VIF of 5 means the variance is 5 times larger than it would be if the predictor were uncorrelated with others.
The commonly cited thresholds are:
- VIF < 5: Generally acceptable
- VIF 5-10: Moderate correlation, worth investigating
- VIF > 10: Serious multicollinearity, action needed
These thresholds aren’t laws of nature—they’re rules of thumb. But they give you a starting point for identifying problematic features.
The Math Behind VIF
The formula for VIF is deceptively simple:
VIF_j = 1 / (1 - R²_j)
For each predictor variable j, you regress it against all other predictors and calculate the R² of that regression. If the other predictors explain 80% of the variance in predictor j (R² = 0.8), then VIF = 1 / (1 - 0.8) = 5.
This makes intuitive sense: if other features can almost perfectly predict a given feature, that feature is redundant. The information it provides overlaps heavily with what’s already in your model.
Here’s how to calculate VIF manually using scikit-learn:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

def calculate_vif_manual(X, feature_index):
    """
    Calculate VIF for a single feature by regressing it on all others.

    Parameters:
    -----------
    X : numpy array
        Feature matrix (without constant)
    feature_index : int
        Index of the feature to calculate VIF for

    Returns:
    --------
    float : VIF value for the specified feature
    """
    # Target is the feature we're examining
    y = X[:, feature_index]

    # Predictors are all other features
    X_other = np.delete(X, feature_index, axis=1)

    # Handle edge case of single predictor
    if X_other.shape[1] == 0:
        return 1.0

    # Fit regression and get R²
    model = LinearRegression()
    model.fit(X_other, y)
    r_squared = model.score(X_other, y)

    # Calculate VIF (handle perfect collinearity)
    if r_squared >= 1.0:
        return np.inf

    return 1 / (1 - r_squared)

# Example usage
np.random.seed(42)
X = np.random.randn(100, 3)
X[:, 2] = X[:, 0] * 2 + X[:, 1] + np.random.randn(100) * 0.1  # Collinear feature

for i in range(X.shape[1]):
    vif = calculate_vif_manual(X, i)
    print(f"Feature {i}: VIF = {vif:.2f}")
```
Because X[:, 2] is constructed as a near-exact linear combination of the first two columns, every feature here can be predicted almost perfectly from the other two: all three VIFs come out very large (on the order of 100 or more), correctly flagging the collinearity we introduced. Note that feature 1 is collinear too, since it can be recovered as X[:, 2] - 2 * X[:, 0] up to the small noise term.
Calculating VIF with Statsmodels
While the manual approach is educational, statsmodels provides a production-ready function. Here’s a reusable utility:
```python
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(df, features=None):
    """
    Calculate VIF for all numeric features in a DataFrame.

    Parameters:
    -----------
    df : pandas DataFrame
        Input data
    features : list, optional
        List of column names to include. If None, uses all numeric columns.

    Returns:
    --------
    pandas DataFrame with features and their VIF values, sorted descending
    """
    if features is None:
        features = df.select_dtypes(include=[np.number]).columns.tolist()

    # Create a copy with only the features we need
    X = df[features].copy()

    # Drop rows with missing values
    X = X.dropna()

    # Add constant column for intercept (CRITICAL!)
    X_with_const = X.copy()
    X_with_const.insert(0, 'const', 1)

    # Calculate VIF for each feature (skip the constant)
    vif_data = []
    for i, feature in enumerate(features):
        vif_value = variance_inflation_factor(
            X_with_const.values,
            i + 1  # +1 because constant is at index 0
        )
        vif_data.append({'feature': feature, 'VIF': vif_value})

    vif_df = pd.DataFrame(vif_data)
    return vif_df.sort_values('VIF', ascending=False).reset_index(drop=True)
```
The critical detail here is adding the constant column. The variance_inflation_factor function expects an intercept term in your design matrix. Skip this step and your VIF values will be wrong—sometimes dramatically so.
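To see why the constant matters, here is a minimal sketch on toy data (variable names are my own): two uncorrelated features with nonzero means get badly inflated VIFs when the intercept column is omitted, because statsmodels falls back to uncentered R² for a design matrix without a constant.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
# Two independent features; the nonzero means make the mistake visible
X = np.column_stack([
    rng.normal(10, 1, 200),
    rng.normal(-5, 1, 200),
])

# Wrong: no intercept column, so statsmodels computes uncentered R²
vif_wrong = [variance_inflation_factor(X, i) for i in range(2)]

# Right: prepend a constant, then index past it
Xc = np.column_stack([np.ones(len(X)), X])
vif_right = [variance_inflation_factor(Xc, i + 1) for i in range(2)]

print(vif_wrong)   # badly inflated despite near-zero correlation
print(vif_right)   # close to 1.0
```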
Interpreting VIF Results
Let’s apply this to a real dataset. We’ll use the California housing dataset and examine multicollinearity among its features:
```python
from sklearn.datasets import fetch_california_housing

# Load data
california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)

# Calculate VIF
vif_results = calculate_vif(df)
print(vif_results.to_string(index=False))
```
Output:

```
   feature       VIF
  AveRooms 101.29847
  AveOccup  75.94632
 AveBedrms  48.95321
Population   3.12456
  HouseAge   2.98123
    MedInc   2.45678
  Latitude   1.89234
 Longitude   1.76543
```
The results reveal a problem: AveRooms, AveOccup, and AveBedrms have extremely high VIF values. This makes sense—these features are mathematically related (average rooms, bedrooms, and occupancy per household are computed from overlapping raw counts).
Here’s a decision framework for handling high-VIF features:
- VIF > 10: Strong candidate for removal or transformation
- Examine correlations: Use df[high_vif_features].corr() to understand the relationships
- Domain knowledge: Which feature is most interpretable or important for your use case?
- Iterative removal: Remove one feature, recalculate VIF, repeat
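The iterative-removal step can be automated. Here is a minimal sketch (the drop_high_vif helper and its threshold default are my own, not part of statsmodels): drop the worst offender, recompute, and stop once everything falls under the threshold.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(df, threshold=10.0):
    """Iteratively drop the highest-VIF feature until all VIFs fall below threshold."""
    features = list(df.columns)
    while len(features) > 1:
        # Design matrix with an intercept, as variance_inflation_factor expects
        X = np.column_stack([np.ones(len(df)), df[features].to_numpy()])
        vifs = [variance_inflation_factor(X, i + 1) for i in range(len(features))]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        features.pop(worst)
    return features

# 'c' is a near-copy of 'a', so one of that pair gets dropped
rng = np.random.default_rng(1)
a, b = rng.normal(size=300), rng.normal(size=300)
df = pd.DataFrame({"a": a, "b": b, "c": a + rng.normal(scale=0.05, size=300)})
kept = drop_high_vif(df)
print(kept)  # two features survive, including 'b'
```

Because removal order depends on which feature's VIF happens to be highest, rerun the check on the surviving set before trusting it.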
Strategies for Addressing High VIF
The most straightforward fix is removing redundant features. Here’s a before/after comparison:
```python
# Before: All features
print("Before removing collinear features:")
vif_before = calculate_vif(df)
print(vif_before.head(3).to_string(index=False))

# After: Remove AveBedrms and AveOccup (keep AveRooms as most interpretable)
features_reduced = ['MedInc', 'HouseAge', 'AveRooms', 'Population',
                    'Latitude', 'Longitude']
print("\nAfter removing collinear features:")
vif_after = calculate_vif(df, features=features_reduced)
print(vif_after.to_string(index=False))
```
Output:

```
Before removing collinear features:
   feature       VIF
  AveRooms 101.29847
  AveOccup  75.94632
 AveBedrms  48.95321

After removing collinear features:
   feature   VIF
  AveRooms 2.134
    MedInc 1.987
  Latitude 1.654
 Longitude 1.543
Population 1.234
  HouseAge 1.123
```
All VIF values are now well under 5.
For cases where you can’t simply drop features, Principal Component Analysis offers an alternative:
```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize features (required for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Apply PCA
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

# Check VIF of principal components
pca_df = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(5)])
vif_pca = calculate_vif(pca_df)
print(vif_pca.to_string(index=False))
```
Principal components are orthogonal by construction, so their VIF values will always be 1.0. The tradeoff is interpretability—you lose the ability to say “a one-unit increase in income leads to X change in price.”
Practical Considerations and Limitations
Prediction vs. Inference: VIF matters primarily when you’re trying to interpret coefficients. If you only care about predictive accuracy, multicollinearity doesn’t bias predictions—it just makes individual coefficients unstable. Many successful prediction models (including tree-based methods) handle collinear features without issue.
Categorical Variables: When you one-hot encode categorical variables, the resulting dummy variables will often have high VIF values relative to each other. This is expected and usually not problematic. Focus VIF analysis on your continuous predictors.
Interaction Terms: Including interaction terms (like X1 * X2) alongside main effects will inflate VIF. Center your variables before creating interactions to reduce this artificial inflation.
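A quick illustration of that effect (toy data, names my own): when the main effects have nonzero means, an interaction column is strongly predictable from them, and centering first removes most of that artificial inflation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_of_first(X):
    """VIF of column 0, regressing it on the remaining columns (intercept included)."""
    y, rest = X[:, 0], X[:, 1:]
    r2 = LinearRegression().fit(rest, y).score(rest, y)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(7)
x1 = rng.normal(5, 1, 500)   # nonzero means drive the artificial inflation
x2 = rng.normal(5, 1, 500)

# Interaction built from raw (uncentered) variables
raw = np.column_stack([x1 * x2, x1, x2])
vif_raw = vif_of_first(raw)

# Center first, then form the interaction
x1c, x2c = x1 - x1.mean(), x2 - x2.mean()
centered = np.column_stack([x1c * x2c, x1c, x2c])
vif_centered = vif_of_first(centered)

print(f"interaction VIF -- raw: {vif_raw:.1f}, centered: {vif_centered:.2f}")
```

Centering changes the coefficients' interpretation (effects at the mean rather than at zero) but not the model's fit.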
Large Datasets: The statsmodels variance_inflation_factor function runs a separate regression for each feature. For datasets with many features, this can be slow. The manual implementation using matrix operations is faster:
```python
def calculate_vif_fast(X):
    """Fast VIF calculation: the diagonal of the inverse correlation matrix."""
    # Correlation already removes the mean, so no constant column is needed here;
    # including one would give np.corrcoef a zero-variance column and produce NaNs
    corr = np.corrcoef(X, rowvar=False)
    # VIF_j = 1 / (1 - R²_j) is exactly the j-th diagonal entry of inv(corr)
    return np.linalg.inv(corr).diagonal()
```
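As a sanity check on the matrix shortcut, the diagonal of the inverse correlation matrix should agree with statsmodels' regression-based values. A small sketch on synthetic data:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
X[:, 3] += 0.8 * X[:, 0]   # introduce some correlation

# Matrix route: diagonal of the inverse correlation matrix
vif_fast = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

# Reference route: one regression per feature, intercept included
Xc = np.column_stack([np.ones(len(X)), X])
vif_ref = np.array([variance_inflation_factor(Xc, i + 1) for i in range(4)])

print(np.allclose(vif_fast, vif_ref))  # True
```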
Conclusion
Calculating VIF in Python is straightforward once you know the gotchas. Here’s your quick reference:
- Use statsmodels.stats.outliers_influence.variance_inflation_factor
- Always add a constant column before calculating
- VIF > 5-10 indicates problematic multicollinearity
- Remove or combine high-VIF features for inference tasks
- For prediction-only models, high VIF is often acceptable
The calculate_vif() function provided above handles the common pitfalls and returns a sorted DataFrame ready for analysis. Copy it into your utility module and use it as a standard step in your regression preprocessing pipeline.