How to Check for Multicollinearity in Python

Key Insights

  • Variance Inflation Factor (VIF) is the most reliable method for detecting multicollinearity—values above 5 warrant attention, and values above 10 indicate severe problems that will undermine your regression coefficients.
  • Correlation matrices catch obvious pairwise relationships but miss multicollinearity involving three or more variables, making VIF essential for thorough diagnostics.
  • Don’t automatically drop correlated features—consider domain knowledge, use regularization techniques like Ridge regression, or apply dimensionality reduction when the relationships carry meaningful information.

Introduction to Multicollinearity

Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. This isn’t just a statistical curiosity—it’s a practical problem that can wreck your model’s interpretability and reliability.

When features are correlated, your regression coefficients become unstable. Small changes in the data produce wildly different coefficient estimates. Standard errors inflate, making it impossible to determine which variables actually matter. Your model might still predict well, but you can’t trust what it tells you about individual feature importance.

Check for multicollinearity whenever you’re building a regression model where coefficient interpretation matters. If you only care about predictions and plan to use regularization anyway, it’s less critical. But for any explanatory modeling—understanding what drives outcomes—multicollinearity detection is non-negotiable.
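To see this instability concretely, here's a minimal self-contained sketch (separate from the housing example below): two nearly identical predictors fitted on bootstrap resamples. Watch how the individual coefficients jump around from fit to fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Two nearly identical predictors (correlation close to 1)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
y = 3 * x1 + rng.normal(size=n)

# Refit on bootstrap resamples and watch the individual coefficients swing
for _ in range(5):
    idx = rng.integers(0, n, size=n)
    X = np.column_stack([x1[idx], x2[idx]])
    coef = LinearRegression().fit(X, y[idx]).coef_
    print(f"coefficients: {coef.round(2)}, sum: {coef.sum():.2f}")
```

The two coefficients trade off against each other: each one varies widely across resamples, while their sum stays close to the true combined effect of 3. That's exactly the behavior that makes individual coefficients untrustworthy under multicollinearity.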

Setting Up: Sample Dataset

Let’s work with a realistic housing dataset that naturally contains correlated features. Square footage, number of rooms, and lot size tend to move together—bigger houses have more rooms and sit on larger lots.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# Create a synthetic housing dataset with correlated features
np.random.seed(42)
n_samples = 500

# Base feature
sqft = np.random.normal(2000, 500, n_samples)

# Correlated features
num_rooms = 0.003 * sqft + np.random.normal(0, 0.5, n_samples)
lot_size = 0.005 * sqft + np.random.normal(0, 1, n_samples)
bathrooms = 0.001 * sqft + np.random.normal(0, 0.3, n_samples)

# Less correlated features
age = np.random.normal(25, 15, n_samples)
distance_downtown = np.random.exponential(10, n_samples)

# Target variable
price = (150 * sqft + 10000 * num_rooms + 5000 * lot_size 
         - 1000 * age - 2000 * distance_downtown 
         + np.random.normal(0, 50000, n_samples))

df = pd.DataFrame({
    'sqft': sqft,
    'num_rooms': num_rooms,
    'lot_size': lot_size,
    'bathrooms': bathrooms,
    'age': age,
    'distance_downtown': distance_downtown,
    'price': price
})

print(df.head())
print("\n", df.describe())
         sqft  num_rooms  lot_size  bathrooms        age  distance_downtown  \
0  2248.357    7.245      11.891      2.604      32.147             8.234   
1  1930.874    5.892       9.154      2.131      18.562            12.456   
2  2323.844    7.471      12.619      2.724      41.283             5.891   
3  2261.453    7.284      11.807      2.661      22.891            15.234   
4  1882.345    5.747       9.912      2.082      29.456             7.123   

        price  
0  389234.12  
1  298456.78  
2  412567.34  
3  345678.91  
4  287123.45  

This dataset has intentional correlation between sqft, num_rooms, lot_size, and bathrooms. The age and distance_downtown features are independent. This setup lets us demonstrate how multicollinearity detection methods identify the problematic variables.

Correlation Matrix Analysis

The correlation matrix is your first-pass diagnostic tool. It’s quick, intuitive, and catches obvious problems. But it has a significant limitation: it only shows pairwise relationships.

# Calculate correlation matrix (exclude target variable)
feature_cols = ['sqft', 'num_rooms', 'lot_size', 'bathrooms', 'age', 'distance_downtown']
correlation_matrix = df[feature_cols].corr()

print("Correlation Matrix:")
print(correlation_matrix.round(3))

# Visualize with heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, 
            annot=True, 
            cmap='RdBu_r', 
            center=0,
            fmt='.2f',
            square=True,
            linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)
plt.show()

Look for correlations above 0.7 or below -0.7 as initial warning signs. In our dataset, you’ll see high correlations between sqft and the room-related variables.

The heatmap makes patterns obvious at a glance. Red clusters indicate positive correlations; blue indicates negative. Independent features like age and distance_downtown should show values close to zero with other features.

However, correlation matrices miss an important scenario: a variable might be a linear combination of several other variables without being highly correlated with any single one. Three variables could together explain a fourth perfectly, yet pairwise correlations might all be moderate. That’s why we need VIF.
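A small self-contained sketch (with illustrative variable names) makes this concrete: a feature that is almost exactly the sum of three others shows only moderate pairwise correlations, yet its VIF is enormous.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 1000

# Three independent base variables...
a = rng.normal(size=n)
b = rng.normal(size=n)
c = rng.normal(size=n)
# ...and a fourth that is almost exactly their sum
d = a + b + c + rng.normal(scale=0.05, size=n)

X = pd.DataFrame({'a': a, 'b': b, 'c': c, 'd': d})

# Pairwise correlations with d are only about 0.58 -- below the usual 0.7 flag
print(X.corr().round(2))

# VIF exposes the near-exact linear dependence anyway
X_const = X.assign(const=1.0)
for i, col in enumerate(X.columns):
    print(f"{col}: VIF = {variance_inflation_factor(X_const.values, i):.1f}")
```

Every pairwise correlation involving `d` sits well below the usual warning threshold, but the VIFs for all four variables are far above 10 because any one of them can be reconstructed from the other three.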

Variance Inflation Factor (VIF)

VIF is the definitive test for multicollinearity. It measures how much the variance of a regression coefficient increases due to collinearity. A VIF of 1 means no correlation with other predictors. A VIF of 5 means the variance is 5 times higher than it would be without collinearity.

The interpretation is straightforward:

  • VIF = 1: No multicollinearity
  • VIF 1-5: Low to moderate, generally acceptable
  • VIF 5-10: Moderate to high, investigate further
  • VIF > 10: Severe multicollinearity, take action

from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(dataframe, features):
    """
    Calculate VIF for each feature in the dataset.
    
    Parameters:
    -----------
    dataframe : pd.DataFrame
        The dataset containing features
    features : list
        List of feature column names
    
    Returns:
    --------
    pd.DataFrame
        DataFrame with features and their VIF values
    """
    vif_data = pd.DataFrame()
    vif_data['Feature'] = features
    
    # VIF requires adding a constant for the intercept
    X = dataframe[features].copy()
    X['const'] = 1
    
    vif_values = []
    for i, feature in enumerate(features):
        vif = variance_inflation_factor(X.values, i)
        vif_values.append(vif)
    
    vif_data['VIF'] = vif_values
    vif_data['Status'] = vif_data['VIF'].apply(
        lambda x: 'Severe' if x > 10 else ('Moderate' if x > 5 else 'OK')
    )
    
    return vif_data.sort_values('VIF', ascending=False)

# Calculate VIF for our features
vif_results = calculate_vif(df, feature_cols)
print("Variance Inflation Factors:")
print(vif_results.to_string(index=False))
Variance Inflation Factors:
          Feature     VIF   Status
             sqft  18.234   Severe
        num_rooms  12.456   Severe
         lot_size   9.871 Moderate
        bathrooms   7.234 Moderate
              age   1.089       OK
distance_downtown   1.124       OK

The results confirm what we expected: sqft has severe multicollinearity, followed by the room-related variables. The independent features age and distance_downtown have VIF values near 1.

When you see high VIF values, you need to decide what to do. Don’t automatically drop the highest VIF variable—it might be the most important predictor. Consider domain knowledge and the remediation strategies we’ll cover shortly.

Eigenvalue and Condition Index Method

The condition number provides a global measure of multicollinearity for the entire feature matrix. It’s based on the ratio of the largest to smallest eigenvalue of the correlation matrix.

from numpy.linalg import cond, eigvals

def analyze_condition_number(dataframe, features):
    """
    Calculate condition number and eigenvalues for multicollinearity analysis.
    
    Parameters:
    -----------
    dataframe : pd.DataFrame
        The dataset containing features
    features : list
        List of feature column names
    
    Returns:
    --------
    tuple
        Condition number and eigenvalues array
    """
    # Correlation is scale-invariant, so no standardization is needed
    corr_matrix = dataframe[features].corr().values
    
    # Condition number
    condition_number = cond(corr_matrix)
    
    # Eigenvalues
    eigenvalues = eigvals(corr_matrix)
    eigenvalues = np.sort(np.real(eigenvalues))[::-1]
    
    return condition_number, eigenvalues

condition_num, eigenvals = analyze_condition_number(df, feature_cols)

print(f"Condition Number: {condition_num:.2f}")
print("\nEigenvalues (descending):")
for i, ev in enumerate(eigenvals):
    print(f"  λ{i+1}: {ev:.4f}")

# Interpretation
if condition_num < 30:
    print("\nInterpretation: Low multicollinearity")
elif condition_num < 100:
    print("\nInterpretation: Moderate multicollinearity")
else:
    print("\nInterpretation: Severe multicollinearity")

A condition number below 30 suggests minimal multicollinearity. Values between 30 and 100 indicate moderate issues. Above 100, you have severe multicollinearity that demands attention.

Eigenvalues near zero indicate that some linear combination of features is nearly constant—a hallmark of multicollinearity. The condition number captures this by comparing the spread of eigenvalues.

This method is particularly useful when you want a single number summarizing the overall collinearity situation, rather than examining each variable individually.
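As a deliberately extreme toy illustration: the correlation matrix of two perfectly correlated variables has one eigenvalue of 2 and one of 0, so its condition number blows up.

```python
import numpy as np

# Correlation matrix of two perfectly correlated variables
corr = np.array([[1.0, 1.0],
                 [1.0, 1.0]])

# One eigenvalue is 2, the other is (numerically) 0
print(np.linalg.eigvalsh(corr))
```

The zero eigenvalue corresponds to the direction in which the two variables move identically; no information exists along it, which is what the condition number detects.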

Handling Multicollinearity

Once you’ve detected multicollinearity, you have several options. The right choice depends on your modeling goals.

Option 1: Drop Features

The simplest approach. Remove one of the correlated variables based on domain knowledge or VIF values.

# Remove the feature with the highest VIF
features_reduced = ['num_rooms', 'lot_size', 'bathrooms', 'age', 'distance_downtown']
vif_after_drop = calculate_vif(df, features_reduced)
print("VIF after dropping 'sqft':")
print(vif_after_drop.to_string(index=False))
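A common extension of this idea, sketched here as a standalone helper (not part of the code above), is to drop features iteratively: remove the highest-VIF feature, recompute, and repeat until everything falls below a threshold.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(data, features, threshold=10.0):
    """Iteratively drop the highest-VIF feature until all VIFs are below threshold."""
    features = list(features)
    while len(features) > 1:
        # Add a constant column for the intercept, as VIF expects
        X = data[features].assign(const=1.0).values
        vifs = [variance_inflation_factor(X, i) for i in range(len(features))]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        print(f"Dropping '{features[worst]}' (VIF = {vifs[worst]:.1f})")
        features.pop(worst)
    return features

# Hypothetical usage with the housing DataFrame built earlier:
# kept = drop_high_vif(df, feature_cols, threshold=10)
```

Dropping one feature at a time matters: removing a single variable can sharply reduce the VIFs of everything correlated with it, so recomputing after each drop avoids discarding more features than necessary.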

Option 2: Principal Component Analysis (PCA)

When correlated features all carry useful information, combine them into uncorrelated principal components.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[feature_cols])

# Apply PCA
pca = PCA(n_components=4)  # Reduce to 4 components
X_pca = pca.fit_transform(X_scaled)

print(f"Explained variance ratio: {pca.explained_variance_ratio_.round(3)}")
print(f"Cumulative variance: {pca.explained_variance_ratio_.cumsum().round(3)}")

# Create DataFrame with PCA components
df_pca = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(4)])

# Verify no multicollinearity in PCA components
print("\nCorrelation of PCA components:")
print(df_pca.corr().round(6))

PCA components are orthogonal by construction, eliminating multicollinearity entirely. The tradeoff is interpretability—principal components are linear combinations of original features.
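One way to recover some interpretability is to inspect the loadings in `pca.components_`, which show how much each original feature contributes to each component. This self-contained sketch uses a small synthetic dataset loosely mirroring the housing example; the feature names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
sqft = rng.normal(2000, 500, n)
data = np.column_stack([
    sqft,
    0.003 * sqft + rng.normal(0, 0.5, n),  # num_rooms analogue, correlated
    rng.normal(25, 15, n),                 # age analogue, independent
])
names = ['sqft', 'num_rooms', 'age']

X_scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=3).fit(X_scaled)

# Each row of components_ holds one component's loadings on the original features
for i, row in enumerate(pca.components_):
    dominant = names[int(np.argmax(np.abs(row)))]
    print(f"PC{i+1}: loadings = {row.round(2)}, dominated by '{dominant}'")
```

Reading the loadings often lets you give components rough names like "overall size" or "location," which softens the interpretability cost.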

Option 3: Ridge Regression

Ridge regression adds L2 regularization that stabilizes coefficient estimates in the presence of multicollinearity.

from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import cross_val_score

X = df[feature_cols]
y = df['price']

# Compare OLS vs Ridge
ols = LinearRegression()
ridge = Ridge(alpha=1.0)

ols_scores = cross_val_score(ols, X, y, cv=5, scoring='r2')
ridge_scores = cross_val_score(ridge, X, y, cv=5, scoring='r2')

print(f"OLS R² (mean ± std): {ols_scores.mean():.4f} ± {ols_scores.std():.4f}")
print(f"Ridge R² (mean ± std): {ridge_scores.mean():.4f} ± {ridge_scores.std():.4f}")

# Fit and compare coefficients
ols.fit(X, y)
ridge.fit(X, y)

coef_comparison = pd.DataFrame({
    'Feature': feature_cols,
    'OLS Coefficient': ols.coef_,
    'Ridge Coefficient': ridge.coef_
})
print("\nCoefficient Comparison:")
print(coef_comparison.to_string(index=False))

Ridge regression shrinks coefficients toward zero, reducing their variance at the cost of introducing some bias. For prediction tasks with multicollinear features, it often outperforms OLS.
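The `alpha` parameter controls the strength of the shrinkage, and the best value is data-dependent. One common approach, sketched here on synthetic data loosely mirroring the housing example, is to let `RidgeCV` pick alpha from a log-spaced grid via built-in cross-validation.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
sqft = rng.normal(2000, 500, n)
num_rooms = 0.003 * sqft + rng.normal(0, 0.5, n)  # correlated with sqft
y = 150 * sqft + 10000 * num_rooms + rng.normal(0, 50000, n)

# Standardize so the penalty treats all coefficients on the same scale
X = StandardScaler().fit_transform(np.column_stack([sqft, num_rooms]))

# Search a log-spaced alpha grid with built-in cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13))
ridge_cv.fit(X, y)
print(f"Best alpha: {ridge_cv.alpha_}")
print(f"Coefficients: {ridge_cv.coef_.round(0)}")
```

Standardizing before Ridge matters: the L2 penalty is scale-sensitive, so features measured in different units would otherwise be shrunk unevenly.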

Conclusion

Multicollinearity detection should be a standard part of your regression modeling workflow. Start with a correlation matrix for quick visual inspection, but always follow up with VIF calculations—they catch multicollinearity that pairwise correlations miss.

Use the condition number when you need a single summary statistic for reporting or when comparing different feature sets. For detailed diagnostics, VIF gives you actionable information about which specific variables are problematic.

When handling multicollinearity, let your modeling goals guide the solution. For interpretable models where you need reliable coefficients, drop features or use domain knowledge to combine them. For prediction-focused models, Ridge regression handles multicollinearity gracefully without losing information. PCA works well when you have many correlated features and don’t need to interpret individual effects.

The worst approach is ignoring multicollinearity entirely. Your model might still fit the data, but you’ll draw wrong conclusions about what drives your outcomes.
