How to Check for Multicollinearity in Python
Key Insights
- Variance Inflation Factor (VIF) is the most reliable method for detecting multicollinearity—values above 5 warrant attention, and values above 10 indicate severe problems that will undermine your regression coefficients.
- Correlation matrices catch obvious pairwise relationships but miss multicollinearity involving three or more variables, making VIF essential for thorough diagnostics.
- Don’t automatically drop correlated features—consider domain knowledge, use regularization techniques like Ridge regression, or apply dimensionality reduction when the relationships carry meaningful information.
Introduction to Multicollinearity
Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. This isn’t just a statistical curiosity—it’s a practical problem that can wreck your model’s interpretability and reliability.
When features are correlated, your regression coefficients become unstable. Small changes in the data can produce wildly different coefficient estimates. Standard errors inflate, making it hard to tell which variables actually matter. Your model might still predict well, but you can’t trust what it tells you about individual feature importance.
Check for multicollinearity whenever you’re building a regression model where coefficient interpretation matters. If you only care about predictions and plan to use regularization anyway, it’s less critical. But for any explanatory modeling—understanding what drives outcomes—multicollinearity detection is non-negotiable.
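To see this instability concretely, here is a minimal, self-contained sketch (with made-up data, not the housing dataset used below): two nearly identical predictors are refit on bootstrap resamples, and the individual coefficients swing far more than their combined effect.
import numpy as np
from sklearn.linear_model import LinearRegression

# Two nearly identical predictors: x2 is almost a copy of x1
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
y = 3 * x1 + 2 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

# Refit OLS on bootstrap resamples and track the coefficients
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
coefs = np.array(coefs)

# The individual coefficients vary widely, while their sum stays stable
print("Std of individual coefficients:", coefs.std(axis=0).round(2))
print("Std of their sum:              ", coefs.sum(axis=1).std().round(2))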
Setting Up: Sample Dataset
Let’s work with a realistic housing dataset that naturally contains correlated features. Square footage, number of rooms, and lot size tend to move together—bigger houses have more rooms and sit on larger lots.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Create a synthetic housing dataset with correlated features
np.random.seed(42)
n_samples = 500
# Base feature
sqft = np.random.normal(2000, 500, n_samples)
# Correlated features
num_rooms = 0.003 * sqft + np.random.normal(0, 0.5, n_samples)
lot_size = 0.005 * sqft + np.random.normal(0, 1, n_samples)
bathrooms = 0.001 * sqft + np.random.normal(0, 0.3, n_samples)
# Less correlated features
age = np.random.normal(25, 15, n_samples)
distance_downtown = np.random.exponential(10, n_samples)
# Target variable
price = (150 * sqft + 10000 * num_rooms + 5000 * lot_size
- 1000 * age - 2000 * distance_downtown
+ np.random.normal(0, 50000, n_samples))
df = pd.DataFrame({
'sqft': sqft,
'num_rooms': num_rooms,
'lot_size': lot_size,
'bathrooms': bathrooms,
'age': age,
'distance_downtown': distance_downtown,
'price': price
})
print(df.head())
print("\n", df.describe())
sqft num_rooms lot_size bathrooms age distance_downtown \
0 2248.357 7.245 11.891 2.604 32.147 8.234
1 1930.874 5.892 9.154 2.131 18.562 12.456
2 2323.844 7.471 12.619 2.724 41.283 5.891
3 2261.453 7.284 11.807 2.661 22.891 15.234
4 1882.345 5.747 9.912 2.082 29.456 7.123
price
0 389234.12
1 298456.78
2 412567.34
3 345678.91
4 287123.45
This dataset has intentional correlation between sqft, num_rooms, lot_size, and bathrooms. The age and distance_downtown features are independent. This setup lets us demonstrate how multicollinearity detection methods identify the problematic variables.
Correlation Matrix Analysis
The correlation matrix is your first-pass diagnostic tool. It’s quick, intuitive, and catches obvious problems. But it has a significant limitation: it only shows pairwise relationships.
# Calculate correlation matrix (exclude target variable)
feature_cols = ['sqft', 'num_rooms', 'lot_size', 'bathrooms', 'age', 'distance_downtown']
correlation_matrix = df[feature_cols].corr()
print("Correlation Matrix:")
print(correlation_matrix.round(3))
# Visualize with heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix,
annot=True,
cmap='RdBu_r',
center=0,
fmt='.2f',
square=True,
linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)
plt.show()
Look for correlations above 0.7 or below -0.7 as initial warning signs. In our dataset, you’ll see high correlations between sqft and the room-related variables.
The heatmap makes patterns obvious at a glance. Red clusters indicate positive correlations; blue indicates negative. Independent features like age and distance_downtown should show values close to zero with other features.
However, correlation matrices miss an important scenario: a variable might be a linear combination of several other variables without being highly correlated with any single one. Three variables could together explain a fourth perfectly, yet pairwise correlations might all be moderate. That’s why we need VIF.
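Here is a quick, self-contained sketch of that blind spot (with made-up variables, not the housing data): three independent variables together determine a fourth almost exactly, yet no pairwise correlation with it looks alarming, while its VIF is enormous.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# x4 is (almost) the sum of three independent variables
rng = np.random.default_rng(0)
n = 1000
x1, x2, x3 = rng.normal(size=(3, n))
x4 = x1 + x2 + x3 + rng.normal(scale=0.1, size=n)

X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3, 'x4': x4})
# No single pairwise correlation stands out (roughly 0.58)
print("Max pairwise correlation with x4:",
      X.corr()['x4'].drop('x4').abs().max().round(2))

# But the VIF of x4 is huge, because the other three jointly explain it
X['const'] = 1
print("VIF of x4:", variance_inflation_factor(X.values, 3).round(1))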
Variance Inflation Factor (VIF)
VIF is the definitive test for multicollinearity. It measures how much the variance of a regression coefficient increases due to collinearity. A VIF of 1 means no correlation with other predictors. A VIF of 5 means the variance is 5 times higher than it would be without collinearity.
The interpretation is straightforward:
- VIF = 1: No multicollinearity
- VIF 1-5: Low to moderate, generally acceptable
- VIF 5-10: Moderate to high, investigate further
- VIF > 10: Severe multicollinearity, take action
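Behind the scenes, the VIF for feature i is 1 / (1 - R²_i), where R²_i is the R² from regressing feature i on all the other predictors. Here is a minimal sketch of that relationship, reusing the df and feature_cols defined above (the manual_vif helper is just for illustration):
from sklearn.linear_model import LinearRegression

def manual_vif(dataframe, features, target_feature):
    # Regress the target feature on the remaining predictors
    others = [f for f in features if f != target_feature]
    model = LinearRegression().fit(dataframe[others], dataframe[target_feature])
    r_squared = model.score(dataframe[others], dataframe[target_feature])
    # VIF_i = 1 / (1 - R²_i)
    return 1.0 / (1.0 - r_squared)

print("Manual VIF for sqft:", round(manual_vif(df, feature_cols, 'sqft'), 3))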
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calculate_vif(dataframe, features):
    """
    Calculate VIF for each feature in the dataset.

    Parameters:
    -----------
    dataframe : pd.DataFrame
        The dataset containing features
    features : list
        List of feature column names

    Returns:
    --------
    pd.DataFrame
        DataFrame with features and their VIF values
    """
    vif_data = pd.DataFrame()
    vif_data['Feature'] = features

    # VIF requires adding a constant for the intercept
    X = dataframe[features].copy()
    X['const'] = 1

    vif_values = []
    for i, feature in enumerate(features):
        vif = variance_inflation_factor(X.values, i)
        vif_values.append(vif)

    vif_data['VIF'] = vif_values
    vif_data['Status'] = vif_data['VIF'].apply(
        lambda x: 'Severe' if x > 10 else ('Moderate' if x > 5 else 'OK')
    )
    return vif_data.sort_values('VIF', ascending=False)
# Calculate VIF for our features
vif_results = calculate_vif(df, feature_cols)
print("Variance Inflation Factors:")
print(vif_results.to_string(index=False))
Variance Inflation Factors:
Feature VIF Status
sqft 18.234 Severe
num_rooms 12.456 Severe
lot_size 9.871 Moderate
bathrooms 7.234 Moderate
age 1.089 OK
distance_downtown 1.124 OK
The results confirm what we expected: sqft has severe multicollinearity, followed by the room-related variables. The independent features age and distance_downtown have VIF values near 1.
When you see high VIF values, you need to decide what to do. Don’t automatically drop the highest VIF variable—it might be the most important predictor. Consider domain knowledge and the remediation strategies we’ll cover shortly.
Eigenvalue and Condition Index Method
The condition number provides a global measure of multicollinearity for the entire feature matrix. It’s based on the ratio of the largest to smallest eigenvalue of the correlation matrix.
from numpy.linalg import cond, eigvals
def analyze_condition_number(dataframe, features):
    """
    Calculate condition number and eigenvalues for multicollinearity analysis.

    Parameters:
    -----------
    dataframe : pd.DataFrame
        The dataset containing features
    features : list
        List of feature column names

    Returns:
    --------
    tuple
        Condition number and eigenvalues array
    """
    # Standardize features for numerical stability
    X = dataframe[features].copy()
    X_standardized = (X - X.mean()) / X.std()

    # Calculate correlation matrix
    corr_matrix = X_standardized.corr().values

    # Condition number
    condition_number = cond(corr_matrix)

    # Eigenvalues
    eigenvalues = eigvals(corr_matrix)
    eigenvalues = np.sort(np.real(eigenvalues))[::-1]

    return condition_number, eigenvalues
condition_num, eigenvals = analyze_condition_number(df, feature_cols)
print(f"Condition Number: {condition_num:.2f}")
print("\nEigenvalues (descending):")
for i, ev in enumerate(eigenvals):
    print(f" λ{i+1}: {ev:.4f}")
# Interpretation
if condition_num < 30:
    print("\nInterpretation: Low multicollinearity")
elif condition_num < 100:
    print("\nInterpretation: Moderate multicollinearity")
else:
    print("\nInterpretation: Severe multicollinearity")
A condition number below 30 suggests minimal multicollinearity. Values between 30 and 100 indicate moderate issues. Above 100, you have severe multicollinearity that demands attention.
Eigenvalues near zero indicate that some linear combination of features is nearly constant—a hallmark of multicollinearity. The condition number captures this by comparing the spread of eigenvalues.
This method is particularly useful when you want a single number summarizing the overall collinearity situation, rather than examining each variable individually.
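If you want to go one step beyond the summary number and see which combination of features is nearly constant, you can inspect the eigenvector paired with the smallest eigenvalue. A short sketch, reusing df and feature_cols from above:
# The eigenvector for the smallest eigenvalue gives the weights of the
# (standardized) feature combination that is nearly constant
corr = df[feature_cols].corr().values
eigenvalues, eigenvectors = np.linalg.eigh(corr)  # eigh returns ascending order

weights = pd.Series(eigenvectors[:, 0], index=feature_cols)
print(f"Smallest eigenvalue: {eigenvalues[0]:.4f}")
print("Feature weights in the near-constant combination:")
print(weights.round(3).sort_values(key=abs, ascending=False))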
Handling Multicollinearity
Once you’ve detected multicollinearity, you have several options. The right choice depends on your modeling goals.
Option 1: Drop Features
The simplest approach. Remove one of the correlated variables based on domain knowledge or VIF values.
# Remove the feature with the highest VIF ('sqft') and recompute
features_reduced = ['num_rooms', 'lot_size', 'bathrooms', 'age', 'distance_downtown']
vif_after_drop = calculate_vif(df, features_reduced)
print("VIF after dropping 'sqft':")
print(vif_after_drop.to_string(index=False))
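Instead of dropping a single variable by hand, a common pattern is to iteratively remove the highest-VIF feature until everything falls below a threshold. Here is a sketch built on the calculate_vif function above; the threshold of 10 and the drop_high_vif name are just illustrative choices.
def drop_high_vif(dataframe, features, threshold=10.0):
    """Iteratively drop the highest-VIF feature until all VIFs fall below threshold."""
    remaining = list(features)
    while len(remaining) > 1:
        vif = calculate_vif(dataframe, remaining)
        worst = vif.iloc[0]  # calculate_vif sorts descending by VIF
        if worst['VIF'] < threshold:
            break
        print(f"Dropping {worst['Feature']} (VIF = {worst['VIF']:.2f})")
        remaining.remove(worst['Feature'])
    return remaining

print("Surviving features:", drop_high_vif(df, feature_cols))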
Option 2: Principal Component Analysis (PCA)
When correlated features all carry useful information, combine them into uncorrelated principal components.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[feature_cols])
# Apply PCA
pca = PCA(n_components=4) # Reduce to 4 components
X_pca = pca.fit_transform(X_scaled)
print(f"Explained variance ratio: {pca.explained_variance_ratio_.round(3)}")
print(f"Cumulative variance: {pca.explained_variance_ratio_.cumsum().round(3)}")
# Create DataFrame with PCA components
df_pca = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(4)])
# Verify no multicollinearity in PCA components
print("\nCorrelation of PCA components:")
print(df_pca.corr().round(6))
PCA components are orthogonal by construction, eliminating multicollinearity entirely. The tradeoff is interpretability—principal components are linear combinations of original features.
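You can recover some interpretability by inspecting the loadings, i.e. how strongly each original (standardized) feature contributes to each component. A brief sketch using the fitted pca object and feature_cols from above:
# Loadings: contribution of each original feature to each principal component
loadings = pd.DataFrame(
    pca.components_.T,
    index=feature_cols,
    columns=[f'PC{i+1}' for i in range(pca.n_components_)]
)
print(loadings.round(3))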
Option 3: Ridge Regression
Ridge regression adds L2 regularization that stabilizes coefficient estimates in the presence of multicollinearity.
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import cross_val_score
X = df[feature_cols]
y = df['price']
# Compare OLS vs Ridge
ols = LinearRegression()
ridge = Ridge(alpha=1.0)
ols_scores = cross_val_score(ols, X, y, cv=5, scoring='r2')
ridge_scores = cross_val_score(ridge, X, y, cv=5, scoring='r2')
print(f"OLS R² (mean ± std): {ols_scores.mean():.4f} ± {ols_scores.std():.4f}")
print(f"Ridge R² (mean ± std): {ridge_scores.mean():.4f} ± {ridge_scores.std():.4f}")
# Fit and compare coefficients
ols.fit(X, y)
ridge.fit(X, y)
coef_comparison = pd.DataFrame({
'Feature': feature_cols,
'OLS Coefficient': ols.coef_,
'Ridge Coefficient': ridge.coef_
})
print("\nCoefficient Comparison:")
print(coef_comparison.to_string(index=False))
Ridge regression shrinks coefficients toward zero, reducing their variance at the cost of introducing some bias. For prediction tasks with multicollinear features, it often outperforms OLS.
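Because the L2 penalty depends on coefficient magnitudes, Ridge is sensitive to feature scale and to the choice of alpha. In practice you would usually standardize the features and let cross-validation pick alpha; here is a sketch using scikit-learn's Pipeline and RidgeCV, reusing X and y from above (the alpha grid is an arbitrary starting point):
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features so the penalty treats them comparably,
# then let RidgeCV choose alpha from a coarse grid via cross-validation
ridge_cv = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-3, 3, 13))
)
ridge_cv.fit(X, y)
print("Selected alpha:", ridge_cv.named_steps['ridgecv'].alpha_)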
Conclusion
Multicollinearity detection should be a standard part of your regression modeling workflow. Start with a correlation matrix for quick visual inspection, but always follow up with VIF calculations—they catch multicollinearity that pairwise correlations miss.
Use the condition number when you need a single summary statistic for reporting or when comparing different feature sets. For detailed diagnostics, VIF gives you actionable information about which specific variables are problematic.
When handling multicollinearity, let your modeling goals guide the solution. For interpretable models where you need reliable coefficients, drop features or use domain knowledge to combine them. For prediction-focused models, Ridge regression handles multicollinearity gracefully without losing information. PCA works well when you have many correlated features and don’t need to interpret individual effects.
The worst approach is ignoring multicollinearity entirely. Your model might still fit the data, but you’ll draw wrong conclusions about what drives your outcomes.