How to Perform Multiple Linear Regression in Python
Key Insights
- Multiple linear regression extends simple regression to model relationships between multiple independent variables and a single dependent variable, making it essential for real-world prediction tasks where outcomes depend on several factors.
- Always check for multicollinearity using Variance Inflation Factor (VIF) before building your model—highly correlated predictors inflate standard errors and make coefficient interpretation unreliable.
- Use scikit-learn for production predictions and statsmodels for statistical inference; they serve different purposes and you’ll often need both in practice.
Introduction to Multiple Linear Regression
Multiple linear regression is the workhorse of predictive modeling. While simple linear regression models the relationship between one independent variable and a dependent variable, multiple linear regression handles the realistic scenario where your outcome depends on several factors simultaneously.
Consider predicting house prices. A home’s value doesn’t depend solely on square footage—it’s influenced by the number of bedrooms, location quality, age of the property, and dozens of other factors. Multiple linear regression captures these relationships in a single model.
The mathematical foundation is straightforward:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Where:
- y is your target variable
- β₀ is the intercept
- β₁, β₂, ..., βₙ are coefficients for each feature
- x₁, x₂, ..., xₙ are your independent variables
- ε is the error term
Each coefficient represents the expected change in y for a one-unit increase in that variable, holding all other variables constant. This “holding constant” interpretation is what makes multiple regression powerful—and what makes multicollinearity dangerous.
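The holding-constant idea is easy to verify numerically. Below is a tiny sketch with made-up coefficients (not the housing model built later): bumping x₁ by one unit while x₂ stays fixed moves y by exactly β₁.

```python
# Hypothetical two-feature model: y = 50 + 3*x1 + 2*x2
b0, b1, b2 = 50, 3, 2

def predict(x1, x2):
    return b0 + b1 * x1 + b2 * x2

# Increase x1 by one unit while holding x2 fixed:
delta = predict(6, 10) - predict(5, 10)
print(delta)  # 3, exactly the coefficient b1
```

The same logic holds with any number of features: each coefficient isolates one variable's contribution while the others are frozen.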
Prerequisites and Setup
You’ll need several libraries for a complete multiple regression workflow. Install them if you haven’t:
pip install numpy pandas scikit-learn statsmodels seaborn matplotlib
Let’s set up our environment and create a realistic dataset for predicting house prices:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns
import matplotlib.pyplot as plt
# Set random seed for reproducibility
np.random.seed(42)
# Generate synthetic housing data
n_samples = 500
data = pd.DataFrame({
    'sqft': np.random.normal(1800, 400, n_samples),
    'bedrooms': np.random.randint(1, 6, n_samples),
    'bathrooms': np.random.randint(1, 4, n_samples),
    'age': np.random.randint(0, 50, n_samples),
    'distance_to_city': np.random.uniform(1, 30, n_samples)
})
# Create target with known relationships
data['price'] = (
    50000 +
    150 * data['sqft'] +
    10000 * data['bedrooms'] +
    15000 * data['bathrooms'] -
    1000 * data['age'] -
    2000 * data['distance_to_city'] +
    np.random.normal(0, 20000, n_samples)  # noise
)
print(data.head())
print(f"\nDataset shape: {data.shape}")
Data Preparation and Exploration
Before fitting any model, you must check your data for issues that violate regression assumptions or degrade model performance.
Checking for Missing Values
# Check for missing values
print("Missing values per column:")
print(data.isnull().sum())
# If you had missing values, you'd handle them:
# data = data.dropna() # or
# data = data.fillna(data.mean())
Multicollinearity Detection
Multicollinearity occurs when independent variables are highly correlated with each other. This doesn’t affect predictions, but it makes coefficient interpretation meaningless and inflates standard errors.
# Correlation matrix heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_matrix.png', dpi=150)
plt.show()
# Calculate Variance Inflation Factor (VIF)
# Note: add a constant first; without an intercept column, statsmodels
# computes non-centered VIFs that can be misleadingly inflated
X = data.drop('price', axis=1)
X_with_const = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [
    variance_inflation_factor(X_with_const.values, i + 1)  # offset past the constant
    for i in range(X.shape[1])
]
print("\nVariance Inflation Factors:")
print(vif_data)
VIF Interpretation:
- VIF = 1: No correlation with other variables
- VIF < 5: Acceptable
- VIF > 5: Moderate multicollinearity, investigate
- VIF > 10: Severe multicollinearity, remove or combine variables
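These thresholds are easy to apply programmatically. The sketch below uses hypothetical VIF values (mirroring the vif_data frame built above) to flag features worth investigating or removing:

```python
import pandas as pd

# Hypothetical VIF results, in the same shape as the vif_data frame above
vif_data = pd.DataFrame({
    'Feature': ['sqft', 'bedrooms', 'bathrooms', 'age'],
    'VIF': [1.2, 6.3, 12.5, 1.1],
})

# Flag features worth investigating (VIF > 5) or removing (VIF > 10)
flagged = vif_data[vif_data['VIF'] > 5]
severe = vif_data[vif_data['VIF'] > 10]
print(flagged['Feature'].tolist())  # ['bedrooms', 'bathrooms']
print(severe['Feature'].tolist())   # ['bathrooms']
```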
Feature Scaling Considerations
Linear regression coefficients are scale-dependent. If you want to compare feature importance based on coefficient magnitude, standardize your features:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X),
    columns=X.columns
)
print("Scaled feature statistics:")
print(X_scaled.describe().round(2))
For prediction purposes, scaling isn’t strictly necessary—the model will still work. But for interpretation and regularized regression, it’s essential.
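The effect of scaling on coefficient magnitudes can be demonstrated with a small sketch on synthetic data (not the housing set): two features that contribute equally to the target but live on very different scales get very different raw coefficients, while their standardized coefficients are comparable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales, each contributing equally to y
x_small = rng.normal(0, 1, 200)    # std ~1
x_large = rng.normal(0, 100, 200)  # std ~100
X = np.column_stack([x_small, x_large])
y = 2 * x_small + 0.02 * x_large + rng.normal(0, 0.1, 200)

raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)

# Raw coefficients differ by ~100x; standardized coefficients are comparable
print(raw.coef_)     # roughly [2.0, 0.02]
print(scaled.coef_)  # both roughly 2.0
```

Comparing raw coefficients here would wrongly suggest the first feature matters 100 times more; the standardized view shows their contributions are equal.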
Building the Model with scikit-learn
scikit-learn provides a clean, production-ready interface for linear regression. Here’s the complete workflow:
# Define features and target
X = data.drop('price', axis=1)
y = data['price']
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
# Initialize and fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Display coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
})
print("\nModel Coefficients:")
print(coefficients)
print(f"\nIntercept: {model.intercept_:.2f}")
The coefficients tell you the expected change in price for each unit increase in that feature. For example, a coefficient of 150 for sqft means each additional square foot adds $150 to the predicted price.
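Once fitted, the model scores new observations with predict. Here is a minimal sketch using a tiny stand-in model (the coefficients and feature values are hypothetical, and the new row's columns must appear in the same order as during training):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny stand-in for the fitted model above: price = 100*sqft + 5000*bedrooms
train = pd.DataFrame({'sqft': [1000, 1500, 2000], 'bedrooms': [2, 4, 3]})
prices = 100 * train['sqft'] + 5000 * train['bedrooms']
model = LinearRegression().fit(train, prices)

# Score a new (hypothetical) house; column order must match training
new_house = pd.DataFrame({'sqft': [1800], 'bedrooms': [3]})
predicted = model.predict(new_house)[0]
print(f"${predicted:,.0f}")  # $195,000
```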
Statistical Analysis with statsmodels
scikit-learn is great for predictions, but it doesn’t give you p-values, confidence intervals, or other statistical inference tools. That’s where statsmodels comes in:
# Add constant for intercept (statsmodels doesn't add it automatically)
X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)
# Fit OLS model
ols_model = sm.OLS(y_train, X_train_sm).fit()
# Print detailed summary
print(ols_model.summary())
The summary output includes:
- R-squared: Proportion of variance explained (0 to 1)
- Adj. R-squared: R² adjusted for number of predictors
- F-statistic: Tests if the model is statistically significant overall
- Coefficients (coef): Same as sklearn’s coefficients
- Std err: Standard error of each coefficient
- t-statistic: Coefficient divided by standard error
- P>|t|: p-value for hypothesis test that coefficient equals zero
- [0.025, 0.975]: 95% confidence interval for each coefficient
Interpreting p-values: A p-value less than 0.05 suggests the variable has a statistically significant relationship with the target. But don’t blindly remove variables with high p-values—domain knowledge matters more than arbitrary thresholds.
Model Evaluation and Diagnostics
Performance Metrics
# Calculate metrics
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print("Model Performance:")
print(f"Training R²: {train_r2:.4f}")
print(f"Testing R²: {test_r2:.4f}")
print(f"Training RMSE: ${train_rmse:,.2f}")
print(f"Testing RMSE: ${test_rmse:,.2f}")
# Check for overfitting
if train_r2 - test_r2 > 0.1:
    print("\nWarning: Possible overfitting detected")
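A single train/test split can be misleading; k-fold cross-validation gives a steadier estimate of out-of-sample performance. A sketch on synthetic data using scikit-learn's cross_val_score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(0, 1, 500)

# 5-fold CV: each fold is held out once; scores estimate out-of-sample R²
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores.round(3))
print(f"Mean R²: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

If the mean CV score is much lower than the training R², the model is not generalizing.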
Residual Analysis
Residuals should be normally distributed with constant variance. Patterns in residual plots indicate model problems:
# Calculate residuals
residuals = y_test - y_test_pred
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Residuals vs Predicted
axes[0].scatter(y_test_pred, residuals, alpha=0.5)
axes[0].axhline(y=0, color='r', linestyle='--')
axes[0].set_xlabel('Predicted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Predicted')
# Actual vs Predicted
axes[1].scatter(y_test, y_test_pred, alpha=0.5)
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
axes[1].set_xlabel('Actual Values')
axes[1].set_ylabel('Predicted Values')
axes[1].set_title('Actual vs Predicted')
# Residual histogram
axes[2].hist(residuals, bins=30, edgecolor='black')
axes[2].set_xlabel('Residual Value')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Residual Distribution')
plt.tight_layout()
plt.savefig('residual_analysis.png', dpi=150)
plt.show()
What to look for:
- Residuals vs. Predicted: Should show random scatter. A funnel shape indicates heteroscedasticity.
- Actual vs. Predicted: Points should cluster around the diagonal line.
- Residual histogram: Should approximate a normal distribution.
Conclusion and Best Practices
Multiple linear regression remains one of the most interpretable and practical modeling techniques available. Use it when you need to understand relationships between variables, not just make predictions.
When to use multiple linear regression:
- You need interpretable coefficients
- Relationships are approximately linear
- You have continuous or properly encoded categorical predictors
- Stakeholders need to understand what drives predictions
When to consider alternatives:
- Non-linear relationships exist (try polynomial features or tree-based models)
- High dimensionality with many correlated features (use regularization: Ridge, Lasso)
- Classification problems (use logistic regression)
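To illustrate the regularization point: with two nearly duplicate features, ordinary least squares coefficients become unstable, while Ridge keeps them close to their true shared values. A sketch (alpha=1.0 is an arbitrary choice here; tune it via cross-validation in practice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
# Two nearly identical (highly correlated) features
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(0, 0.01, 200)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(0, 0.1, 200)  # true coefficients: [1, 1]

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS coefficients can explode in opposite directions; Ridge keeps them tame
print(ols.coef_.round(2))
print(ridge.coef_.round(2))  # both near 1
```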
Common pitfalls to avoid:
- Ignoring multicollinearity—always check VIF before interpreting coefficients
- Extrapolating beyond your data range—the model only knows what it’s seen
- Confusing correlation with causation—regression shows association, not causality
- Overfitting with too many features—use adjusted R² and cross-validation
- Ignoring residual patterns—they reveal model inadequacies
Start with exploratory analysis, check your assumptions, build the model, and validate thoroughly. Multiple regression isn’t glamorous, but it’s reliable, interpretable, and often good enough.