How to Implement Multiple Linear Regression in Python


Key Insights

  • Multiple linear regression extends simple linear regression to handle multiple independent variables simultaneously, making it essential for real-world prediction problems where outcomes depend on several factors
  • Feature correlation analysis and proper train-test splitting are critical preprocessing steps that directly impact model accuracy and prevent overfitting
  • While scikit-learn makes implementation straightforward, understanding evaluation metrics (R², RMSE, MAE) and residual analysis is necessary to assess whether your model actually fits the data well

Introduction to Multiple Linear Regression

Multiple linear regression (MLR) is the workhorse of predictive modeling. Unlike simple linear regression that uses one independent variable, MLR handles multiple predictors simultaneously. The mathematical relationship is expressed as:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε

Where Y is the target variable, X₁ through Xₙ are independent variables, β₀ is the intercept, β₁ through βₙ are coefficients, and ε represents error.
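As a quick sanity check, the equation can be evaluated directly with NumPy. The coefficient values below are made up purely for illustration (two features: square footage and bedrooms):

```python
import numpy as np

# Hypothetical coefficients: Y = b0 + b1*X1 + b2*X2
beta_0 = 50_000.0                     # intercept
betas = np.array([150.0, 10_000.0])   # b1 ($ per square foot), b2 ($ per bedroom)

x = np.array([2_000.0, 3.0])          # one observation: 2,000 sq ft, 3 bedrooms

# The prediction is the intercept plus the dot product of coefficients
# and features; the error term epsilon is unobserved.
y_hat = beta_0 + betas @ x
print(y_hat)  # 380000.0
```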

This becomes immediately practical when you’re predicting house prices (using square footage, bedrooms, location, age), forecasting sales (based on advertising spend, seasonality, competitor pricing), or estimating employee salaries (considering experience, education, department). Any scenario where multiple factors influence an outcome is a candidate for MLR.

Setting Up Your Environment

You’ll need four core libraries. NumPy handles numerical operations, pandas manages data structures, scikit-learn provides the regression algorithms, and matplotlib creates visualizations.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
# Using a housing dataset as an example
df = pd.read_csv('housing_data.csv')

# Quick look at the data
print(df.head())
print(df.info())
print(df.describe())

For this article, assume we’re working with housing data containing features like square footage, number of bedrooms, bathrooms, age of the house, and distance to city center, with price as our target variable.

Data Preparation and Exploration

Raw data is rarely ready for modeling. You need to handle missing values, select relevant features, and understand relationships between variables.

# Check for missing values
print(df.isnull().sum())

# Handle missing values - either drop or impute
df = df.dropna()  # Simple approach
# Or use imputation: df.fillna(df.mean(numeric_only=True), inplace=True)

# Select features and target
features = ['square_feet', 'bedrooms', 'bathrooms', 'age', 'distance_to_center']
X = df[features]
y = df['price']

# Examine correlations
correlation_matrix = df[features + ['price']].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_heatmap.png')
plt.close()

# Rank features by correlation with the target
print("\nCorrelation with target variable:")
print(correlation_matrix['price'].sort_values(ascending=False))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"\nTraining set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

The correlation heatmap reveals which features strongly influence price and whether features correlate with each other (multicollinearity). High correlation between independent variables can destabilize coefficient estimates, though it won’t necessarily hurt predictions.
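A standard multicollinearity diagnostic is the variance inflation factor (VIF): regress each feature on all the others and compute 1 / (1 − R²). Values above roughly 5 to 10 signal troublesome collinearity. Here is a minimal sketch on synthetic data (the column names and the deliberate square-footage/bedrooms correlation are fabricated for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
square_feet = rng.normal(1_500, 300, n)
bedrooms = square_feet / 500 + rng.normal(0, 0.3, n)  # deliberately correlated
age = rng.uniform(0, 50, n)                           # independent of the others
X = pd.DataFrame({'square_feet': square_feet, 'bedrooms': bedrooms, 'age': age})

def vif(X: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    feature j on all remaining features."""
    scores = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores)

print(vif(X))  # square_feet and bedrooms show inflated VIFs; age stays near 1
```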

The 80-20 train-test split is standard. The random_state parameter ensures reproducibility. Never train and test on the same data; that's like studying with the exam answers in front of you.

Building the Model with Scikit-Learn

Scikit-learn makes MLR implementation remarkably clean. The LinearRegression class uses ordinary least squares behind the scenes.

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Display model parameters
print("\nModel Coefficients:")
for feature, coef in zip(features, model.coef_):
    print(f"{feature}: {coef:.2f}")
print(f"\nIntercept: {model.intercept_:.2f}")

The coefficients tell you how much the target variable changes for each unit increase in a feature, holding other features constant. A coefficient of 150 for square_feet means each additional square foot adds $150 to the predicted price, all else equal.

The intercept represents the baseline prediction when all features equal zero, which is often not meaningful in practice but mathematically necessary.
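To make the interpretation concrete, here is a sketch on synthetic data (the feature names and the $150-per-square-foot effect are fabricated): the fitted coefficients recover the per-unit effects, and model.predict applies the full equation to a new house.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1_000
X = pd.DataFrame({
    'square_feet': rng.uniform(800, 3_000, n),
    'bedrooms': rng.integers(1, 6, n),
})
# Ground truth: $150 per square foot, $10,000 per bedroom, $50,000 base
y = 50_000 + 150 * X['square_feet'] + 10_000 * X['bedrooms'] + rng.normal(0, 5_000, n)

model = LinearRegression().fit(X, y)
print(model.coef_)       # close to [150, 10000]
print(model.intercept_)  # close to 50000

# Predicting one new house: 2,000 sq ft, 3 bedrooms
new_house = pd.DataFrame({'square_feet': [2_000], 'bedrooms': [3]})
print(model.predict(new_house))  # close to 380,000
```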

Evaluating Model Performance

Metrics quantify how well your model performs. Don’t rely on just one.

# Calculate evaluation metrics
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)

print("\nModel Performance:")
print(f"Training R²: {train_r2:.4f}")
print(f"Testing R²: {test_r2:.4f}")
print(f"\nTraining RMSE: ${train_rmse:,.2f}")
print(f"Testing RMSE: ${test_rmse:,.2f}")
print(f"\nTraining MAE: ${train_mae:,.2f}")
print(f"Testing MAE: ${test_mae:,.2f}")

# Residual analysis
residuals = y_test - y_test_pred
print(f"\nMean of residuals: {np.mean(residuals):.2f}")
print(f"Std of residuals: {np.std(residuals):.2f}")

R² (coefficient of determination) indicates the proportion of variance explained. It tops out at 1, and on test data it can even go negative when the model predicts worse than simply using the mean. An R² of 0.85 means your model explains 85% of price variation. If training R² significantly exceeds testing R², you're overfitting.
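One caveat specific to multiple regression: plain R² never decreases when you add features, even useless ones. Adjusted R² penalizes for the number of predictors; scikit-learn doesn't ship this, but the helper is a one-liner:

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    penalizing R^2 for the number of predictors p."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Example: R^2 of 0.85 with 200 test samples and 5 features
print(adjusted_r2(0.85, 200, 5))  # ~0.8461
```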

RMSE (root mean squared error) penalizes large errors more than small ones. MAE (mean absolute error) treats all errors equally. Use RMSE when large errors are particularly costly, MAE when you want a more intuitive average error metric.

Residuals should be randomly distributed around zero. Patterns indicate model misspecification: perhaps you need non-linear terms or additional features.

Visualizing Results

Visualizations expose what metrics might hide.

# Actual vs Predicted plot
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_test_pred, alpha=0.6, edgecolors='k')
plt.plot([y_test.min(), y_test.max()], 
         [y_test.min(), y_test.max()], 
         'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Price ($)')
plt.ylabel('Predicted Price ($)')
plt.title('Actual vs Predicted House Prices')
plt.legend()
plt.tight_layout()
plt.savefig('actual_vs_predicted.png')
plt.close()

# Residual plot
plt.figure(figsize=(10, 6))
plt.scatter(y_test_pred, residuals, alpha=0.6, edgecolors='k')
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel('Predicted Price ($)')
plt.ylabel('Residuals ($)')
plt.title('Residual Plot')
plt.tight_layout()
plt.savefig('residual_plot.png')
plt.close()

# Residual distribution
plt.figure(figsize=(10, 6))
plt.hist(residuals, bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Residual Value ($)')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.axvline(x=0, color='r', linestyle='--', lw=2)
plt.tight_layout()
plt.savefig('residual_distribution.png')
plt.close()

The actual vs. predicted plot should show points clustering around the diagonal line. Systematic deviations suggest the model struggles with certain price ranges.

The residual plot should look like a random cloud centered on zero. Funnel shapes indicate heteroscedasticity (non-constant variance). Curves suggest non-linear relationships.

The residual distribution should approximate a normal distribution. Severe skewness or multiple peaks indicate model problems.

Advanced Considerations

Basic linear regression assumes relationships are linear and features are independent. Reality is messier.

When features are highly correlated or you have many predictors relative to observations, regularization helps. Ridge regression adds an L2 penalty term that shrinks coefficients toward zero:

# Ridge regression for regularization
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
y_test_pred_ridge = ridge_model.predict(X_test)

ridge_r2 = r2_score(y_test, y_test_pred_ridge)
ridge_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred_ridge))

print("\nRidge Regression Performance:")
print(f"R²: {ridge_r2:.4f}")
print(f"RMSE: ${ridge_rmse:,.2f}")

# Compare coefficients
print("\nCoefficient Comparison:")
for feature, lr_coef, ridge_coef in zip(features, model.coef_, ridge_model.coef_):
    print(f"{feature}: LR={lr_coef:.2f}, Ridge={ridge_coef:.2f}")

The alpha parameter controls regularization strength. Higher values mean more shrinkage. Use cross-validation to find the optimal alpha.
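scikit-learn's RidgeCV handles that alpha search for you with built-in cross-validation (leave-one-out by default). The candidate grid and synthetic data below are arbitrary illustrations:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.datasets import make_regression

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# Try alphas spanning several orders of magnitude
alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas).fit(X, y)

print(f"Best alpha: {ridge_cv.alpha_}")
print(f"R² on training data: {ridge_cv.score(X, y):.4f}")
```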

Feature scaling matters for regularized models but not ordinary least squares. If you’re using Ridge or Lasso, standardize features first:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
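A Pipeline keeps the scaler and the model bundled together, so the scaling fitted on the training data is applied identically at prediction time and can't leak test-set statistics. A sketch, again using synthetic data as a stand-in:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=5, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the scaling from training data only; predict() reuses it
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipeline.fit(X_train, y_train)
print(f"Test R²: {pipeline.score(X_test, y_test):.4f}")
```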

When relationships aren’t linear, consider polynomial features:

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

This creates interaction terms and squared terms, allowing the model to capture curvature. Be careful: polynomial features explode dimensionality quickly.
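That growth is easy to quantify: with p input features, a degree-2 expansion (without the bias column) produces p + p(p+1)/2 columns. For five features like our housing example:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.default_rng(0).normal(size=(10, 5))  # 5 features, as in the housing example

for degree in (1, 2, 3):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    n_out = poly.fit_transform(X).shape[1]
    print(f"degree {degree}: {n_out} features")
# degree 1: 5 features, degree 2: 20 features, degree 3: 55 features
```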

Multiple linear regression is straightforward to implement but requires thoughtful preparation and evaluation. Master these fundamentals before moving to more complex algorithms. Often, a well-executed linear regression outperforms a poorly implemented neural network.
