How to Implement Linear Regression in Python


Key Insights

  • Linear regression models relationships between variables using the equation y = mx + b, making it ideal for prediction tasks where you need to understand how features influence outcomes
  • While scikit-learn provides production-ready implementations, building linear regression from scratch with NumPy deepens your understanding of gradient descent and model optimization
  • Model evaluation requires multiple metrics (R², MSE, RMSE) and visual diagnostics like residual plots—a single metric never tells the complete story

Introduction to Linear Regression

Linear regression is the foundation of predictive modeling. At its core, it finds the best-fit line through your data points, allowing you to predict continuous values based on input features. The mathematical formula is deceptively simple: y = mx + b, where y is your prediction, m is the slope, x is your input feature, and b is the y-intercept.

Use simple linear regression when you have one independent variable predicting one dependent variable—like predicting house prices based solely on square footage. Multiple linear regression extends this to multiple features: y = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ.
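As a quick sketch of the multiple-feature form, a prediction is just the intercept plus a dot product of coefficients and features (the numbers below are made up for illustration):

```python
import numpy as np

b0 = 1.0                          # intercept (made-up value)
b = np.array([0.5, -2.0, 3.0])    # coefficients b1, b2, b3 (made-up)
x = np.array([4.0, 1.0, 2.0])     # one sample's feature values

# y = b0 + b1*x1 + b2*x2 + b3*x3
y_hat = b0 + b.dot(x)
print(y_hat)  # 1.0 + 2.0 - 2.0 + 6.0 = 7.0
```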

Linear regression excels at trend analysis, forecasting sales, estimating costs, and any scenario where relationships between variables are approximately linear. It’s interpretable, fast to train, and serves as a baseline for more complex models. However, it assumes linear relationships, normally distributed residuals, and no multicollinearity between features.

Setting Up Your Environment

Install the essential libraries for scientific computing and machine learning in Python:

pip install numpy pandas matplotlib scikit-learn

NumPy handles numerical operations, pandas manages data structures, matplotlib creates visualizations, and scikit-learn provides production-ready machine learning algorithms. These four libraries form the backbone of most data science workflows.
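A quick sanity check that everything imported correctly (the version numbers printed will depend on your environment):

```python
import numpy
import pandas
import matplotlib
import sklearn

# Print each library's installed version; exact numbers vary by machine
for lib in (numpy, pandas, matplotlib, sklearn):
    print(f"{lib.__name__}: {lib.__version__}")
```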

Implementing Linear Regression from Scratch

Building linear regression manually clarifies how the algorithm optimizes parameters. We’ll use the normal equation method, which directly computes the optimal coefficients.

import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic dataset
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add bias term (column of ones) to X
X_b = np.c_[np.ones((100, 1)), X]

# Calculate optimal parameters using the normal equation:
# theta = (X^T X)^-1 X^T y
# (for ill-conditioned data, np.linalg.pinv or np.linalg.lstsq is more stable)
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

print(f"Intercept: {theta_best[0][0]:.2f}")
print(f"Slope: {theta_best[1][0]:.2f}")

This code generates 100 data points following the relationship y = 4 + 3x (with noise), then calculates the optimal intercept and slope. The normal equation provides an analytical solution without iteration.
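The Key Insights above mention gradient descent; as a sketch, the same parameters can be recovered iteratively by stepping down the MSE gradient instead of solving the normal equation directly (the learning rate and iteration count here are arbitrary choices):

```python
import numpy as np

# Same synthetic data as above
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

eta = 0.1           # learning rate (arbitrary choice)
n_iterations = 1000
m = len(X_b)

theta = np.random.randn(2, 1)  # random initialization
for _ in range(n_iterations):
    # Gradient of MSE with respect to theta
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients

print(theta.ravel())  # converges toward the normal-equation solution
```

With enough iterations, the result matches the analytical solution; gradient descent matters when datasets are too large for the normal equation's matrix inversion.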

Now let’s make predictions and visualize the regression line:

# Make predictions
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict = X_new_b.dot(theta_best)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.6, label='Data points')
plt.plot(X_new, y_predict, 'r-', linewidth=2, label='Predictions')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.title('Linear Regression from Scratch')
plt.grid(True, alpha=0.3)
plt.show()

The visualization shows how well your line fits the data. The red line should pass through the center of the point cloud, minimizing the vertical distances (residuals) between points and predictions.

Using Scikit-Learn for Linear Regression

For production code, use scikit-learn. It’s optimized, well-tested, and provides a consistent API across algorithms.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load California housing dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Use only median income for simple regression
X_simple = X[['MedInc']].values

# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X_simple, y, test_size=0.2, random_state=42
)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

print(f"Coefficient: {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")

The coefficient tells you how much the predicted median house value (in $100k units) changes per unit increase in median income. A positive coefficient indicates higher-income areas have higher prices—exactly what you’d expect.

Make predictions on new data:

# Predict on test set
y_pred = model.predict(X_test)

# Predict for specific values
new_income = np.array([[3.5], [5.0], [8.0]])
predictions = model.predict(new_income)

# Target is in units of $100,000, so scale for readability
for income, price in zip(new_income.flatten(), predictions):
    print(f"Median Income: {income:.1f} → Predicted Price: ${price * 100:.1f}k")

Evaluating Model Performance

Never trust a model without evaluating it. Use multiple metrics to assess performance:

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Calculate metrics
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)

print(f"R² Score: {r2:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")

R² (coefficient of determination) indicates the proportion of variance in the target explained by your model. It tops out at 1, and can actually go negative for models that fit worse than simply predicting the mean. An R² of 0.47 means your model explains 47% of the variance—decent for a single-feature model.

RMSE (root mean squared error) is in the same units as your target variable, making it interpretable. For housing prices in $100k units, an RMSE of 0.73 means a typical prediction error of roughly $73,000.
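To make the metrics concrete, here is each formula computed by hand on a tiny made-up example and checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values purely for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

mse = np.mean((y_true - y_pred) ** 2)           # mean of squared errors
rmse = np.sqrt(mse)                              # back in target units
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

# The hand-rolled versions agree with scikit-learn's
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, R²={r2:.3f}")
# → MSE=0.250, RMSE=0.500, R²=0.950
```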

Visualize residuals to check assumptions:

residuals = y_test - y_pred

plt.figure(figsize=(12, 4))

# Residual plot
plt.subplot(1, 2, 1)
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')

# Histogram of residuals
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=50, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.tight_layout()
plt.show()

Residuals should be randomly scattered around zero with no patterns. A funnel shape indicates heteroscedasticity (non-constant variance), violating linear regression assumptions.

Handling Multiple Features

Real-world problems rarely involve single features. Multiple linear regression uses all available predictors:

# Use all features
X_train_full, X_test_full, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature scaling improves numerical stability
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_full)
X_test_scaled = scaler.transform(X_test_full)

# Train model
model_multi = LinearRegression()
model_multi.fit(X_train_scaled, y_train)

# Examine feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model_multi.coef_
}).sort_values('Coefficient', ascending=False)

print(feature_importance)

Scaling features to similar ranges prevents features with larger magnitudes from dominating the model. StandardScaler transforms features to have mean 0 and standard deviation 1.
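You can verify that property directly on made-up data: after `fit_transform`, every column's mean is (numerically) zero and its standard deviation is one.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=(200, 3))  # made-up features

scaled = StandardScaler().fit_transform(data)

# Each column should now have mean ≈ 0 and std ≈ 1
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```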

Coefficients reveal each feature’s impact. Positive coefficients increase predictions; negative coefficients decrease them. The magnitude indicates strength of the relationship.

Cross-validation provides robust performance estimates:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
cv_scores = cross_val_score(
    model_multi, X_train_scaled, y_train, 
    cv=5, scoring='r2'
)

print(f"Cross-validation R² scores: {cv_scores}")
print(f"Mean R²: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

Best Practices and Common Pitfalls

Check for linearity first. Plot your features against the target. If relationships are curved, consider polynomial features or different algorithms.

Handle outliers carefully. Linear regression is sensitive to extreme values. Investigate outliers—they might be data errors or genuine edge cases requiring special treatment.

Avoid multicollinearity. When features correlate highly with each other, coefficient estimates become unstable. Use correlation matrices or Variance Inflation Factor (VIF) to detect this.
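One way to compute VIF with only the libraries already installed: regress each feature on the others and use VIF = 1 / (1 − R²). The data below is synthetic, with the third feature deliberately constructed to be nearly collinear with the first:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 0.95 * x1 + rng.normal(scale=0.2, size=200)  # nearly collinear with x1
X = np.column_stack([x1, x2, x3])

vifs = []
for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    # R² from regressing feature i on all other features
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    vifs.append(1 / (1 - r2))
    print(f"Feature {i}: VIF = {vifs[-1]:.2f}")
```

Rule of thumb: VIF values above roughly 5-10 flag problematic collinearity; here the first and third features score high while the independent second feature stays near 1.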

Don’t extrapolate beyond your data range. Linear models perform poorly outside the training data distribution. A model trained on houses 500-3000 sq ft shouldn’t predict for 10,000 sq ft mansions.

Feature selection matters. More features don’t always improve models. Use domain knowledge, correlation analysis, or automated methods like Recursive Feature Elimination.
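A sketch of Recursive Feature Elimination on synthetic data (here `make_regression` plants signal in 5 of 10 features; the parameter values are illustrative, not a recommendation):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 features, only 5 carry signal
X, y = make_regression(n_samples=300, n_features=10,
                       n_informative=5, noise=10, random_state=42)

# Repeatedly fit the model and drop the weakest feature until 5 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe.fit(X, y)

print("Selected feature mask:", rfe.support_)
print("Feature ranking:", rfe.ranking_)  # 1 = kept; higher = dropped earlier
```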

When linear regression underperforms, consider regularization. Ridge regression (L2 penalty) shrinks coefficients toward zero to curb overfitting, while Lasso regression (L1 penalty) can shrink some coefficients exactly to zero, effectively performing feature selection:

from sklearn.linear_model import Ridge, Lasso

# alpha sets regularization strength; tune it with cross-validation
ridge = Ridge(alpha=1.0).fit(X_train_scaled, y_train)
lasso = Lasso(alpha=0.1).fit(X_train_scaled, y_train)

# Lasso can zero out coefficients entirely, dropping weak features
print(f"Lasso zeroed {(lasso.coef_ == 0).sum()} of {len(lasso.coef_)} features")

Linear regression remains relevant because it’s interpretable, fast, and often “good enough.” Master it thoroughly before moving to complex models. Sometimes the simplest solution is the best solution.
