Linear Regression: Complete Guide with Examples

Key Insights

  • Linear regression remains one of the most interpretable and computationally efficient machine learning algorithms, making it ideal for baseline models and production systems where explainability matters
  • The algorithm’s effectiveness depends critically on understanding its assumptions (linearity, independence, homoscedasticity, and normality of residuals), which, when violated, lead to unreliable predictions
  • Modern implementations should use sklearn’s Pipeline API with proper preprocessing, regularization, and cross-validation rather than building from scratch, but understanding the underlying mathematics is essential for debugging and optimization

Introduction to Linear Regression

Linear regression models the relationship between variables by fitting a linear equation to observed data. At its core, it’s the familiar equation from algebra: y = mx + b, where we predict an output (y) based on input features (x) using learned parameters (m for slope, b for intercept).

This simplicity makes linear regression the foundation of predictive modeling. Real-world applications include predicting house prices from square footage, forecasting sales based on advertising spend, estimating crop yields from rainfall, and analyzing the relationship between variables in scientific research.

import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data with linear relationship
np.random.seed(42)
X = np.linspace(0, 10, 100)
y = 2.5 * X + 1.5 + np.random.normal(0, 2, 100)

# Visualize the relationship
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.6, label='Data points')
plt.plot(X, 2.5 * X + 1.5, 'r-', linewidth=2, label='True relationship')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.title('Linear Relationship with Noise')
plt.show()

Mathematical Foundation

Linear regression works by minimizing the difference between predicted and actual values. The cost function, typically Mean Squared Error (MSE), measures this difference:

MSE = (1/n) × Σ(y_actual - y_predicted)²

We optimize parameters using gradient descent (iterative updates) or the normal equation (closed-form solution). Gradient descent works by calculating the derivative of the cost function and updating parameters in the direction that reduces error.

def compute_cost(X, y, theta):
    """Half-MSE cost: the 1/2 factor simplifies the gradient and doesn't move the minimum"""
    m = len(y)
    predictions = X.dot(theta)
    squared_errors = (predictions - y) ** 2
    cost = (1 / (2 * m)) * np.sum(squared_errors)
    return cost

# Example: How cost changes with different parameters
X_example = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])  # Add bias column
y_example = np.array([2, 4, 6, 8])

costs = []
theta_values = np.linspace(-2, 4, 100)

for theta_1 in theta_values:
    theta = np.array([0, theta_1])
    cost = compute_cost(X_example, y_example, theta)
    costs.append(cost)

plt.figure(figsize=(10, 6))
plt.plot(theta_values, costs)
plt.xlabel('Theta_1 (slope)')
plt.ylabel('Cost (MSE)')
plt.title('Cost Function vs Parameter Value')
plt.axvline(x=2, color='r', linestyle='--', label='Optimal theta_1=2')
plt.legend()
plt.show()
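The normal equation mentioned earlier solves for the optimal parameters in closed form: θ = (XᵀX)⁻¹Xᵀy. A minimal sketch on the same toy data (variable names here are illustrative):

```python
import numpy as np

# Same bias-augmented toy data as above: y = 2x exactly
X_ne = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y_ne = np.array([2, 4, 6, 8], dtype=float)

# Normal equation: theta = (X^T X)^{-1} X^T y
# np.linalg.solve avoids forming an explicit inverse
theta = np.linalg.solve(X_ne.T @ X_ne, X_ne.T @ y_ne)
print(theta)  # intercept ≈ 0, slope ≈ 2
```

The closed form works here because XᵀX is invertible; for ill-conditioned or singular designs, prefer `np.linalg.pinv` or add regularization.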

Simple Linear Regression Implementation

Building linear regression from scratch clarifies how the algorithm works. Here’s a complete implementation using gradient descent:

class LinearRegressionScratch:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.weights = None
        self.bias = None
        
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for _ in range(self.iterations):
            y_pred = np.dot(X, self.weights) + self.bias
            
            # Gradients
            dw = (1/n_samples) * np.dot(X.T, (y_pred - y))
            db = (1/n_samples) * np.sum(y_pred - y)
            
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
    
    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

# House price prediction example
house_sizes = np.array([750, 900, 1100, 1350, 1600, 1850, 2100]).reshape(-1, 1)
prices = np.array([150, 180, 220, 260, 300, 340, 380])  # in thousands

# Unscaled features (values in the thousands) force a tiny learning rate;
# anything much larger makes gradient descent diverge
model_scratch = LinearRegressionScratch(learning_rate=1e-7, iterations=10000)
model_scratch.fit(house_sizes, prices)

print(f"Weight: {model_scratch.weights[0]:.4f}")
print(f"Bias: {model_scratch.bias:.4f}")
print(f"Predicted price for 1500 sqft: ${model_scratch.predict([[1500]])[0]:.2f}k")

Compare with scikit-learn’s optimized implementation:

from sklearn.linear_model import LinearRegression

model_sklearn = LinearRegression()
model_sklearn.fit(house_sizes, prices)

print(f"Sklearn - Coefficient: {model_sklearn.coef_[0]:.4f}")
print(f"Sklearn - Intercept: {model_sklearn.intercept_:.4f}")
print(f"Sklearn - Predicted price for 1500 sqft: ${model_sklearn.predict([[1500]])[0]:.2f}k")

Multiple Linear Regression

Real-world problems involve multiple features. Multiple linear regression extends the concept: y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

Feature scaling becomes critical when features have different ranges. Multicollinearity (highly correlated features) can destabilize the model.

import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Real estate dataset with multiple features
data = pd.DataFrame({
    'size_sqft': [750, 900, 1100, 1350, 1600, 1850, 2100, 2300],
    'bedrooms': [1, 2, 2, 3, 3, 4, 4, 5],
    'age_years': [15, 10, 8, 5, 3, 2, 1, 0],
    'location_score': [6, 7, 7, 8, 8, 9, 9, 10],
    'price_k': [150, 180, 220, 260, 300, 340, 380, 420]
})

# Check correlations
correlation_matrix = data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

# Prepare features
X = data[['size_sqft', 'bedrooms', 'age_years', 'location_score']]
y = data['price_k']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train model
model = LinearRegression()
model.fit(X_scaled, y)

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'coefficient': model.coef_
}).sort_values('coefficient', ascending=False)

print(feature_importance)
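One common diagnostic for the multicollinearity mentioned above is the variance inflation factor (VIF). A sketch computing it from scratch on the same toy features (the rule of thumb that VIF > 10 signals trouble is a convention, not a hard limit):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

features = pd.DataFrame({
    'size_sqft': [750, 900, 1100, 1350, 1600, 1850, 2100, 2300],
    'bedrooms': [1, 2, 2, 3, 3, 4, 4, 5],
    'age_years': [15, 10, 8, 5, 3, 2, 1, 0],
    'location_score': [6, 7, 7, 8, 8, 9, 9, 10],
})

# VIF_i = 1 / (1 - R²_i), where R²_i comes from regressing
# feature i on all the remaining features
vifs = {}
for col in features.columns:
    others = features.drop(columns=col)
    r2 = LinearRegression().fit(others, features[col]).score(others, features[col])
    vifs[col] = float('inf') if r2 >= 1 else 1.0 / (1.0 - r2)
    print(f"{col}: VIF = {vifs[col]:.1f}")
```

On this toy data every VIF is large because all four features rise together; in practice you would drop or combine offending features, or switch to a regularized model.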

Model Evaluation and Validation

Proper evaluation prevents overfitting and ensures generalization. Key metrics include:

  • R² (R-squared): Proportion of variance explained (0 to 1, higher is better)
  • RMSE (Root Mean Squared Error): Average prediction error in original units
  • MAE (Mean Absolute Error): Average absolute difference, less sensitive to outliers

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Split data (this toy set has only 8 rows, so the metrics below are illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Train and evaluate
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)

print(f"R² Score: {r2:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")

# Cross-validation (cv=4 keeps two samples per fold on this tiny dataset;
# R² is undefined on single-sample folds)
cv_scores = cross_val_score(model, X_scaled, y, cv=4, scoring='r2')
print(f"CV R² scores: {cv_scores}")
print(f"Mean CV R²: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Residual analysis
residuals = y_test - y_pred
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')

plt.subplot(1, 2, 2)
plt.hist(residuals, bins=10, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residual Distribution')
plt.tight_layout()
plt.show()
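The histogram check above can be made quantitative with a normality test on the residuals. A minimal sketch using scipy's Shapiro-Wilk test on hypothetical residual vectors (a small p-value means the normality assumption is suspect):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical residuals: one well-behaved set, one clearly skewed
residual_sets = {
    'gaussian': rng.normal(0, 1, 200),
    'skewed': rng.exponential(1, 200),
}

pvals = {}
for name, res in residual_sets.items():
    stat, p = stats.shapiro(res)
    pvals[name] = p
    print(f"{name}: W = {stat:.3f}, p = {p:.2e}")
```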

Advanced Topics and Limitations

Linear regression assumes linear relationships, but many real-world patterns are non-linear. Polynomial regression transforms features to capture curves while maintaining linear algebra:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso

# Generate non-linear data
X_nonlinear = np.linspace(0, 10, 100).reshape(-1, 1)
# Keep the target 1-D so coef_ comes back as a flat array
y_nonlinear = (2 * X_nonlinear**2 + 3 * X_nonlinear + 1).ravel() + np.random.normal(0, 10, 100)

# Polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_nonlinear)

# Compare regularization techniques
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)

ridge.fit(X_poly, y_nonlinear)
lasso.fit(X_poly, y_nonlinear)

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)

Ridge (L2) regularization shrinks all coefficients toward zero without eliminating any, while Lasso (L1) can drive coefficients exactly to zero, performing feature selection. Use Ridge when all features are potentially relevant; use Lasso when you want automatic feature selection.
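Lasso's feature-selection behavior is easiest to see by sweeping alpha on data where only a few features matter. A sketch on synthetic data (the coefficients 3 and -2, and the alpha grid, are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 10 features, but only the first two actually drive the target
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)
X_scaled = StandardScaler().fit_transform(X)

# Larger alpha drives more coefficients exactly to zero
zeros = {}
for alpha in [0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X_scaled, y)
    zeros[alpha] = int(np.sum(lasso.coef_ == 0))
    print(f"alpha={alpha}: {zeros[alpha]}/10 coefficients zeroed")
```

At the largest alpha, the eight noise features are eliminated while the two informative ones survive (shrunken).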

Production Considerations

Production models require robust pipelines that handle preprocessing, training, and deployment consistently:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import joblib

# Complete production pipeline
numeric_features = ['size_sqft', 'age_years', 'location_score']
categorical_features = ['neighborhood']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_features)
    ])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', Ridge(alpha=1.0))
])

# Example data with categorical features
production_data = pd.DataFrame({
    'size_sqft': [1200, 1500, 1800],
    'age_years': [5, 3, 1],
    'location_score': [7, 8, 9],
    'neighborhood': ['downtown', 'suburb', 'downtown'],
    'price_k': [250, 290, 350]
})

X_prod = production_data.drop('price_k', axis=1)
y_prod = production_data['price_k']

# Train and save
pipeline.fit(X_prod, y_prod)
joblib.dump(pipeline, 'price_prediction_model.pkl')

# Load and predict
loaded_pipeline = joblib.load('price_prediction_model.pkl')
new_house = pd.DataFrame({
    'size_sqft': [1600],
    'age_years': [2],
    'location_score': [8],
    'neighborhood': ['suburb']
})
prediction = loaded_pipeline.predict(new_house)
print(f"Predicted price: ${prediction[0]:.2f}k")

Monitor model performance in production by tracking prediction distributions, residual patterns, and metric drift. Retrain when performance degrades or data distributions shift. Linear regression’s simplicity makes it excellent for establishing baselines before exploring complex models—often, it’s sufficient for the task at hand.
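The distribution-drift monitoring mentioned above can be sketched with a two-sample test on logged predictions; the numbers, names, and the p < 0.01 threshold here are all hypothetical choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical logged predictions: training time vs. live traffic
train_preds = rng.normal(300, 50, 1000)  # prices in $k at deployment
live_preds = rng.normal(340, 50, 1000)   # live distribution has shifted

# Two-sample Kolmogorov-Smirnov test: small p-value -> distributions differ
stat, p = stats.ks_2samp(train_preds, live_preds)
drift_detected = p < 0.01
print(f"KS stat = {stat:.3f}, p = {p:.2e}, drift = {drift_detected}")
```

In production this would run on a schedule against a reference window, with a triggered alert prompting investigation and possibly retraining.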
