Linear Algebra: Least Squares Explained
Key Insights
- Least squares solves overdetermined linear systems by minimizing the squared error ||Ax - b||², leading to the normal equations A^T A x = A^T b that project b onto the column space of A
- While normal equations are intuitive, QR decomposition offers better numerical stability and SVD handles rank-deficient matrices—choose your solver based on matrix properties and scale
- Linear regression is just least squares in disguise: fitting y = mx + b means finding coefficients that minimize prediction error, with regularization (Ridge) preventing ill-conditioned matrix explosions
Introduction: The Problem of Fitting Data
You have data points scattered across a plot. You need a line, curve, or model that best represents the relationship. The problem? No single line passes through all points perfectly. This is the reality of overdetermined systems—more equations than unknowns, no exact solution possible.
Least squares gives us a principled answer: find the solution that minimizes the sum of squared errors. Not arbitrary, not hand-wavy—mathematically optimal under specific assumptions. This technique underpins linear regression, curve fitting, computer vision algorithms, and countless engineering applications.
import numpy as np
import matplotlib.pyplot as plt
# Generate noisy data
np.random.seed(42)
x = np.linspace(0, 10, 50)
y_true = 2.5 * x + 1.5
y_noisy = y_true + np.random.normal(0, 3, size=x.shape)
plt.scatter(x, y_noisy, alpha=0.6, label='Data')
plt.plot(x, y_true, 'r--', label='True relationship')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title('Data Requiring a Best Fit Line')
plt.show()
The challenge: find parameters that minimize the distance between predictions and observations. Least squares provides the answer.
The Mathematical Foundation
Consider the linear system Ax = b where A is m×n with m > n (more equations than unknowns). No exact solution exists in general. Instead, we seek x that minimizes ||Ax - b||², the squared Euclidean norm of the residual.
Taking the derivative and setting to zero:
d/dx ||Ax - b||² = d/dx (Ax - b)^T(Ax - b)
= 2A^T(Ax - b) = 0
This yields the normal equations: A^T A x = A^T b
Geometrically, Ax represents vectors in the column space of A. The vector b typically lies outside this subspace. We’re projecting b onto col(A), finding the closest point Ax̂. The residual b - Ax̂ is orthogonal to col(A), meaning A^T(b - Ax̂) = 0—exactly the normal equations.
# Geometric visualization (2D example)
A = np.array([[1, 0], [0, 1], [1, 1]]) # 3x2 matrix
b = np.array([1, 2, 4])
# Solve normal equations
ATA = A.T @ A
ATb = A.T @ b
x_hat = np.linalg.solve(ATA, ATb)
# Projection
b_proj = A @ x_hat
residual = b - b_proj
print(f"Solution: {x_hat}")
print(f"Residual norm: {np.linalg.norm(residual):.4f}")
print(f"Orthogonality check A^T(b-Ax): {A.T @ residual}") # Should be ~0
The beauty: this projection minimizes distance automatically. No iterative optimization needed for linear problems.
Solving Least Squares Problems
Three methods, each with tradeoffs:
Method 1: Normal Equations
Direct but potentially unstable. The condition number of A^T A is the square of A’s condition number, amplifying numerical errors.
def solve_normal_equations(A, b):
    ATA = A.T @ A
    ATb = A.T @ b
    return np.linalg.solve(ATA, ATb)
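That squaring of the condition number is easy to verify numerically; a quick sketch on a random test matrix (example data, not tied to any earlier block):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))

# Forming the Gram matrix A^T A squares every singular value of A,
# so the ratio sigma_max / sigma_min -- the condition number -- squares too.
print(f"cond(A)     = {np.linalg.cond(A):.4f}")
print(f"cond(A^T A) = {np.linalg.cond(A.T @ A):.4f}")
print(f"cond(A)**2  = {np.linalg.cond(A) ** 2:.4f}")
```

With a condition number around 10⁸ or worse, the squared value exhausts double precision and the normal equations become unusable.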
Method 2: QR Decomposition
Factor A = QR where Q is orthogonal and R is upper triangular. Since Q^T Q = I, the normal equations become Rx = Q^T b. More stable than normal equations.
def solve_qr(A, b):
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T @ b)
Method 3: SVD (Singular Value Decomposition)
Factor A = UΣV^T. The solution is x = VΣ⁺U^T b, where Σ⁺ inverts only the singular values above a small tolerance (the pseudoinverse). Handles rank-deficient matrices and provides insight into problem structure. Most expensive but most robust.
def solve_svd(A, b):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Pseudoinverse: drop contributions from negligible singular values
    tol = s.max() * max(A.shape) * np.finfo(float).eps
    s_inv = np.where(s > tol, 1.0 / s, 0.0)
    return Vt.T @ (s_inv * (U.T @ b))
# Comparison
A = np.random.randn(100, 5)
b = np.random.randn(100)
x_normal = solve_normal_equations(A, b)
x_qr = solve_qr(A, b)
x_svd = solve_svd(A, b)
x_numpy = np.linalg.lstsq(A, b, rcond=None)[0] # NumPy's implementation
print(f"Normal equations: {np.linalg.norm(A @ x_normal - b):.6f}")
print(f"QR decomposition: {np.linalg.norm(A @ x_qr - b):.6f}")
print(f"SVD: {np.linalg.norm(A @ x_svd - b):.6f}")
print(f"NumPy lstsq: {np.linalg.norm(A @ x_numpy - b):.6f}")
Complexity: Normal equations O(mn² + n³), QR O(mn²), SVD O(mn² + n³). For tall-skinny matrices (m ≫ n), QR is the sweet spot.
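The tradeoff is easy to feel on a tall-skinny problem. A rough benchmark sketch (absolute timings are machine-dependent; the solvers are reimplemented locally so the block is self-contained):

```python
import time
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20000, 50))   # tall-skinny: m >> n
b = rng.standard_normal(20000)

def via_normal(A, b):
    return np.linalg.solve(A.T @ A, A.T @ b)

def via_qr(A, b):
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T @ b)

def via_svd(A, b):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T @ ((U.T @ b) / s)

solutions = {}
for name, solver in [("normal", via_normal), ("qr", via_qr), ("svd", via_svd)]:
    t0 = time.perf_counter()
    solutions[name] = solver(A, b)
    print(f"{name:7s} {(time.perf_counter() - t0) * 1e3:8.2f} ms")
```

On a well-conditioned matrix like this one, all three agree to machine precision; the differences only matter for speed and for ill-conditioned or rank-deficient inputs.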
Linear Regression as Least Squares
Linear regression fits y = β₀ + β₁x₁ + … + βₙxₙ. This is least squares with A as the design matrix and b as the response vector.
# Linear regression from scratch
def linear_regression(X, y):
    """
    X: n_samples × n_features
    y: n_samples
    Returns: coefficients, predictions, R², residuals
    """
    # Add intercept column
    X_design = np.column_stack([np.ones(len(X)), X])
    # Solve least squares
    coeffs = np.linalg.lstsq(X_design, y, rcond=None)[0]
    # Predictions and metrics
    y_pred = X_design @ coeffs
    residuals = y - y_pred
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((y - np.mean(y))**2)
    r_squared = 1 - (ss_res / ss_tot)
    return coeffs, y_pred, r_squared, residuals
# Example: polynomial regression
x = np.linspace(0, 10, 50)
y_true = 0.5 * x**2 - 2 * x + 5
y_noisy = y_true + np.random.normal(0, 5, size=x.shape)
# Create polynomial features
X_poly = np.column_stack([x, x**2])
coeffs, y_pred, r2, residuals = linear_regression(X_poly, y_noisy)
print(f"Coefficients: {coeffs}")
print(f"R²: {r2:.4f}")
# Visualize
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.scatter(x, y_noisy, alpha=0.6, label='Data')
plt.plot(x, y_pred, 'r-', label='Fit', linewidth=2)
plt.legend()
plt.title('Polynomial Fit')
plt.subplot(1, 2, 2)
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.tight_layout()
plt.show()
R² measures goodness of fit: 1.0 is perfect, 0.0 means the model is no better than predicting the mean. Residual plots reveal model inadequacies—patterns indicate missing structure.
Practical Considerations
Ill-Conditioning: When A^T A is nearly singular, small data changes cause huge solution changes. The condition number κ(A) = σ_max/σ_min quantifies this.
# Demonstrate ill-conditioning
def create_ill_conditioned_matrix(n, condition_number):
    """Create a matrix with the specified condition number"""
    U, _ = np.linalg.qr(np.random.randn(n, n))
    V, _ = np.linalg.qr(np.random.randn(n, n))
    s = np.logspace(0, -np.log10(condition_number), n)
    return U @ np.diag(s) @ V.T
A_bad = create_ill_conditioned_matrix(50, 1e10)
b = np.random.randn(50)
print(f"Condition number: {np.linalg.cond(A_bad):.2e}")
# Small perturbation
b_perturbed = b + 1e-6 * np.random.randn(50)
x1 = np.linalg.lstsq(A_bad, b, rcond=None)[0]
x2 = np.linalg.lstsq(A_bad, b_perturbed, rcond=None)[0]
print(f"Solution change: {np.linalg.norm(x1 - x2):.4e}")
Ridge Regression (L2 regularization) fixes this by adding λI to A^T A:
def ridge_regression(A, b, lambda_reg):
    """Regularized least squares"""
    n = A.shape[1]
    ATA_reg = A.T @ A + lambda_reg * np.eye(n)
    ATb = A.T @ b
    return np.linalg.solve(ATA_reg, ATb)
# Compare solutions
x_ols = np.linalg.lstsq(A_bad, b, rcond=None)[0]
x_ridge = ridge_regression(A_bad, b, lambda_reg=1.0)
print(f"OLS norm: {np.linalg.norm(x_ols):.4f}")
print(f"Ridge norm: {np.linalg.norm(x_ridge):.4f}") # Smaller, more stable
Weighted Least Squares: When error variance differs across observations, weight each residual accordingly: minimize Σᵢ wᵢ(aᵢᵀx − bᵢ)², which is ordinary least squares after scaling row i of A and entry i of b by √wᵢ. A common choice is wᵢ = 1/σᵢ², so noisier measurements count for less.
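A minimal sketch of that row-scaling trick (the helper name weighted_lstsq and the toy data are mine, not a NumPy API):

```python
import numpy as np

def weighted_lstsq(A, b, w):
    """Minimize sum_i w_i * (A[i] @ x - b[i])**2 by scaling each row
    of A and entry of b by sqrt(w_i), then solving ordinary least squares."""
    sw = np.sqrt(w)
    return np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]

# Fit a line where the last observation is known to be unreliable
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 30.0])       # last point is an outlier
A = np.column_stack([np.ones_like(x), x])       # intercept + slope

w_uniform = np.ones_like(y)
w_down = np.array([1.0, 1.0, 1.0, 1.0, 1e-4])   # downweight the outlier

print(weighted_lstsq(A, y, w_uniform))  # dragged toward the outlier
print(weighted_lstsq(A, y, w_down))     # close to the clean trend y = 1 + 2x
```

With the outlier downweighted, the fit essentially passes through the four clean points, which lie exactly on y = 1 + 2x.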
Real-World Applications
Sensor Calibration: Fit polynomial to temperature sensor readings against reference.
# Temperature sensor calibration
reference_temp = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
sensor_reading = np.array([0.2, 10.5, 20.8, 30.2, 39.5, 49.8, 60.5, 71.2, 81.5, 90.8, 99.5])
# Fit cubic polynomial
degrees = 3
X = np.column_stack([reference_temp**i for i in range(degrees + 1)])
coeffs = np.linalg.lstsq(X, sensor_reading, rcond=None)[0]
# Calibration function: to recover the true temperature from a raw reading,
# fit the polynomial in the inverse direction (sensor reading -> reference)
X_inv = np.column_stack([sensor_reading**i for i in range(degrees + 1)])
coeffs_inv = np.linalg.lstsq(X_inv, reference_temp, rcond=None)[0]
def calibrate(raw_reading, coeffs):
    X_raw = np.column_stack([raw_reading**i for i in range(len(coeffs))])
    return X_raw @ coeffs
# Validate
test_temps = np.linspace(0, 100, 200)
X_test = np.column_stack([test_temps**i for i in range(degrees + 1)])
predicted = X_test @ coeffs
plt.plot(reference_temp, sensor_reading, 'o', label='Measurements')
plt.plot(test_temps, predicted, '-', label='Calibration curve')
plt.xlabel('Reference Temperature (°C)')
plt.ylabel('Sensor Reading')
plt.legend()
plt.title('Sensor Calibration via Least Squares')
plt.show()
print(f"Calibration coefficients: {coeffs}")
Other applications: homography estimation in computer vision (8+ point correspondences), system identification in control theory, and GPS trilateration.
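Trilateration, for instance, reduces to least squares after a standard linearization: subtracting the first range equation from the others cancels the quadratic term ||p||². A 2D sketch (anchor positions and the true position are made-up example values):

```python
import numpy as np

# Known anchor positions (e.g., beacon locations, 2D toy case)
anchors = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
p_true = np.array([1.0, 2.0])
d = np.linalg.norm(anchors - p_true, axis=1)  # measured ranges (noise-free)

# ||p - a_i||^2 = d_i^2 expands to ||p||^2 - 2 a_i . p + ||a_i||^2 = d_i^2.
# Subtracting the i = 0 equation cancels ||p||^2, leaving a linear system:
#   2 (a_i - a_0) . p = d_0^2 - d_i^2 + ||a_i||^2 - ||a_0||^2
A = 2 * (anchors[1:] - anchors[0])
rhs = (d[0]**2 - d[1:]**2
       + np.sum(anchors[1:]**2, axis=1) - np.sum(anchors[0]**2))

p_est = np.linalg.lstsq(A, rhs, rcond=None)[0]
print(p_est)  # recovers the true position in the noise-free case
```

With noisy ranges and more anchors, the same overdetermined system is solved in the least-squares sense, which is exactly how coarse GPS position fixes work.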
Conclusion & Further Reading
Least squares is the workhorse of data fitting. Master the normal equations conceptually, use QR for practical computation, and reach for SVD when matrices misbehave. Linear regression is just least squares wearing a statistics hat.
Key takeaways: understand the geometry (projection), choose solvers wisely (stability matters), and regularize when needed (Ridge for ill-conditioning). The math is elegant, the applications endless.
Advanced topics: Non-linear least squares (Gauss-Newton, Levenberg-Marquardt), robust regression (RANSAC, Huber loss), and iterative methods for massive-scale problems (conjugate gradients, LSQR).
Resources: “Numerical Linear Algebra” by Trefethen & Bau for theory, “Matrix Computations” by Golub & Van Loan for algorithms, and scikit-learn’s documentation for production implementations.
Now go fit some data.