Linear Algebra: Least Squares Explained
Key Insights
- Least squares solves overdetermined linear systems by minimizing the squared error ||Ax - b||², leading to the normal equations A^T A x = A^T b that project b onto the column space of A
- While normal equations are intuitive, QR decomposition offers better numerical stability and SVD handles rank-deficient matrices—choose your solver based on matrix properties and scale
- Linear regression is just least squares in disguise: fitting y = mx + b means finding coefficients that minimize prediction error, with regularization (Ridge) preventing ill-conditioned matrix explosions
Introduction: The Problem of Fitting Data
You have data points scattered across a plot. You need a line, curve, or model that best represents the relationship. The problem? No single line passes through all points perfectly. This is the reality of overdetermined systems—more equations than unknowns, no exact solution possible.
Least squares gives us a principled answer: find the solution that minimizes the sum of squared errors. Not arbitrary, not hand-wavy—mathematically optimal under specific assumptions. This technique underpins linear regression, curve fitting, computer vision algorithms, and countless engineering applications.
import numpy as np
import matplotlib.pyplot as plt
# Generate noisy data
np.random.seed(42)
x = np.linspace(0, 10, 50)
y_true = 2.5 * x + 1.5
y_noisy = y_true + np.random.normal(0, 3, size=x.shape)
plt.scatter(x, y_noisy, alpha=0.6, label='Data')
plt.plot(x, y_true, 'r--', label='True relationship')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title('Data Requiring a Best Fit Line')
plt.show()
The challenge: find parameters that minimize the distance between predictions and observations. Least squares provides the answer.
The Mathematical Foundation
Consider the linear system Ax = b where A is m×n with m > n (more equations than unknowns). No exact solution exists in general. Instead, we seek x that minimizes ||Ax - b||², the squared Euclidean norm of the residual.
Taking the derivative and setting to zero:
d/dx ||Ax - b||² = d/dx (Ax - b)^T(Ax - b)
= 2A^T(Ax - b) = 0
This yields the normal equations: A^T A x = A^T b
Geometrically, Ax represents vectors in the column space of A. The vector b typically lies outside this subspace. We’re projecting b onto col(A), finding the closest point Ax̂. The residual b - Ax̂ is orthogonal to col(A), meaning A^T(b - Ax̂) = 0—exactly the normal equations.
# Geometric visualization (2D example)
A = np.array([[1, 0], [0, 1], [1, 1]]) # 3x2 matrix
b = np.array([1, 2, 4])
# Solve normal equations
ATA = A.T @ A
ATb = A.T @ b
x_hat = np.linalg.solve(ATA, ATb)
# Projection
b_proj = A @ x_hat
residual = b - b_proj
print(f"Solution: {x_hat}")
print(f"Residual norm: {np.linalg.norm(residual):.4f}")
print(f"Orthogonality check A^T(b-Ax): {A.T @ residual}") # Should be ~0
The beauty: this projection minimizes distance automatically. No iterative optimization needed for linear problems.
Solving Least Squares Problems
Three methods, each with tradeoffs:
Method 1: Normal Equations
Direct but potentially unstable. The condition number of A^T A is the square of A’s condition number, amplifying numerical errors.
def solve_normal_equations(A, b):
    ATA = A.T @ A
    ATb = A.T @ b
    return np.linalg.solve(ATA, ATb)
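That squaring of the condition number is easy to verify numerically; a quick sketch on a random test matrix (example data, not tied to any earlier block):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))

# Forming the Gram matrix A^T A squares every singular value of A,
# so the ratio sigma_max / sigma_min -- the condition number -- squares too.
print(f"cond(A)     = {np.linalg.cond(A):.4f}")
print(f"cond(A^T A) = {np.linalg.cond(A.T @ A):.4f}")
print(f"cond(A)**2  = {np.linalg.cond(A) ** 2:.4f}")
```

With a condition number around 10⁸ or worse, the squared value exhausts double precision and the normal equations become unusable.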
Method 2: QR Decomposition
Factor A = QR where Q is orthogonal and R is upper triangular. Since Q^T Q = I, the normal equations become Rx = Q^T b. More stable than normal equations.
def solve_qr(A, b):
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T @ b)
Method 3: SVD (Singular Value Decomposition)
Factor A = UΣV^T. The solution is x = VΣ⁺U^T b, where Σ⁺ inverts only the singular values above a small tolerance (the pseudoinverse). Handles rank-deficient matrices and provides insight into problem structure. Most expensive but most robust.
def solve_svd(A, b):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Pseudoinverse: drop contributions from negligible singular values
    tol = s.max() * max(A.shape) * np.finfo(float).eps
    s_inv = np.where(s > tol, 1.0 / s, 0.0)
    return Vt.T @ (s_inv * (U.T @ b))
# Comparison
A = np.random.randn(100, 5)
b = np.random.randn(100)
x_normal = solve_normal_equations(A, b)
x_qr = solve_qr(A, b)
x_svd = solve_svd(A, b)
x_numpy = np.linalg.lstsq(A, b, rcond=None)[0] # NumPy's implementation
print(f"Normal equations: {np.linalg.norm(A @ x_normal - b):.6f}")
print(f"QR decomposition: {np.linalg.norm(A @ x_qr - b):.6f}")
print(f"SVD: {np.linalg.norm(A @ x_svd - b):.6f}")
print(f"NumPy lstsq: {np.linalg.norm(A @ x_numpy - b):.6f}")
Complexity: Normal equations O(mn² + n³), QR O(mn²), SVD O(mn² + n³). For tall-skinny matrices (m ≫ n), QR is the sweet spot.
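The tradeoff is easy to feel on a tall-skinny problem. A rough benchmark sketch (absolute timings are machine-dependent; the solvers are reimplemented locally so the block is self-contained):

```python
import time
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20000, 50))   # tall-skinny: m >> n
b = rng.standard_normal(20000)

def via_normal(A, b):
    return np.linalg.solve(A.T @ A, A.T @ b)

def via_qr(A, b):
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T @ b)

def via_svd(A, b):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T @ ((U.T @ b) / s)

solutions = {}
for name, solver in [("normal", via_normal), ("qr", via_qr), ("svd", via_svd)]:
    t0 = time.perf_counter()
    solutions[name] = solver(A, b)
    print(f"{name:7s} {(time.perf_counter() - t0) * 1e3:8.2f} ms")
```

On a well-conditioned matrix like this one, all three agree to machine precision; the differences only matter for speed and for ill-conditioned or rank-deficient inputs.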
Linear Regression as Least Squares
Linear regression fits y = β₀ + β₁x₁ + … + βₙxₙ. This is least squares with A as the design matrix and b as the response vector.
# Linear regression from scratch
def linear_regression(X, y):
    """
    X: n_samples × n_features
    y: n_samples
    Returns: coefficients, predictions, R², residuals
    """
    # Add intercept column
    X_design = np.column_stack([np.ones(len(X)), X])
    # Solve least squares
    coeffs = np.linalg.lstsq(X_design, y, rcond=None)[0]
    # Predictions and metrics
    y_pred = X_design @ coeffs
    residuals = y - y_pred
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((y - np.mean(y))**2)
    r_squared = 1 - (ss_res / ss_tot)
    return coeffs, y_pred, r_squared, residuals
# Example: polynomial regression
x = np.linspace(0, 10, 50)
y_true = 0.5 * x**2 - 2 * x + 5
y_noisy = y_true + np.random.normal(0, 5, size=x.shape)
# Create polynomial features
X_poly = np.column_stack([x, x**2])
coeffs, y_pred, r2, residuals = linear_regression(X_poly, y_noisy)
print(f"Coefficients: {coeffs}")
print(f"R²: {r2:.4f}")
# Visualize
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.scatter(x, y_noisy, alpha=0.6, label='Data')
plt.plot(x, y_pred, 'r-', label='Fit', linewidth=2)
plt.legend()
plt.title('Polynomial Fit')
plt.subplot(1, 2, 2)
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.tight_layout()
plt.show()
R² measures goodness of fit: 1.0 is perfect, 0.0 means the model is no better than predicting the mean. Residual plots reveal model inadequacies—patterns indicate missing structure.
Practical Considerations
Ill-Conditioning: When A^T A is nearly singular, small data changes cause huge solution changes. The condition number κ(A) = σ_max/σ_min quantifies this.
# Demonstrate ill-conditioning
def create_ill_conditioned_matrix(n, condition_number):
    """Create a matrix with the specified condition number"""
    U, _ = np.linalg.qr(np.random.randn(n, n))
    V, _ = np.linalg.qr(np.random.randn(n, n))
    s = np.logspace(0, -np.log10(condition_number), n)
    return U @ np.diag(s) @ V.T
A_bad = create_ill_conditioned_matrix(50, 1e10)
b = np.random.randn(50)
print(f"Condition number: {np.linalg.cond(A_bad):.2e}")
# Small perturbation
b_perturbed = b + 1e-6 * np.random.randn(50)
x1 = np.linalg.lstsq(A_bad, b, rcond=None)[0]
x2 = np.linalg.lstsq(A_bad, b_perturbed, rcond=None)[0]
print(f"Solution change: {np.linalg.norm(x1 - x2):.4e}")
Ridge Regression (L2 regularization) fixes this by adding λI to A^T A:
def ridge_regression(A, b, lambda_reg):
    """Regularized least squares"""
    n = A.shape[1]
    ATA_reg = A.T @ A + lambda_reg * np.eye(n)
    ATb = A.T @ b
    return np.linalg.solve(ATA_reg, ATb)
# Compare solutions
x_ols = np.linalg.lstsq(A_bad, b, rcond=None)[0]
x_ridge = ridge_regression(A_bad, b, lambda_reg=1.0)
print(f"OLS norm: {np.linalg.norm(x_ols):.4f}")
print(f"Ridge norm: {np.linalg.norm(x_ridge):.4f}") # Smaller, more stable
Weighted Least Squares: When error variance differs across observations, weight each residual accordingly: minimize Σᵢ wᵢ(aᵢᵀx − bᵢ)², which is ordinary least squares after scaling row i of A and entry i of b by √wᵢ. A common choice is wᵢ = 1/σᵢ², so noisier measurements count for less.
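A minimal sketch of that row-scaling trick (the helper name weighted_lstsq and the toy data are mine, not a NumPy API):

```python
import numpy as np

def weighted_lstsq(A, b, w):
    """Minimize sum_i w_i * (A[i] @ x - b[i])**2 by scaling each row
    of A and entry of b by sqrt(w_i), then solving ordinary least squares."""
    sw = np.sqrt(w)
    return np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]

# Fit a line where the last observation is known to be unreliable
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 30.0])       # last point is an outlier
A = np.column_stack([np.ones_like(x), x])       # intercept + slope

w_uniform = np.ones_like(y)
w_down = np.array([1.0, 1.0, 1.0, 1.0, 1e-4])   # downweight the outlier

print(weighted_lstsq(A, y, w_uniform))  # dragged toward the outlier
print(weighted_lstsq(A, y, w_down))     # close to the clean trend y = 1 + 2x
```

With the outlier downweighted, the fit essentially passes through the four clean points, which lie exactly on y = 1 + 2x.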
Real-World Applications
Sensor Calibration: Fit polynomial to temperature sensor readings against reference.
# Temperature sensor calibration
reference_temp = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
sensor_reading = np.array([0.2, 10.5, 20.8, 30.2, 39.5, 49.8, 60.5, 71.2, 81.5, 90.8, 99.5])
# Fit cubic polynomial
degrees = 3
X = np.column_stack([reference_temp**i for i in range(degrees + 1)])
coeffs = np.linalg.lstsq(X, sensor_reading, rcond=None)[0]
# Calibration function: to recover the true temperature from a raw reading,
# fit the polynomial in the inverse direction (sensor reading -> reference)
X_inv = np.column_stack([sensor_reading**i for i in range(degrees + 1)])
coeffs_inv = np.linalg.lstsq(X_inv, reference_temp, rcond=None)[0]
def calibrate(raw_reading, coeffs):
    X_raw = np.column_stack([raw_reading**i for i in range(len(coeffs))])
    return X_raw @ coeffs
# Validate
test_temps = np.linspace(0, 100, 200)
X_test = np.column_stack([test_temps**i for i in range(degrees + 1)])
predicted = X_test @ coeffs
plt.plot(reference_temp, sensor_reading, 'o', label='Measurements')
plt.plot(test_temps, predicted, '-', label='Calibration curve')
plt.xlabel('Reference Temperature (°C)')
plt.ylabel('Sensor Reading')
plt.legend()
plt.title('Sensor Calibration via Least Squares')
plt.show()
print(f"Calibration coefficients: {coeffs}")
Other applications: homography estimation in computer vision (8+ point correspondences), system identification in control theory, and GPS trilateration.
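Trilateration, for instance, reduces to least squares after a standard linearization: subtracting the first range equation from the others cancels the quadratic term ||p||². A 2D sketch (anchor positions and the true position are made-up example values):

```python
import numpy as np

# Known anchor positions (e.g., beacon locations, 2D toy case)
anchors = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
p_true = np.array([1.0, 2.0])
d = np.linalg.norm(anchors - p_true, axis=1)  # measured ranges (noise-free)

# ||p - a_i||^2 = d_i^2 expands to ||p||^2 - 2 a_i . p + ||a_i||^2 = d_i^2.
# Subtracting the i = 0 equation cancels ||p||^2, leaving a linear system:
#   2 (a_i - a_0) . p = d_0^2 - d_i^2 + ||a_i||^2 - ||a_0||^2
A = 2 * (anchors[1:] - anchors[0])
rhs = (d[0]**2 - d[1:]**2
       + np.sum(anchors[1:]**2, axis=1) - np.sum(anchors[0]**2))

p_est = np.linalg.lstsq(A, rhs, rcond=None)[0]
print(p_est)  # recovers the true position in the noise-free case
```

With noisy ranges and more anchors, the same overdetermined system is solved in the least-squares sense, which is exactly how coarse GPS position fixes work.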
Conclusion & Further Reading
Least squares is the workhorse of data fitting. Master the normal equations conceptually, use QR for practical computation, and reach for SVD when matrices misbehave. Linear regression is just least squares wearing a statistics hat.
Key takeaways: understand the geometry (projection), choose solvers wisely (stability matters), and regularize when needed (Ridge for ill-conditioning). The math is elegant, the applications endless.
Advanced topics: Non-linear least squares (Gauss-Newton, Levenberg-Marquardt), robust regression (RANSAC, Huber loss), and iterative methods for massive-scale problems (conjugate gradients, LSQR).
Resources: “Numerical Linear Algebra” by Trefethen & Bau for theory, “Matrix Computations” by Golub & Van Loan for algorithms, and scikit-learn’s documentation for production implementations.
Now go fit some data.