How to Perform Gram-Schmidt Orthogonalization in Python

Key Insights

  • Gram-Schmidt orthogonalization transforms linearly independent vectors into orthonormal bases, critical for numerical stability in linear algebra operations and machine learning algorithms
  • The modified Gram-Schmidt algorithm provides significantly better numerical stability than the classical version when dealing with nearly-dependent vectors or ill-conditioned matrices
  • For production code, use SciPy’s QR decomposition instead of custom implementations—it’s optimized, battle-tested, and handles edge cases you haven’t considered

Introduction to Orthogonalization

Orthogonalization is the process of converting a set of linearly independent vectors into a set of orthogonal (or orthonormal) vectors that span the same subspace. In practical terms, you’re taking vectors that point in arbitrary directions and rotating them so they’re perpendicular to each other while preserving the space they define.

Why does this matter? Orthogonal vectors have an inner product of zero, which dramatically simplifies many linear algebra operations. When working with orthonormal bases (orthogonal vectors with unit length), matrix operations become more numerically stable, eigenvalue computations converge faster, and least-squares problems have cleaner solutions.
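As a quick illustration of that simplification (a minimal NumPy sketch), expressing a vector in an orthonormal basis requires no linear solve: the coordinates are plain dot products.

```python
import numpy as np

# An orthonormal basis of R^2 (the standard basis rotated 45 degrees)
Q = np.array([[1.0, -1.0],
              [1.0,  1.0]]) / np.sqrt(2)

v = np.array([3.0, 1.0])

# With an orthonormal basis, coordinates are just dot products: c = Q^T v
c = Q.T @ v

# ...and reconstruction is equally direct: v = Q c (no matrix inverse needed)
print(np.allclose(Q @ c, v))  # True
```

With a non-orthogonal basis, finding those coordinates would require solving a linear system.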

Real-world applications are everywhere. QR decomposition—the foundation of many numerical algorithms—relies on orthogonalization. In signal processing, you need orthogonal bases to decompose signals without interference between components. Machine learning practitioners use orthogonalization for feature decorrelation, improving model conditioning and interpretability. If you’ve ever used PCA or worked with regression models suffering from multicollinearity, you’ve benefited from orthogonalization.

The Gram-Schmidt Process Explained

The Gram-Schmidt algorithm is elegantly simple. Given vectors v₁, v₂, …, vₙ, you iteratively construct orthogonal vectors u₁, u₂, …, uₙ.

Start with the first vector: u₁ = v₁. For each subsequent vector vₖ, subtract its projections onto all previously computed orthogonal vectors:

uₖ = vₖ - Σ(projᵤⱼ(vₖ)) for j = 1 to k-1

The projection formula is: projᵤ(v) = ((v·u)/(u·u)) * u

To get orthonormal vectors, normalize each uₖ by dividing by its magnitude: eₖ = uₖ / ||uₖ||

Let’s visualize this with 2D vectors:

import numpy as np
import matplotlib.pyplot as plt

# Two non-orthogonal vectors
v1 = np.array([3, 1])
v2 = np.array([2, 2])

# Apply Gram-Schmidt
u1 = v1
u2 = v2 - (np.dot(v2, u1) / np.dot(u1, u1)) * u1

# Normalize
e1 = u1 / np.linalg.norm(u1)
e2 = u2 / np.linalg.norm(u2)

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Before orthogonalization
ax1.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1, color='r', width=0.01)
ax1.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1, color='b', width=0.01)
ax1.set_xlim(-0.5, 3.5)
ax1.set_ylim(-0.5, 2.5)
ax1.grid(True)
ax1.set_aspect('equal')
ax1.set_title('Original Vectors')
ax1.legend(['v1', 'v2'])

# After orthogonalization
ax2.quiver(0, 0, e1[0], e1[1], angles='xy', scale_units='xy', scale=1, color='r', width=0.01)
ax2.quiver(0, 0, e2[0], e2[1], angles='xy', scale_units='xy', scale=1, color='b', width=0.01)
ax2.set_xlim(-0.5, 1.5)
ax2.set_ylim(-0.5, 1.5)
ax2.grid(True)
ax2.set_aspect('equal')
ax2.set_title('Orthonormal Vectors')
ax2.legend(['e1', 'e2'])

plt.tight_layout()
plt.show()

print(f"Dot product before: {np.dot(v1, v2):.4f}")
print(f"Dot product after: {np.dot(e1, e2):.4f}")

Implementing Classical Gram-Schmidt in NumPy

Here’s a complete implementation from scratch:

import numpy as np

def gram_schmidt_classical(A):
    """
    Perform classical Gram-Schmidt orthogonalization.
    
    Parameters:
    A : ndarray, shape (m, n)
        Matrix whose columns are vectors to orthogonalize
        
    Returns:
    Q : ndarray, shape (m, n)
        Matrix with orthonormal columns
    """
    m, n = A.shape
    Q = np.zeros((m, n))
    
    for j in range(n):
        # Start with the j-th column of A
        v = A[:, j].astype(float)  # copy as float so in-place updates are safe
        
        # Subtract projections onto all previous orthonormal vectors
        for i in range(j):
            # Projection of A[:, j] onto Q[:, i]
            proj = np.dot(Q[:, i], A[:, j]) * Q[:, i]
            v -= proj
        
        # Normalize to get orthonormal vector
        norm = np.linalg.norm(v)
        if norm < 1e-10:
            raise ValueError(f"Vectors are linearly dependent at column {j}")
        
        Q[:, j] = v / norm
    
    return Q

# Test on a 3x3 matrix
A = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0]
])

Q = gram_schmidt_classical(A)
print("Orthonormal matrix Q:")
print(Q)
print("\nVerification Q^T Q (should be identity):")
print(Q.T @ Q)

The key insight here is the accumulation of projections. For each new vector, we systematically remove all components that lie in the directions of previously computed orthonormal vectors.
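Those accumulated projection coefficients are exactly the entries of an upper-triangular matrix R = Q^T A, which gives you the QR factorization A = QR for free. A small self-contained check, using a compact vectorized restatement of the same classical algorithm:

```python
import numpy as np

def gs(A):
    # Compact classical Gram-Schmidt: subtract all previous projections at once
    Q = np.zeros_like(A, dtype=float)
    for j in range(A.shape[1]):
        v = A[:, j] - Q[:, :j] @ (Q[:, :j].T @ A[:, j])
        Q[:, j] = v / np.linalg.norm(v)
    return Q

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

Q = gs(A)
R = Q.T @ A  # coefficients of A's columns in the orthonormal basis

print(np.allclose(A, Q @ R))           # True: A = QR
print(np.allclose(np.tril(R, -1), 0))  # True: R is upper triangular
```

R is upper triangular because each column of A lies in the span of the orthonormal vectors computed up to that point.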

Modified Gram-Schmidt for Numerical Stability

The classical algorithm has a critical flaw: rounding errors accumulate. When vectors are nearly linearly dependent, small numerical errors in early projections propagate and amplify, resulting in vectors that aren’t truly orthogonal.

The modified Gram-Schmidt algorithm fixes this by updating the remaining vectors after each orthogonalization step, rather than computing all projections against the original vectors:

def gram_schmidt_modified(A):
    """
    Perform modified Gram-Schmidt orthogonalization.
    More numerically stable than classical GS.
    """
    m, n = A.shape
    Q = A.astype(float)  # work on a floating-point copy
    R = np.zeros((n, n))
    
    for j in range(n):
        # Normalize the j-th vector
        R[j, j] = np.linalg.norm(Q[:, j])
        if R[j, j] < 1e-10:
            raise ValueError(f"Vectors are linearly dependent at column {j}")
        
        Q[:, j] = Q[:, j] / R[j, j]
        
        # Update all remaining vectors
        for i in range(j + 1, n):
            R[j, i] = np.dot(Q[:, j], Q[:, i])
            Q[:, i] = Q[:, i] - R[j, i] * Q[:, j]
    
    return Q, R

# Compare on an ill-conditioned matrix
epsilon = 1e-8  # small enough that 1 + epsilon**2 rounds to 1 in double precision
                # (a value of 1e-10 would trip the linear-dependence check above)
A_ill = np.array([
    [1.0, 1.0, 1.0],
    [epsilon, 0.0, 0.0],
    [0.0, epsilon, 0.0]
])

Q_classical = gram_schmidt_classical(A_ill)
Q_modified, R = gram_schmidt_modified(A_ill)

print("Classical GS orthogonality error:")
print(np.linalg.norm(Q_classical.T @ Q_classical - np.eye(3)))

print("\nModified GS orthogonality error:")
print(np.linalg.norm(Q_modified.T @ Q_modified - np.eye(3)))

The modified version typically shows orders of magnitude better numerical accuracy. This isn’t academic—it matters when you’re working with real data that often has near-dependencies.

Using SciPy’s Built-in QR Decomposition

For production code, don’t implement Gram-Schmidt yourself. Use SciPy’s QR decomposition, which relies on LAPACK’s Householder-reflection routines, a method even more numerically stable than modified Gram-Schmidt and implemented with careful attention to edge cases:

from scipy.linalg import qr
import time

# Generate a larger test matrix
np.random.seed(42)
A_large = np.random.randn(500, 100)

# Benchmark custom implementation
start = time.time()
Q_custom, _ = gram_schmidt_modified(A_large)
custom_time = time.time() - start

# Benchmark SciPy
start = time.time()
Q_scipy, R_scipy = qr(A_large, mode='economic')
scipy_time = time.time() - start

print(f"Custom implementation: {custom_time:.4f} seconds")
print(f"SciPy implementation: {scipy_time:.4f} seconds")
print(f"Speedup: {custom_time / scipy_time:.2f}x")

# Verify results match (columns can legitimately differ by sign, so compare magnitudes)
print(f"\nMax difference in Q: {np.max(np.abs(np.abs(Q_custom) - np.abs(Q_scipy))):.2e}")

SciPy’s implementation is not only faster but handles edge cases, pivoting, and various decomposition modes you’ll eventually need.
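One of those modes worth knowing is column pivoting (`pivoting=True` in scipy.linalg.qr), which reorders columns so the diagonal of R is non-increasing in magnitude; near-zero trailing entries flag (near-)dependent columns. A small sketch:

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
A[:, 3] = A[:, 0] + A[:, 1]  # plant an exact linear dependency

# Column-pivoted QR returns a permutation alongside Q and R
Q, R, piv = qr(A, mode='economic', pivoting=True)

d = np.abs(np.diag(R))
print(d[-1] < 1e-10)  # True: the tiny trailing diagonal reveals rank 3
```

This rank-revealing behavior is something a plain Gram-Schmidt loop doesn’t give you.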

Practical Applications and Examples

Let’s use Gram-Schmidt to orthogonalize features in a regression problem, reducing multicollinearity:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Create correlated features
X, y = make_regression(n_samples=200, n_features=5, n_informative=5, 
                       noise=10, random_state=42)

# Add multicollinearity by creating correlated features
X[:, 1] = X[:, 0] + np.random.randn(200) * 0.1
X[:, 2] = X[:, 0] + X[:, 1] + np.random.randn(200) * 0.1

# Check condition number (high = multicollinearity)
print(f"Original condition number: {np.linalg.cond(X):.2f}")

# Orthogonalize features
X_orth, _ = qr(X, mode='economic')

print(f"Orthogonalized condition number: {np.linalg.cond(X_orth):.2f}")

# Train models
model_original = LinearRegression().fit(X, y)
model_orth = LinearRegression().fit(X_orth, y)

print(f"\nOriginal R²: {r2_score(y, model_original.predict(X)):.4f}")
print(f"Orthogonalized R²: {r2_score(y, model_orth.predict(X_orth)):.4f}")

# Coefficient stability
print(f"\nOriginal coefficient std: {np.std(model_original.coef_):.4f}")
print(f"Orthogonalized coefficient std: {np.std(model_orth.coef_):.4f}")

Orthogonalization doesn’t change predictive power, but it makes coefficient estimates more stable and interpretable. Each orthogonalized feature represents a unique direction in the feature space.
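If you do need coefficients in the original feature space, they can be recovered: with X = QR, the least-squares solutions satisfy beta_X = R⁻¹ beta_Q. A minimal sketch using plain least squares (sklearn’s intercept handling is omitted to keep it self-contained):

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.standard_normal(100)

Qx, Rx = qr(X, mode='economic')

# Fit in the orthogonalized space: with orthonormal columns, lstsq reduces to Qx^T y
beta_q = Qx.T @ y

# Map back to original-feature coefficients via the triangular factor
beta_x = solve_triangular(Rx, beta_q)

beta_direct, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_x, beta_direct))  # True
```

This back-substitution is exactly how QR-based least-squares solvers work internally.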

Conclusion and Best Practices

Gram-Schmidt orthogonalization is fundamental to numerical linear algebra, but implementation details matter enormously. Use the modified algorithm if you must implement it yourself—the classical version fails spectacularly on real-world data.

For production code, always use scipy.linalg.qr(). It’s optimized, numerically stable, and maintained by people who think about edge cases professionally. Custom implementations are for learning, not deployment.

When applying orthogonalization to data, remember that it changes your feature space. Coefficients and importances are no longer directly comparable to the original features. Document this transformation clearly—future you (or your colleagues) will appreciate it.

Finally, check condition numbers before and after orthogonalization. If your condition number is still high after orthogonalization, you likely have true linear dependencies in your data that need addressing at the data collection or feature engineering level, not just numerical treatment.
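A quick diagnostic for such true dependencies (a sketch): compare the numerical rank against the column count before reaching for orthogonalization.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
X = np.column_stack([X, X[:, 0] + X[:, 1]])  # exact dependency: col2 = col0 + col1

# Rank below the column count means no orthogonalization can help
print(np.linalg.matrix_rank(X))  # 2
print(np.linalg.cond(X) > 1e10)  # True: effectively infinite
```

A deficient rank points to redundant features that should be dropped or combined, not merely rotated.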
