How to Perform LU Decomposition in Python
Key Insights
- LU decomposition factors a matrix into lower and upper triangular matrices, enabling efficient solution of multiple linear systems with the same coefficient matrix—a single O(n³) decomposition followed by O(n²) substitutions beats O(n³) per solve.
- Always use partial pivoting (PA = LU) in production code to avoid numerical instability from small pivot elements; SciPy's lu() function handles this automatically, while naive implementations fail on perfectly valid matrices.
- For solving linear systems, LU decomposition shines when you need multiple solutions with the same coefficient matrix but different right-hand sides: the decomposition cost is amortized across all solves.
Introduction to LU Decomposition
LU decomposition is a fundamental matrix factorization technique that decomposes a square matrix A into the product of two triangular matrices: a lower triangular matrix L and an upper triangular matrix U. Mathematically, this is expressed as A = LU (or PA = LU when pivoting is involved).
This decomposition is invaluable for solving systems of linear equations, computing determinants, and performing matrix inversions. Unlike methods that solve linear systems directly, LU decomposition becomes particularly efficient when you need to solve multiple systems with the same coefficient matrix but different right-hand sides. The decomposition itself costs O(n³) operations, but once computed, each subsequent solve requires only O(n²) operations through forward and backward substitution.
When should you choose LU over other decomposition methods? Use LU for general dense matrices where you need to solve linear systems. Choose Cholesky decomposition for symmetric positive-definite matrices (it’s twice as fast). Opt for QR decomposition when dealing with ill-conditioned or rectangular matrices. For sparse matrices, specialized sparse solvers typically outperform dense LU decomposition.
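As a quick sketch of these trade-offs, the example below pairs each matrix type with a matching SciPy routine. The matrices are invented for illustration; lu_factor/lu_solve and cho_factor/cho_solve are SciPy's factor-and-solve pairs, and least squares (QR-based internally) covers the rectangular case.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 4
b = rng.standard_normal(n)

# General dense matrix: LU with partial pivoting
A = rng.standard_normal((n, n))
x_lu = lu_solve(lu_factor(A), b)

# Symmetric positive-definite matrix: Cholesky (roughly half the work of LU)
S = A @ A.T + n * np.eye(n)
x_chol = cho_solve(cho_factor(S), b)

# Rectangular (or ill-conditioned) matrix: QR-based least squares
M = rng.standard_normal((2 * n, n))
x_ls, *_ = np.linalg.lstsq(M, rng.standard_normal(2 * n), rcond=None)
```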
Mathematical Foundation
In LU decomposition, the lower triangular matrix L has ones on its diagonal and zeros above, while the upper triangular matrix U has zeros below the diagonal. For a 3×3 matrix, this looks like:
import numpy as np
# Example of L and U structure
L = np.array([
    [1.0, 0.0, 0.0],
    [2.1, 1.0, 0.0],
    [0.5, 3.2, 1.0]
])
U = np.array([
    [4.0, 2.0, 1.0],
    [0.0, 3.5, 2.1],
    [0.0, 0.0, 1.8]
])
A = L @ U
print("Original matrix A from LU:")
print(A)
Not all matrices have an LU decomposition. A factorization A = LU without pivoting exists whenever all leading principal minors of A are non-zero. However, even when a decomposition exists mathematically, numerical trouble arises when pivot elements (the diagonal elements used during elimination) are small. This is where partial pivoting becomes critical.
Partial pivoting introduces a permutation matrix P, giving us PA = LU. The algorithm swaps rows to ensure the largest available element becomes the pivot, dramatically improving numerical stability. The computational complexity remains O(n³) for the decomposition, making it comparable to Gaussian elimination but more versatile for subsequent operations.
Using SciPy for LU Decomposition
SciPy provides a robust implementation through scipy.linalg.lu(). This function returns three matrices: P (permutation), L (lower triangular), and U (upper triangular), following the convention A = P @ L @ U.
import numpy as np
from scipy.linalg import lu
# Create a test matrix
A = np.array([
    [2, 5, 8],
    [4, 3, 1],
    [1, 7, 9]
], dtype=float)
# Perform LU decomposition
P, L, U = lu(A)
print("Permutation matrix P:")
print(P)
print("\nLower triangular L:")
print(L)
print("\nUpper triangular U:")
print(U)
To verify the decomposition is correct, we can reconstruct the original matrix. Keep SciPy's convention in mind: lu() returns P such that A = P @ L @ U (equivalently, P.T @ A = L @ U):
# Verify: A should equal P @ L @ U
reconstructed = P @ L @ U
print("\nP @ L @ U:")
print(reconstructed)
print("\nOriginal A:")
print(A)
print("\nAre they equal?", np.allclose(reconstructed, A))
# Also verify: P.T @ A should equal L @ U
print("\nP.T @ A equals L @ U?", np.allclose(P.T @ A, L @ U))
The permutation matrix P is orthogonal, so P.T @ P = I. This is why P.T @ A = L @ U follows directly from A = P @ L @ U.
Implementing LU Decomposition from Scratch
Understanding the algorithm deepens your grasp of numerical linear algebra. Here’s a Doolittle algorithm implementation with partial pivoting:
def lu_decomposition(A):
    """
    Perform LU decomposition with partial pivoting.
    Returns P, L, U such that P @ A = L @ U.
    """
    n = len(A)
    # Create copies to avoid modifying the input
    U = A.copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        # Partial pivoting: find the row with the largest pivot
        pivot_row = k + np.argmax(np.abs(U[k:, k]))
        if pivot_row != k:
            # Swap rows in U, P, and the computed part of L
            U[[k, pivot_row]] = U[[pivot_row, k]]
            P[[k, pivot_row]] = P[[pivot_row, k]]
            if k > 0:
                L[[k, pivot_row], :k] = L[[pivot_row, k], :k]
        # Elimination
        for i in range(k + 1, n):
            factor = U[i, k] / U[k, k]
            L[i, k] = factor
            U[i, k:] -= factor * U[k, k:]
    return P, L, U
# Test our implementation
A = np.array([
    [2, 5, 8],
    [4, 3, 1],
    [1, 7, 9]
], dtype=float)
P_custom, L_custom, U_custom = lu_decomposition(A)
# Compare with SciPy. Note the convention difference: our function
# returns P with P @ A = L @ U, while SciPy returns P with
# A = P @ L @ U, so the two permutations are transposes of each other.
P_scipy, L_scipy, U_scipy = lu(A)
print("Custom implementation matches SciPy?")
print("P match:", np.allclose(P_custom, P_scipy.T))
print("L match:", np.allclose(L_custom, L_scipy))
print("U match:", np.allclose(U_custom, U_scipy))
Solving Linear Systems with LU Decomposition
Once you have the LU decomposition, solving Ax = b becomes a two-step process: forward substitution to solve Ly = Pb, then backward substitution to solve Ux = y.
def forward_substitution(L, b):
    """Solve Ly = b for y (L lower triangular with unit diagonal)"""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = b[i] - np.dot(L[i, :i], y[:i])
    return y

def backward_substitution(U, y):
    """Solve Ux = y for x (U upper triangular)"""
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - np.dot(U[i, i+1:], x[i+1:])) / U[i, i]
    return x

def solve_lu(P, L, U, b):
    """Solve Ax = b given a decomposition with P @ A = L @ U"""
    Pb = P @ b
    y = forward_substitution(L, Pb)
    x = backward_substitution(U, y)
    return x
# Solve a linear system
b = np.array([1, 2, 3], dtype=float)
x = solve_lu(P_custom, L_custom, U_custom, b)
print("Solution x:", x)
print("Verification Ax:", A @ x)
print("Expected b:", b)
print("Match?", np.allclose(A @ x, b))
The real power emerges when solving multiple systems with the same A but different b vectors:
# Multiple right-hand sides
b_vectors = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
], dtype=float).T
# Decompose once (SciPy returns P with A = P @ L @ U, so pass P.T
# to solve_lu, which expects P @ A = L @ U)
P, L, U = lu(A)
# Solve multiple times efficiently
solutions = []
for b in b_vectors.T:
    x = solve_lu(P.T, L, U, b)
    solutions.append(x)
solutions = np.array(solutions).T
print("\nSolutions for multiple right-hand sides:")
print(solutions)
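For production code, SciPy packages this factor-once, solve-many workflow directly: scipy.linalg.lu_factor returns the factorization in packed form along with pivot indices, and scipy.linalg.lu_solve performs both substitutions. A minimal sketch with the same 3×3 matrix:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[2, 5, 8],
              [4, 3, 1],
              [1, 7, 9]], dtype=float)

lu_piv = lu_factor(A)            # O(n^3) factorization, done once
for rhs in np.eye(3):            # identity columns: recovers the columns of A^-1
    x = lu_solve(lu_piv, rhs)    # O(n^2) per right-hand side
    print(x)
```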
Practical Applications and Performance
Computing determinants becomes trivial with LU decomposition—the determinant of A equals the product of U’s diagonal elements times the determinant of P (which is ±1):
def determinant_lu(A):
    """Compute determinant using LU decomposition"""
    P, L, U = lu(A)
    # det(A) = det(P) * det(L) * det(U)
    # det(L) = 1 (diagonal is all ones)
    # det(U) = product of its diagonal
    # det(P) = (-1)^(number of row swaps) = +/-1
    det_U = np.prod(np.diag(U))
    det_P = np.linalg.det(P)
    return det_P * det_U
det_lu = determinant_lu(A)
det_numpy = np.linalg.det(A)
print(f"Determinant (LU): {det_lu}")
print(f"Determinant (NumPy): {det_numpy}")
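If you want to avoid building the dense permutation matrix just to take its determinant, a variant (sketched here as an illustration, not part of the original text) can count row swaps from the pivot indices that scipy.linalg.lu_factor returns: each entry piv[i] != i records one swap, and each swap flips the sign.

```python
import numpy as np
from scipy.linalg import lu_factor

def determinant_lu_factor(A):
    """Determinant from SciPy's packed LU factorization (sketch)."""
    lu_mat, piv = lu_factor(A)
    # piv[i] != i means row i was interchanged during elimination;
    # every interchange flips the determinant's sign.
    sign = (-1) ** np.count_nonzero(piv != np.arange(len(piv)))
    # The diagonal of the packed matrix is the diagonal of U.
    return sign * np.prod(np.diag(lu_mat))

A = np.array([[2, 5, 8],
              [4, 3, 1],
              [1, 7, 9]], dtype=float)
print(determinant_lu_factor(A), np.linalg.det(A))
```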
Performance comparison shows when LU decomposition pays off:
import time
# Create a larger system
n = 500
A_large = np.random.rand(n, n)
b_vectors = np.random.rand(n, 10)
# Method 1: Direct solve for each system
start = time.time()
for b in b_vectors.T:
x = np.linalg.solve(A_large, b)
direct_time = time.time() - start
# Method 2: LU decomposition once, then solve
start = time.time()
P, L, U = lu(A_large)
for b in b_vectors.T:
    x = solve_lu(P.T, L, U, b)  # P.T because SciPy's P satisfies A = P @ L @ U
lu_time = time.time() - start
print(f"\nDirect solve (10 systems): {direct_time:.4f}s")
print(f"LU decomposition method: {lu_time:.4f}s")
print(f"Speedup: {direct_time/lu_time:.2f}x")
Common Pitfalls and Best Practices
Always handle edge cases and numerical issues properly:
def safe_lu_solve(A, b):
    """Safely solve Ax = b with error handling"""
    try:
        # Check if the matrix is square
        if A.shape[0] != A.shape[1]:
            raise ValueError("Matrix must be square")
        # Check if dimensions match
        if len(b) != A.shape[0]:
            raise ValueError("Dimension mismatch between A and b")
        # Perform LU decomposition (SciPy: A = P @ L @ U)
        P, L, U = lu(A)
        # Check for singularity (near-zero on the diagonal of U)
        if np.any(np.abs(np.diag(U)) < 1e-10):
            raise np.linalg.LinAlgError("Matrix is singular or nearly singular")
        # Solve the system (pass P.T because solve_lu expects P @ A = L @ U)
        x = solve_lu(P.T, L, U, b)
        # Verify solution quality
        residual = np.linalg.norm(A @ x - b)
        if residual > 1e-6:
            print(f"Warning: Large residual {residual:.2e}")
        return x
    except np.linalg.LinAlgError as e:
        print(f"Linear algebra error: {e}")
        return None
# Test with a singular matrix
A_singular = np.array([
    [1, 2, 3],
    [2, 4, 6],
    [1, 1, 1]
], dtype=float)
b = np.array([1, 2, 3], dtype=float)
result = safe_lu_solve(A_singular, b)
For large matrices, consider memory efficiency. Storing P, L, and U separately requires three n×n matrices, though library routines such as scipy.linalg.lu_factor pack L and U into a single array and store the permutation as an index vector. If memory is constrained, use in-place algorithms or sparse matrix implementations. For matrices larger than roughly 10,000×10,000, iterative methods often outperform dense direct factorization.
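For the sparse case, SciPy's SuperLU wrapper factorizes once and reuses the factors, mirroring the dense workflow. A minimal sketch with an invented diagonally dominant sparse matrix:

```python
import numpy as np
from scipy.sparse import eye as sparse_eye, random as sparse_random
from scipy.sparse.linalg import splu

# Invented example: a large, mostly-zero, diagonally dominant system
n = 1000
A_sparse = (sparse_random(n, n, density=0.001, random_state=0)
            + 10 * sparse_eye(n)).tocsc()
b = np.ones(n)

factor = splu(A_sparse)      # sparse LU with a fill-reducing column ordering
x = factor.solve(b)          # cheap to repeat for new right-hand sides
print("residual:", np.linalg.norm(A_sparse @ x - b))
```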
Never implement LU decomposition without pivoting in production code. A matrix like [[1e-20, 1], [1, 1]] will fail catastrophically without pivoting despite being perfectly well-conditioned. SciPy's implementation pivots automatically, so prefer it in production and reserve from-scratch implementations for learning or for cases that need specific customizations.
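To see the failure concretely, here is a sketch of naive, pivot-free elimination on that exact matrix: the tiny pivot 1e-20 produces a multiplier of about 1e20, and catastrophic cancellation wipes out the first component of the solution.

```python
import numpy as np

# Naive 2x2 elimination with NO pivoting (for demonstration only)
A = np.array([[1e-20, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0])

factor = A[1, 0] / A[0, 0]        # ~1e20: the huge multiplier
U = A.copy()
U[1] -= factor * U[0]             # U[1, 1] = 1 - 1e20: cancellation
y = np.array([b[0], b[1] - factor * b[0]])
x1 = y[1] / U[1, 1]
x0 = (y[0] - U[0, 1] * x1) / U[0, 0]
x_naive = np.array([x0, x1])

x_true = np.linalg.solve(A, b)    # LAPACK solve, with pivoting
print("naive (no pivoting):", x_naive)
print("with pivoting:      ", x_true)
```

On standard float64 hardware the naive route loses the first component entirely, while the pivoted solve returns both components close to one.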