PCA: Complete Guide with Examples
Key Insights
- PCA reduces dimensionality by projecting data onto directions of maximum variance, transforming correlated features into uncorrelated principal components through eigendecomposition of the covariance matrix
- Always standardize features before applying PCA since the algorithm is sensitive to feature scales—failure to do so will cause high-variance features to dominate the principal components
- Standard PCA only captures linear relationships and fails on manifold data like the Swiss roll; use kernel PCA or other manifold learning techniques for non-linear dimensionality reduction
Introduction to PCA
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible. It solves a fundamental problem in machine learning: the curse of dimensionality.
As dimensions increase, data becomes increasingly sparse. The volume of the space grows exponentially, requiring exponentially more data to maintain statistical significance. This makes algorithms slower, models more prone to overfitting, and visualization impossible beyond three dimensions.
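To make the sparsity concrete, here is a small illustrative experiment (random uniform points, not tied to any dataset used later): as dimensionality grows, a point's nearest and farthest neighbors become nearly equidistant, so notions of "closeness" degrade.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(200, d))
    # Distances from the first point to all others
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    print(f"d={d:4d}  min/max distance ratio: {dists.min() / dists.max():.3f}")
```

As d grows, the ratio climbs toward 1: every point is roughly as far away as every other, which is why distance-based methods struggle in high dimensions.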
PCA addresses this by finding new axes (principal components) that capture the most important patterns in your data. Use PCA when you need to:
- Reduce training time for machine learning models
- Visualize high-dimensional data in 2D or 3D
- Remove noise and redundant features
- Compress data for storage or transmission
Here’s a quick demonstration of PCA’s power:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
# Load high-dimensional data (64 features)
digits = load_digits()
X = digits.data
y = digits.target
print(f"Original shape: {X.shape}") # (1797, 64)
# Reduce to 2 dimensions
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Reduced shape: {X_reduced.shape}") # (1797, 2)
print(f"Variance preserved: {pca.explained_variance_ratio_.sum():.2%}")
# Visualize
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='tab10', alpha=0.6)
plt.colorbar(scatter)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('64D Digit Data Reduced to 2D')
plt.show()
We’ve compressed 64 dimensions down to 2 while retaining meaningful structure—digits cluster by class even in this drastically reduced space.
Mathematical Foundation
PCA works by finding directions of maximum variance in your data. The intuition: variance indicates information. If a feature barely varies, it tells you little. If it varies significantly, it’s informative.
The algorithm finds these directions through linear algebra:
- Covariance Matrix: Measures how features vary together. For features X and Y, positive covariance means they increase together; negative means they move oppositely.
- Eigenvectors and Eigenvalues: The covariance matrix's eigenvectors point in directions of variance. Their corresponding eigenvalues indicate how much variance exists in each direction.
- Principal Components: Eigenvectors with the largest eigenvalues become your principal components—the new axes for your data.
Let’s see this manually on a simple 2D dataset:
import numpy as np
import matplotlib.pyplot as plt
# Create sample data
np.random.seed(42)
X = np.random.randn(100, 2)
X[:, 1] = X[:, 0] * 2 + np.random.randn(100) * 0.5 # Create correlation
# Step 1: Center the data (subtract mean)
X_centered = X - X.mean(axis=0)
# Step 2: Compute covariance matrix
cov_matrix = np.cov(X_centered.T)
print("Covariance Matrix:")
print(cov_matrix)
# Step 3: Compute eigenvectors and eigenvalues (eigh is designed for symmetric
# matrices like covariance matrices and always returns real values)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
print(f"\nEigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")
# Step 4: Sort by eigenvalue (descending)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Visualize
plt.figure(figsize=(10, 5))
plt.scatter(X_centered[:, 0], X_centered[:, 1], alpha=0.6)
origin = np.zeros(2)
plt.quiver(*origin, *eigenvectors[:, 0], color='r', scale=3, label='PC1')
plt.quiver(*origin, *eigenvectors[:, 1], color='g', scale=3, label='PC2')
plt.axis('equal')
plt.legend()
plt.title('Principal Components as Eigenvectors')
plt.show()
# Explained variance ratio
explained_var = eigenvalues / eigenvalues.sum()
print(f"\nExplained variance ratio: {explained_var}")
The first principal component (red arrow) points in the direction of maximum variance. The second is orthogonal to it, capturing remaining variance.
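Worth knowing: most libraries, scikit-learn included, actually compute PCA via the singular value decomposition (SVD) of the centered data matrix rather than an explicit eigendecomposition, for numerical stability. A quick sketch verifying the two routes agree (up to sign) on correlated 2D data like the above:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
X[:, 1] = X[:, 0] * 2 + rng.normal(size=100) * 0.5
X_centered = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered.T))
order = eigenvalues.argsort()[::-1]  # descending variance

# Route 2: SVD of the centered data; rows of Vt are the components
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Same directions up to sign; eigenvalues equal squared singular
# values divided by (n - 1)
print(np.allclose(np.abs(eigenvectors[:, order]), np.abs(Vt.T)))  # True
print(np.allclose(S**2 / (len(X) - 1), eigenvalues[order]))       # True
```

The SVD route avoids forming the covariance matrix at all, which matters when features are numerous or nearly collinear.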
Implementing PCA from Scratch
Understanding the algorithm deeply means implementing it yourself. Here’s PCA in pure NumPy:
import numpy as np
class PCAFromScratch:
    def __init__(self, n_components):
        self.n_components = n_components
        self.components_ = None
        self.mean_ = None
        self.explained_variance_ = None
    def fit(self, X):
        # Center the data
        self.mean_ = np.mean(X, axis=0)
        X_centered = X - self.mean_
        # Compute covariance matrix
        cov_matrix = np.cov(X_centered.T)
        # eigh handles the symmetric covariance matrix and returns real values
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
        # Sort by eigenvalue descending
        idx = eigenvalues.argsort()[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        # Store first n components (one component per row, as sklearn does)
        self.components_ = eigenvectors[:, :self.n_components].T
        self.explained_variance_ = eigenvalues[:self.n_components]
        return self
    def transform(self, X):
        # Center and project onto principal components
        X_centered = X - self.mean_
        return X_centered @ self.components_.T
    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)
# Compare with sklearn
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
# Our implementation
pca_custom = PCAFromScratch(n_components=2)
X_custom = pca_custom.fit_transform(X)
# sklearn implementation
pca_sklearn = PCA(n_components=2)
X_sklearn = pca_sklearn.fit_transform(X)
# Compare results (may differ by sign, which is fine)
print("Custom PCA shape:", X_custom.shape)
print("Sklearn PCA shape:", X_sklearn.shape)
print("\nExplained variance (custom):", pca_custom.explained_variance_)
print("Explained variance (sklearn):", pca_sklearn.explained_variance_)
print("\nResults match (allowing sign flip):",
      np.allclose(np.abs(X_custom), np.abs(X_sklearn)))
The two implementations agree. Any sign difference is expected: PCA components are defined only up to sign.
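The components also support the reverse mapping: projecting reduced data back into the original space, which is exact only when all components are kept. A short sketch using scikit-learn's inverse_transform on the iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)

# Manual reconstruction: project back onto the components, re-add the mean
X_manual = X_reduced @ pca.components_ + pca.mean_
print(np.allclose(X_manual, pca.inverse_transform(X_reduced)))  # True

# Reconstruction error shrinks as more components are kept
for k in [1, 2, 3, 4]:
    p = PCA(n_components=k).fit(X)
    X_hat = p.inverse_transform(p.transform(X))
    print(f"k={k}  mean squared reconstruction error: {np.mean((X - X_hat)**2):.5f}")
```

With all 4 components retained, the reconstruction is exact up to floating-point error; the dropped-component error is exactly the variance PCA discards.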
Practical Applications
Choosing the Number of Components
Never arbitrarily pick the number of components. Use the explained variance ratio:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)
# Fit PCA with all components
pca = PCA()
pca.fit(X)
# Plot cumulative explained variance
cumsum = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(10, 6))
plt.plot(cumsum, linewidth=2)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% variance')
plt.axhline(y=0.90, color='g', linestyle='--', label='90% variance')
plt.legend()
plt.grid(True)
plt.show()
# Find n_components for 95% variance
n_components_95 = np.argmax(cumsum >= 0.95) + 1
print(f"Components needed for 95% variance: {n_components_95}")
A common heuristic: retain 90-95% of variance. For the digits dataset, you need only 21 components (out of 64) to preserve 90% of variance.
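scikit-learn can make this choice for you: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)

# Ask directly for 95% of the variance instead of a component count
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(f"Components selected: {pca.n_components_}")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")
```

This keeps the selection logic out of your own code and recomputes it automatically if the data changes.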
MNIST Dimensionality Reduction
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from time import time
# Load MNIST
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False, parser='auto')
X = X / 255.0 # Normalize
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
# Train on full 784 dimensions
start = time()
clf_full = LogisticRegression(max_iter=100, solver='saga')
clf_full.fit(X_train, y_train)
time_full = time() - start
score_full = clf_full.score(X_test, y_test)
# Train on 50 PCA components
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
start = time()
clf_pca = LogisticRegression(max_iter=100, solver='saga')
clf_pca.fit(X_train_pca, y_train)
time_pca = time() - start
score_pca = clf_pca.score(X_test_pca, y_test)
print(f"Full dimensions - Time: {time_full:.2f}s, Accuracy: {score_full:.4f}")
print(f"PCA 50 dims - Time: {time_pca:.2f}s, Accuracy: {score_pca:.4f}")
print(f"Speedup: {time_full/time_pca:.2f}x")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")
You’ll see dramatic speedups (often 5-10x) with minimal accuracy loss.
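Fitting the PCA itself can also be a bottleneck at this scale. scikit-learn picks a solver automatically, but you can request the randomized SVD solver explicitly; it approximates only the leading components and is typically much faster when n_components is small relative to the feature count. A sketch on synthetic data of MNIST-like width:

```python
import numpy as np
from sklearn.decomposition import PCA
from time import time

X = np.random.randn(5000, 784)

for solver in ["full", "randomized"]:
    start = time()
    pca = PCA(n_components=50, svd_solver=solver, random_state=0)
    pca.fit(X)
    print(f"{solver:>10s} solver: {time() - start:.2f}s")
```

Both solvers recover essentially the same leading components; randomized SVD simply skips computing the hundreds of trailing ones you are about to discard.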
Image Compression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_sample_image
from sklearn.decomposition import PCA
# Load sample image
image = load_sample_image('flower.jpg')
image = image / 255.0 # Normalize
# Apply PCA to each color channel
n_components = 50
compressed_channels = []
for i in range(3):  # RGB channels
    channel = image[:, :, i]
    pca = PCA(n_components=n_components)
    transformed = pca.fit_transform(channel)
    reconstructed = pca.inverse_transform(transformed)
    compressed_channels.append(reconstructed)
compressed_image = np.stack(compressed_channels, axis=2)
compressed_image = np.clip(compressed_image, 0, 1)
# Calculate compression ratio
original_size = image.shape[0] * image.shape[1] * 3
compressed_size = (pca.components_.shape[0] * pca.components_.shape[1] +
                   transformed.shape[0] * transformed.shape[1]) * 3
compression_ratio = original_size / compressed_size
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].imshow(image)
axes[0].set_title('Original')
axes[1].imshow(compressed_image)
axes[1].set_title(f'Compressed ({n_components} components)\n'
                  f'Ratio: {compression_ratio:.2f}:1')
plt.show()
PCA Best Practices and Pitfalls
Feature Scaling is Critical
PCA is sensitive to feature scales. Features with larger ranges dominate the principal components:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Create data with different scales
X = np.random.randn(100, 2)
X[:, 0] = X[:, 0] * 1000  # First feature has a much larger scale
# PCA without scaling
pca_no_scale = PCA(n_components=2)
pca_no_scale.fit(X)
print("Without scaling:")
print(f"Explained variance ratio: {pca_no_scale.explained_variance_ratio_}")
# PCA with scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca_scaled = PCA(n_components=2)
pca_scaled.fit(X_scaled)
print("\nWith scaling:")
print(f"Explained variance ratio: {pca_scaled.explained_variance_ratio_}")
Without scaling, the first component captures 99.9% of variance simply because one feature has a larger scale. Always use StandardScaler before PCA.
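To keep the scaler (and PCA itself) from peeking at test data, a common pattern is to chain both with the model in a Pipeline, so each step is refit only on the training fold during cross-validation. A sketch on the breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler and PCA are fit on each training fold only, never on held-out data
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=10),
                     LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.3f}")
```

Calling fit_transform on the full dataset before splitting would leak test-set means and variances into training; the Pipeline makes that mistake impossible.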
PCA Fails on Non-Linear Data
PCA assumes linear relationships. It fails spectacularly on manifold data:
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
# Generate Swiss roll (non-linear manifold)
X_swiss, color = make_swiss_roll(n_samples=1000, noise=0.1)
# Standard PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_swiss)
# Kernel PCA (RBF kernel)
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.1)
X_kpca = kpca.fit_transform(X_swiss)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=color, cmap='Spectral')
axes[0].set_title('Standard PCA (cannot unroll the manifold)')
axes[1].scatter(X_kpca[:, 0], X_kpca[:, 1], c=color, cmap='Spectral')
axes[1].set_title('Kernel PCA (captures non-linear structure)')
plt.show()
Standard PCA cannot unroll the Swiss roll because projecting onto linear axes collapses the curved manifold. Kernel PCA applies the kernel trick: it implicitly maps the data into a higher-dimensional feature space and performs linear PCA there, which lets it capture non-linear relationships.
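Kernel PCA introduces hyperparameters (kernel, gamma) with no closed-form choice. One common approach, sketched here with an illustrative binary label derived from the roll position, is to tune them end-to-end against a downstream task via grid search:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, color = make_swiss_roll(n_samples=500, noise=0.1, random_state=0)
y = (color > color.mean()).astype(int)  # Illustrative target from roll position

pipe = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {"kpca__gamma": [0.01, 0.03, 0.1, 0.3]}, cv=3)
grid.fit(X, y)
print("Best gamma:", grid.best_params_["kpca__gamma"])
print(f"Best CV accuracy: {grid.best_score_:.3f}")
```

When there is no downstream task, an alternative is to pick the gamma minimizing reconstruction error via KernelPCA's fit_inverse_transform option.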
Advanced Techniques
Incremental PCA for Large Datasets
When data doesn’t fit in memory, use Incremental PCA:
import numpy as np
from sklearn.decomposition import IncrementalPCA
# Simulate a large dataset arriving in batches
n_samples = 10000
n_features = 100
batch_size = 1000
ipca = IncrementalPCA(n_components=10, batch_size=batch_size)
for i in range(0, n_samples, batch_size):
    X_batch = np.random.randn(batch_size, n_features)
    ipca.partial_fit(X_batch)
print(f"Explained variance ratio: {ipca.explained_variance_ratio_}")
Incremental PCA processes data in chunks, making it suitable for datasets that exceed available RAM.
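IncrementalPCA can also be pointed at a memory-mapped array with a plain fit call, which streams batches from disk instead of requiring an explicit partial_fit loop; a sketch using a temporary .npy file:

```python
import os
import tempfile

import numpy as np
from sklearn.decomposition import IncrementalPCA

# Write a dataset to disk, then memory-map it read-only
path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, np.random.randn(10000, 100))
X_mm = np.load(path, mmap_mode="r")

ipca = IncrementalPCA(n_components=10, batch_size=1000)
ipca.fit(X_mm)  # Reads batch_size rows at a time from disk
print(ipca.components_.shape)  # (10, 100)
```

This is handy when the data already lives in a single large file; the partial_fit loop remains the right tool when batches arrive from a stream or database.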
Sparse PCA
Sparse PCA adds L1 regularization to create interpretable components with many zero coefficients:
import numpy as np
from sklearn.decomposition import PCA, SparsePCA
X = np.random.randn(100, 50)
# Standard PCA - dense components
pca = PCA(n_components=5)
pca.fit(X)
print(f"Standard PCA - Non-zero coefficients: {np.count_nonzero(pca.components_)}")
# Sparse PCA - sparse components
spca = SparsePCA(n_components=5, alpha=1.0)
spca.fit(X)
print(f"Sparse PCA - Non-zero coefficients: {np.count_nonzero(spca.components_)}")
Sparse PCA sacrifices some explained variance for interpretability—useful when you need to understand which original features contribute to each component.
Conclusion
PCA is a foundational technique for dimensionality reduction. Use it when you need faster training, better visualization, or noise reduction. Remember these key points:
- Always standardize features before applying PCA
- Choose components based on explained variance, not arbitrary numbers
- PCA only captures linear relationships—use kernel PCA or manifold learning for non-linear data
- Consider incremental PCA for large datasets and sparse PCA for interpretability
PCA works exceptionally well for many real-world problems, but it’s not a universal solution. For non-linear manifolds, explore t-SNE, UMAP, or autoencoders. For supervised dimensionality reduction, consider Linear Discriminant Analysis (LDA).
Start with PCA as your default dimensionality reduction technique. It’s fast, interpretable, and effective for most applications. Only reach for more complex methods when PCA demonstrably fails on your specific problem.