How to Implement PCA in Python

Key Insights

  • PCA reduces dimensionality by projecting data onto directions of maximum variance, transforming correlated features into uncorrelated principal components that capture the most important patterns in your data.
  • Always standardize your features before applying PCA—variables with larger scales will dominate the principal components and produce misleading results.
  • Use the cumulative explained variance plot to choose components: aim for 95% variance retention as a starting point, but adjust based on your specific use case and computational constraints.

Introduction to PCA

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible. Instead of working with hundreds or thousands of potentially correlated features, PCA lets you capture the essential patterns in just a handful of uncorrelated components.

The practical applications are compelling: visualize high-dimensional datasets in 2D or 3D, speed up machine learning algorithms by reducing feature count, remove noise from data, and compress images. If you’ve ever wondered how to plot 784-dimensional MNIST digits on a scatter plot or why your neural network trains faster after PCA preprocessing, you’re in the right place.

The Mathematics Behind PCA

PCA works by finding the directions (principal components) along which your data varies the most. Think of it as rotating your coordinate system to align with the natural structure of your data.

The key concepts are straightforward:

  • Variance measures spread in your data. PCA seeks directions of maximum variance.
  • Covariance matrix captures how features vary together.
  • Eigenvectors of the covariance matrix define the principal component directions.
  • Eigenvalues indicate how much variance each component explains.
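
To make the eigenvalue/eigenvector connection concrete, here is a minimal numerical sketch: decompose a small covariance matrix and read off the component directions and the variance each one explains.

```python
import numpy as np

# A 2x2 covariance matrix for two correlated features
cov = np.array([[3.0, 1.5],
                [1.5, 1.0]])

# eigh is designed for symmetric matrices; it returns real eigenvalues
# in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the largest-variance direction (PC1) comes first
order = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print("Variance along each component:", eigenvalues)
print("Fraction of total variance:", eigenvalues / eigenvalues.sum())
print("PC1 direction:", eigenvectors[:, 0])
# PC1 captures about 95% of the total variance for this matrix
```

For this particular covariance matrix, almost all of the variance lies along a single direction, which is exactly the situation where PCA pays off.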

Let’s visualize this with a simple 2D example:

import numpy as np
import matplotlib.pyplot as plt

# Generate correlated 2D data
np.random.seed(42)
mean = [0, 0]
cov = [[3, 1.5], [1.5, 1]]
data = np.random.multivariate_normal(mean, cov, 300)

plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0], data[:, 1], alpha=0.6)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Original Data with Correlation')
plt.axis('equal')
plt.grid(True)
plt.show()

The data spreads more along one diagonal direction than the perpendicular direction. PCA identifies these directions mathematically.

Manual PCA Implementation with NumPy

Building PCA from scratch clarifies what’s happening under the hood. Here’s the complete process:

import numpy as np

def pca_from_scratch(X, n_components):
    """
    Implement PCA manually using NumPy.
    
    Parameters:
    X: array-like, shape (n_samples, n_features)
    n_components: int, number of components to keep
    
    Returns:
    X_transformed: projected data
    components: principal components
    explained_variance_ratio: variance explained by each component
    """
    # Step 1: Standardize the data (zero mean, unit variance)
    X_mean = np.mean(X, axis=0)
    X_std = np.std(X, axis=0)
    X_standardized = (X - X_mean) / X_std
    
    # Step 2: Compute covariance matrix
    cov_matrix = np.cov(X_standardized.T)
    
    # Step 3: Calculate eigenvalues and eigenvectors
    # (use eigh, which is designed for symmetric matrices and returns
    # real values; the general-purpose eig can return complex values
    # due to floating-point noise)
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
    
    # Step 4: Sort by eigenvalues in descending order
    # (eigh returns them in ascending order)
    idx = eigenvalues.argsort()[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    
    # Step 5: Select top k eigenvectors
    components = eigenvectors[:, :n_components]
    
    # Step 6: Transform the data
    X_transformed = X_standardized @ components
    
    # Calculate explained variance ratio
    explained_variance_ratio = eigenvalues / np.sum(eigenvalues)
    
    return X_transformed, components, explained_variance_ratio[:n_components]

# Test with our 2D data
X_pca, components, var_ratio = pca_from_scratch(data, n_components=2)

print(f"Explained variance ratio: {var_ratio}")
print(f"Principal components shape: {components.shape}")

This implementation reveals the core algorithm: standardize, compute covariance, find eigenvectors, sort by importance, and project.
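
As a sanity check, and a peek at how production implementations work, the same components can be obtained from the SVD of the standardized data matrix; scikit-learn's PCA uses SVD internally rather than an explicit covariance eigendecomposition. A NumPy-only sketch (the two routes agree up to a sign flip per component):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3, 1.5], [1.5, 1]], size=300)

# Standardize, as before
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD route: the right singular vectors (rows of Vt) are the principal
# components, and squared singular values give the explained variance
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
explained_variance = S**2 / (len(X_std) - 1)

# Eigendecomposition route for comparison
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_std.T))
eigenvalues = eigenvalues[::-1]          # eigh returns ascending order
eigenvectors = eigenvectors[:, ::-1]

# The two agree up to the sign of each component
print(np.allclose(explained_variance, eigenvalues))
print(np.allclose(np.abs(Vt.T), np.abs(eigenvectors)))
```

The SVD route is numerically more stable and avoids forming the covariance matrix at all, which matters when the feature count is large.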

PCA Using scikit-learn

For production code, use scikit-learn’s optimized implementation:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first (critical step!)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Fit PCA
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.2%}")

# Access the principal components (eigenvectors)
print(f"Components shape: {pca.components_.shape}")

# Reconstruct original data (with information loss if n_components < n_features)
data_reconstructed = pca.inverse_transform(data_pca)
data_original_scale = scaler.inverse_transform(data_reconstructed)

The sklearn API is clean: fit the model, transform your data, and access explained variance through attributes. The inverse_transform method is particularly useful for understanding information loss.
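
To quantify that information loss, one quick check is to reduce, reconstruct, and measure the error at different component counts. A sketch using the same kind of correlated 2D data (regenerated here so the snippet stands alone):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
data = rng.multivariate_normal([0, 0], [[3, 1.5], [1.5, 1]], size=300)

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

for k in (1, 2):
    pca = PCA(n_components=k)
    reduced = pca.fit_transform(data_scaled)
    reconstructed = pca.inverse_transform(reduced)
    mse = np.mean((data_scaled - reconstructed) ** 2)
    print(f"{k} component(s): variance kept "
          f"{pca.explained_variance_ratio_.sum():.2%}, reconstruction MSE {mse:.4f}")
```

Keeping all components reconstructs the data exactly (MSE of essentially zero); dropping a component trades reconstruction error for dimensionality.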

Practical Example: MNIST Dimensionality Reduction

Let’s apply PCA to scikit-learn’s digits dataset, a smaller cousin of MNIST, reducing 64 dimensions (8×8 pixels) to 2 for visualization (the same workflow scales to full 784-dimensional MNIST):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load digits dataset (smaller version of MNIST)
digits = load_digits()
X, y = digits.data, digits.target

print(f"Original shape: {X.shape}")  # (1797, 64)

# Standardize and apply PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualize
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.6)
plt.colorbar(scatter, label='Digit')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
plt.title('MNIST Digits Projected onto 2 Principal Components')
plt.show()

print(f"Variance explained by 2 components: {pca.explained_variance_ratio_.sum():.2%}")

Notice how different digits cluster in different regions. Even with just 2 components capturing only around 20% of the variance, we see meaningful structure.

Now let’s compare classification performance:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Original data
rf_original = RandomForestClassifier(n_estimators=100, random_state=42)
scores_original = cross_val_score(rf_original, X_scaled, y, cv=5)

# PCA with 20 components
pca_20 = PCA(n_components=20)
X_pca_20 = pca_20.fit_transform(X_scaled)
rf_pca = RandomForestClassifier(n_estimators=100, random_state=42)
scores_pca = cross_val_score(rf_pca, X_pca_20, y, cv=5)

print(f"Original (64 features): {scores_original.mean():.3f} (+/- {scores_original.std():.3f})")
print(f"PCA (20 features): {scores_pca.mean():.3f} (+/- {scores_pca.std():.3f})")
print(f"Variance retained: {pca_20.explained_variance_ratio_.sum():.2%}")

You’ll typically see minimal accuracy loss with 70% fewer features—a worthwhile trade-off for faster training and reduced overfitting risk.
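
One refinement worth making in your own experiments: fitting the scaler and PCA on the full dataset before cross-validation leaks test-fold statistics into training. Wrapping the steps in a Pipeline re-fits them inside each fold. A sketch reusing the digits data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Scaler and PCA are fit on the training folds only, inside each CV split
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=20)),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Leak-free CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On a dataset this size the leakage barely moves the numbers, but on smaller or noisier datasets it can meaningfully inflate your estimates.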

Choosing the Number of Components

Don’t guess the number of components. Use data-driven approaches:

# Fit PCA with all components
pca_full = PCA()
pca_full.fit(X_scaled)

# Plot explained variance ratio (scree plot)
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(range(1, len(pca_full.explained_variance_ratio_) + 1), 
         pca_full.explained_variance_ratio_, 'bo-')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot')
plt.grid(True)

# Cumulative explained variance
plt.subplot(1, 2, 2)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cumsum) + 1), cumsum, 'ro-')
plt.axhline(y=0.95, color='k', linestyle='--', label='95% threshold')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Variance Explained')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

# Find number of components for 95% variance
n_components_95 = np.argmax(cumsum >= 0.95) + 1
print(f"Components needed for 95% variance: {n_components_95}")

The scree plot shows where adding components yields diminishing returns (look for the “elbow”). The cumulative plot directly answers “how many components for X% variance?”
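
scikit-learn can also do this bookkeeping for you: passing a float between 0 and 1 as n_components tells PCA to keep however many components are needed to reach that cumulative variance fraction.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed to retain 95% of the variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)

print(f"Components kept: {pca_95.n_components_}")
print(f"Variance retained: {pca_95.explained_variance_ratio_.sum():.2%}")
```

This is handy in pipelines, where you want the component count chosen per training fold rather than hard-coded.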

Best Practices and Common Pitfalls

Always standardize your features. This cannot be overstated. Here’s what happens when you don’t:

# Without scaling
pca_no_scale = PCA(n_components=2)
X_pca_no_scale = pca_no_scale.fit_transform(X)

# With scaling
pca_with_scale = PCA(n_components=2)
X_pca_with_scale = pca_with_scale.fit_transform(X_scaled)

print("Without scaling - variance ratio:", pca_no_scale.explained_variance_ratio_)
print("With scaling - variance ratio:", pca_with_scale.explained_variance_ratio_)

Features with larger scales dominate the principal components, making results meaningless.

Other critical considerations:

  • PCA assumes linear relationships. For non-linear manifolds, consider t-SNE or UMAP for visualization, or kernel PCA for modeling.
  • Principal components are not interpretable. They’re linear combinations of all original features. Don’t try to explain PC1 as “this represents image brightness”—it’s more complex.
  • PCA is sensitive to outliers. Consider robust scaling or outlier removal first.
  • Temporal data requires care. Don’t apply PCA across time series without understanding that you’re mixing temporal information.
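
For the non-linear case mentioned above, here is a sketch of kernel PCA on concentric circles, a shape no linear projection can untangle (make_circles and the RBF kernel with gamma=10 are illustrative choices, not tuned values):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings: no linear projection can separate them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=42)

# Linear PCA just rotates the data; the rings stay nested
X_linear = PCA(n_components=2).fit_transform(X)

# An RBF kernel implicitly maps to a space where the rings pull apart
X_kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# In kernel-PCA space the first component alone largely distinguishes the rings
inner = X_kernel[y == 1, 0].mean()
outer = X_kernel[y == 0, 0].mean()
print(f"Mean PC1 (inner ring): {inner:.3f}, (outer ring): {outer:.3f}")
```

Plotting X_linear versus X_kernel side by side makes the difference obvious: the linear projection still shows two nested rings, while the kernel projection spreads them into separable groups.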

When to use PCA:

  • High-dimensional data visualization
  • Speeding up algorithms (especially distance-based methods)
  • Noise reduction in signals/images
  • Addressing multicollinearity before regression

When to skip PCA:

  • Your features are already uncorrelated
  • You need interpretable features (use feature selection instead)
  • You have non-linear relationships (try kernel PCA or autoencoders)
  • You have very few features (the overhead isn’t worth it)

PCA is a powerful tool, but it’s not magic. Use it when dimensionality is genuinely problematic, always standardize your data, and validate that your reduced representation still captures the patterns you care about.
