How to Implement UMAP in Python


Key Insights

  • UMAP generally beats t-SNE on speed while preserving both local and global data structure, making it well suited to large datasets (10,000+ samples)
  • The n_neighbors parameter controls the balance between local and global structure (typically 5-50), while min_dist determines how tightly points cluster (0.0-0.99)
  • Supervised UMAP with target labels can markedly sharpen cluster separation and improve downstream classification performance

Introduction to UMAP

Uniform Manifold Approximation and Projection (UMAP) has rapidly become the go-to dimensionality reduction technique for modern machine learning workflows. Unlike PCA, which only captures linear relationships, or t-SNE, which struggles with computational efficiency, UMAP excels at preserving both local neighborhoods and global data structure while maintaining reasonable runtime performance.

UMAP shines in three primary scenarios: visualizing high-dimensional data in 2D or 3D space, preprocessing features before clustering or classification, and exploring dataset structure during exploratory data analysis. Where t-SNE might take hours on a dataset of 100,000 samples, UMAP typically finishes in minutes. Where PCA loses non-linear patterns, UMAP can capture them.

The algorithm works by constructing a high-dimensional graph representation of your data, then optimizing a low-dimensional graph to be as structurally similar as possible. This mathematical foundation gives UMAP its unique ability to maintain meaningful distances at multiple scales simultaneously.

Installation and Setup

Getting started with UMAP requires the umap-learn library and its dependencies. Install everything you need with pip:

pip install umap-learn numpy scikit-learn matplotlib seaborn

For working with large datasets, consider installing additional acceleration libraries:

pip install pynndescent  # Faster nearest neighbor search

Here are the essential imports for a typical UMAP workflow:

import numpy as np
import umap
from sklearn.datasets import load_digits, load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better visualizations
sns.set_theme(style='white', context='notebook', rc={'figure.figsize': (12, 8)})

Basic UMAP Implementation

Let’s start with a straightforward example using the digits dataset, which contains 1,797 samples of 8x8 pixel handwritten digits (64 features per sample):

# Load the digits dataset
digits = load_digits()
X, y = digits.data, digits.target

print(f"Original shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")

# Create and fit UMAP
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(X)

print(f"Reduced shape: {embedding.shape}")

# Visualize the results
plt.figure(figsize=(12, 8))
scatter = plt.scatter(embedding[:, 0], embedding[:, 1], 
                     c=y, cmap='Spectral', s=5, alpha=0.7)
plt.colorbar(scatter, label='Digit')
plt.title('UMAP Projection of Digits Dataset')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.tight_layout()
plt.show()

With default parameters, UMAP produces clean separation between digit classes. The entire process takes seconds, and you can immediately see cluster structure that corresponds to the different digits.

Key Hyperparameters

Understanding UMAP’s hyperparameters is crucial for getting optimal results. The two most important parameters control fundamentally different aspects of the embedding:

n_neighbors determines how UMAP balances local versus global structure. Low values (5-15) focus on local patterns and create tighter, more fragmented clusters. High values (50-200) emphasize global structure and produce more connected embeddings. For most applications, values between 10 and 50 work well.

min_dist controls how tightly UMAP packs points together in the low-dimensional space. Values close to 0.0 create tight, distinct clusters. Values approaching 1.0 produce more evenly distributed embeddings. The default of 0.1 works well for visualization, but use 0.0 for maximum cluster separation.

Here’s a comparison showing these parameters’ effects:

from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# Test different parameter combinations
params = [
    {'n_neighbors': 5, 'min_dist': 0.0, 'title': 'n_neighbors=5, min_dist=0.0'},
    {'n_neighbors': 5, 'min_dist': 0.5, 'title': 'n_neighbors=5, min_dist=0.5'},
    {'n_neighbors': 50, 'min_dist': 0.0, 'title': 'n_neighbors=50, min_dist=0.0'},
    {'n_neighbors': 50, 'min_dist': 0.5, 'title': 'n_neighbors=50, min_dist=0.5'},
]

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

for idx, param_set in enumerate(params):
    reducer = umap.UMAP(
        n_neighbors=param_set['n_neighbors'],
        min_dist=param_set['min_dist'],
        random_state=42
    )
    embedding = reducer.fit_transform(X)
    
    axes[idx].scatter(embedding[:, 0], embedding[:, 1], 
                     c=y, cmap='Spectral', s=3, alpha=0.6)
    axes[idx].set_title(param_set['title'])
    axes[idx].set_xlabel('UMAP 1')
    axes[idx].set_ylabel('UMAP 2')

plt.tight_layout()
plt.show()

Other important parameters include n_components (default 2; increase for more output dimensions), metric (default 'euclidean'; 'cosine' often works better for text embeddings), and n_epochs (the number of optimization iterations, auto-selected by default).

Real-World Application: High-Dimensional Data

Let’s tackle a realistic scenario: reducing MNIST digit images (784 dimensions) for visualization and downstream analysis. This example includes proper preprocessing:

from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
import time

# Load MNIST (this may take a moment on first run)
print("Loading MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, parser='auto')
X, y = mnist.data.values, mnist.target.values.astype(int)

# Use a seeded subset for a faster, reproducible demonstration
n_samples = 10000
rng = np.random.default_rng(42)
indices = rng.choice(X.shape[0], size=n_samples, replace=False)
X_subset = X[indices]
y_subset = y[indices]

# Scale the data (important for distance-based algorithms)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_subset)

print(f"Processing {X_scaled.shape[0]} samples with {X_scaled.shape[1]} features")

# Apply UMAP with optimized parameters for this dataset
start_time = time.time()
reducer = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='euclidean',
    random_state=42,
    verbose=True
)
embedding = reducer.fit_transform(X_scaled)
elapsed = time.time() - start_time

print(f"UMAP completed in {elapsed:.2f} seconds")

# Create an enhanced visualization
plt.figure(figsize=(14, 10))
scatter = plt.scatter(embedding[:, 0], embedding[:, 1], 
                     c=y_subset, cmap='tab10', 
                     s=8, alpha=0.6, edgecolors='none')
plt.colorbar(scatter, label='Digit Class', ticks=range(10))
plt.title(f'UMAP Projection of {n_samples} MNIST Samples\n'
          f'n_neighbors=15, min_dist=0.1, time={elapsed:.2f}s')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

This produces a clear embedding where digits cluster by class, with similar digits (like 4 and 9, or 3 and 5) positioned near each other—demonstrating UMAP’s ability to preserve meaningful structure.

Integration with Machine Learning Pipelines

UMAP excels as a preprocessing step before classification or clustering. Here’s a complete pipeline comparing model performance with and without UMAP:

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Load and prepare data
digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pipeline WITHOUT UMAP
print("Training without UMAP...")
clf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
clf_baseline.fit(X_train, y_train)
y_pred_baseline = clf_baseline.predict(X_test)
acc_baseline = accuracy_score(y_test, y_pred_baseline)

# Pipeline WITH UMAP
print("Training with UMAP preprocessing...")
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('umap', umap.UMAP(n_components=10, n_neighbors=15, random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
y_pred_umap = pipeline.predict(X_test)
acc_umap = accuracy_score(y_test, y_pred_umap)

print(f"\nResults:")
print(f"Baseline accuracy: {acc_baseline:.4f}")
print(f"With UMAP accuracy: {acc_umap:.4f}")
print(f"Difference: {(acc_umap - acc_baseline):+.4f}")

For clustering applications, UMAP often reveals structure that algorithms like K-means can exploit more effectively:

from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Reduce dimensions with UMAP
reducer = umap.UMAP(n_components=10, random_state=42)
X_reduced = reducer.fit_transform(X)

# Cluster in reduced space
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_reduced)

# Evaluate clustering quality
ari_score = adjusted_rand_score(y, clusters)
print(f"Adjusted Rand Index: {ari_score:.4f}")
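For context, the same K-means run directly on the raw 64-dimensional pixels gives a baseline ARI to compare against (a quick self-contained sketch):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)

# Baseline: K-means on the raw 64-dimensional pixels, no dimensionality reduction
kmeans_raw = KMeans(n_clusters=10, random_state=42, n_init=10)
clusters_raw = kmeans_raw.fit_predict(X)

ari_raw = adjusted_rand_score(y, clusters_raw)
print(f"Raw-data ARI: {ari_raw:.4f}")  # compare against the UMAP-space ARI above
```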

Performance Tips and Best Practices

Supervised UMAP leverages label information during embedding construction, producing superior separation for classification tasks:

# Standard unsupervised UMAP
reducer_unsupervised = umap.UMAP(random_state=42)
embedding_unsupervised = reducer_unsupervised.fit_transform(X)

# Supervised UMAP with target labels
reducer_supervised = umap.UMAP(random_state=42)
embedding_supervised = reducer_supervised.fit_transform(X, y=y)

# Visualize both
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

ax1.scatter(embedding_unsupervised[:, 0], embedding_unsupervised[:, 1],
           c=y, cmap='Spectral', s=5, alpha=0.7)
ax1.set_title('Unsupervised UMAP')

ax2.scatter(embedding_supervised[:, 0], embedding_supervised[:, 1],
           c=y, cmap='Spectral', s=5, alpha=0.7)
ax2.set_title('Supervised UMAP')

plt.tight_layout()
plt.show()

Key best practices:

  1. Always scale your data before applying UMAP, especially when features have different units or ranges
  2. Use cosine distance for text embeddings or normalized vectors
  3. Start with n_neighbors=15 and adjust based on your data size (larger datasets benefit from higher values)
  4. Set min_dist=0.0 when you need maximum cluster separation for downstream clustering
  5. Increase n_components to 10-50 when using UMAP for preprocessing rather than visualization
  6. Use supervised mode when you have labels and care about class separation

Avoid these pitfalls:

  • Don’t interpret distances between clusters as meaningful (UMAP optimizes local structure)
  • Don’t apply UMAP to already low-dimensional data (under 10 dimensions)
  • Don’t rely on UMAP embeddings for distance-based predictions on new data (the transform method projects unseen points only approximately)

UMAP transforms how we work with high-dimensional data. Its speed, quality, and flexibility make it indispensable for modern machine learning workflows. Start with the examples above, experiment with parameters on your specific data, and you’ll quickly develop intuition for when and how to deploy this powerful technique.
