How to Implement t-SNE in Python
Key Insights
- t-SNE excels at visualizing high-dimensional data by preserving local structure, making it superior to PCA for finding clusters, but it’s non-deterministic and computationally expensive for datasets over 10,000 samples.
- The perplexity parameter (typically 5-50) controls local vs global structure balance—start with 30 and adjust based on your dataset size; always preprocess with PCA for dimensions above 50.
- t-SNE visualizations are for exploration only: distances between clusters are meaningless, and you cannot transform new data points, so never use t-SNE features for downstream machine learning tasks.
Introduction to t-SNE
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique designed specifically for visualization. Unlike PCA, which preserves global variance, t-SNE focuses on maintaining local neighborhood structure, making it exceptional at revealing clusters in high-dimensional data.
Use t-SNE when you need to visualize complex datasets with hidden patterns—image features, word embeddings, gene expression data, or customer behavior profiles. Use PCA when you need interpretable components, want to transform new data, or require a deterministic, computationally efficient solution. PCA is also your preprocessing step before t-SNE for high-dimensional data.
The key difference: PCA is linear and reversible; t-SNE is non-linear and strictly for visualization. You cannot meaningfully interpret distances between clusters in t-SNE plots, and you cannot use t-SNE to transform test data.
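You can verify that asymmetry directly against scikit-learn's API (a quick sketch on random data): PCA fits a reusable linear map, while TSNE exposes no transform method at all.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))
X_new = rng.normal(size=(5, 20))

# PCA learns a linear projection, so unseen points can be mapped later
pca = PCA(n_components=2).fit(X_train)
print(pca.transform(X_new).shape)    # (5, 2)

# TSNE only offers fit_transform: the embedding is optimized jointly
# for the exact points it was given, so there is no transform method
print(hasattr(TSNE(), 'transform'))  # False
```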
Understanding the Mathematics (High-Level)
t-SNE works by converting high-dimensional Euclidean distances into probability distributions. For each point, it calculates the probability that it would pick each other point as its neighbor, based on a Gaussian distribution centered at that point. It then creates a low-dimensional map and uses a t-distribution (with heavier tails) to calculate similar probabilities in the reduced space.
The algorithm minimizes the Kullback-Leibler divergence between these two probability distributions through gradient descent. The t-distribution in low-dimensional space prevents the “crowding problem”—it gives points more room to spread out, making clusters more distinct.
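The two similarity kernels can be written out in a few lines of numpy. This is a sketch of the objective only: it skips the per-point bandwidth search and the gradient descent that real t-SNE performs.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(-1)

rng = np.random.default_rng(0)
X_high = rng.normal(size=(6, 10))   # points in the original space
Y_low = rng.normal(size=(6, 2))     # a candidate low-dimensional map

# High-dimensional similarities: Gaussian kernel (fixed bandwidth here;
# real t-SNE tunes one bandwidth per point to match the perplexity)
P = np.exp(-pairwise_sq_dists(X_high) / 2.0)
np.fill_diagonal(P, 0.0)
P /= P.sum()

# Low-dimensional similarities: Student-t kernel with one degree of
# freedom, whose heavier tails relieve the crowding problem
Q = 1.0 / (1.0 + pairwise_sq_dists(Y_low))
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()

# The objective t-SNE minimizes by gradient descent: KL(P || Q)
mask = P > 0
kl = float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
print(f"KL divergence of this random map: {kl:.4f}")
```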
Here’s a simple demonstration of the distance preservation concept:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
# Create a toy dataset: three distinct clusters
np.random.seed(42)
cluster1 = np.random.randn(50, 10)        # centered at the origin
cluster2 = np.random.randn(50, 10) + 5    # shifted by +5 in every dimension
cluster3 = np.random.randn(50, 10) - 5    # shifted by -5 in every dimension
X = np.vstack([cluster1, cluster2, cluster3])
labels = np.array([0]*50 + [1]*50 + [2]*50)
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_embedded = tsne.fit_transform(X)
# Visualize
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=labels, cmap='viridis')
plt.colorbar(scatter)
plt.title('t-SNE: 10D → 2D')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()
This example shows how t-SNE preserves local structure: points close together in 10D space remain close in 2D space, while maintaining separation between clusters.
Basic t-SNE Implementation with scikit-learn
Let’s apply t-SNE to the MNIST digits dataset, a classic use case for dimensionality reduction visualization:
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
# Load MNIST digits (8x8 images, 64 dimensions)
digits = load_digits()
X, y = digits.data, digits.target
# Apply t-SNE
tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate=200,
    max_iter=1000,    # named n_iter in scikit-learn < 1.5
    random_state=42
)
X_tsne = tsne.fit_transform(X)
# Create visualization
plt.figure(figsize=(12, 10))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1],
                      c=y, cmap='tab10',
                      alpha=0.6, s=20)
plt.colorbar(scatter, ticks=range(10))
plt.title('MNIST Digits Visualization with t-SNE', fontsize=16)
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.tight_layout()
plt.show()
print(f"Final KL divergence: {tsne.kl_divergence_:.4f}")
Key parameters explained:
- n_components=2: Output dimensions (2 or 3 for visualization)
- perplexity=30: Balances local vs global structure (more on this next)
- learning_rate=200: Step size for gradient descent (10-1000 range)
- max_iter=1000: Number of optimization iterations, minimum 250 (named n_iter before scikit-learn 1.5)
- random_state=42: Ensures reproducibility
Tuning Hyperparameters
Perplexity is the most critical parameter. It roughly represents the number of nearest neighbors considered for each point. Small perplexity (5-15) focuses on very local structure; large perplexity (30-50) captures more global patterns.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
digits = load_digits()
X, y = digits.data, digits.target
# Compare different perplexity values
perplexities = [5, 30, 50]
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for idx, perplexity in enumerate(perplexities):
    tsne = TSNE(n_components=2,
                perplexity=perplexity,
                random_state=42,
                max_iter=1000)  # named n_iter in scikit-learn < 1.5
    X_embedded = tsne.fit_transform(X)
    axes[idx].scatter(X_embedded[:, 0], X_embedded[:, 1],
                      c=y, cmap='tab10', alpha=0.6, s=20)
    axes[idx].set_title(f'Perplexity = {perplexity}\nKL div: {tsne.kl_divergence_:.4f}')
    axes[idx].set_xlabel('Component 1')
    axes[idx].set_ylabel('Component 2')
plt.tight_layout()
plt.show()
Rule of thumb: perplexity should be between 5 and 50, typically around 30. For datasets with fewer than 100 samples, use lower perplexity (5-15). For large datasets (10,000+), experiment with higher values (40-50).
Monitor convergence by checking the KL divergence: it should decrease and then stabilize. If it is still dropping rapidly, increase max_iter (named n_iter before scikit-learn 1.5).
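To see this in practice, compare the final KL divergence at two iteration budgets. The small shim below handles the rename of n_iter to max_iter in newer scikit-learn releases:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:500]  # subset to keep this quick

# The iteration-count parameter was renamed from n_iter to max_iter
# around scikit-learn 1.5; use whichever this installation exposes
iter_param = 'max_iter' if 'max_iter' in TSNE().get_params() else 'n_iter'

for n in (250, 1000):
    tsne = TSNE(n_components=2, random_state=42, **{iter_param: n})
    tsne.fit_transform(X)
    print(f"{n:>4} iterations -> KL divergence {tsne.kl_divergence_:.4f}")
```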
Working with Real-World Datasets
Let’s visualize high-dimensional text embeddings from a document clustering task:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
# Sample documents (replace with your corpus)
documents = [
    "machine learning algorithms for classification",
    "deep neural networks and backpropagation",
    "supervised learning with labeled data",
    "stock market analysis and trading strategies",
    "financial portfolio optimization techniques",
    "investment risk management strategies",
    "natural language processing with transformers",
    "text classification using BERT embeddings",
    "sentiment analysis for customer reviews"
]
# Create TF-IDF embeddings
vectorizer = TfidfVectorizer(max_features=100)
X = vectorizer.fit_transform(documents).toarray()
# Preprocessing: always scale your data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=3, random_state=42)
X_embedded = tsne.fit_transform(X_scaled)
# Visualize with labels
plt.figure(figsize=(10, 8))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], s=100)
for i, doc in enumerate(documents):
    plt.annotate(doc[:30] + '...',
                 (X_embedded[i, 0], X_embedded[i, 1]),
                 fontsize=8, alpha=0.7)
plt.title('Document Clustering Visualization')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.tight_layout()
plt.show()
For real BERT embeddings (768 dimensions), always preprocess:
# Assuming you have BERT embeddings in X_bert (shape: n_samples, 768)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Step 1: Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_bert)
# Step 2: PCA preprocessing (critical for high dimensions)
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_scaled)
# Step 3: t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_pca)
Performance Optimization
t-SNE is computationally expensive: the exact algorithm is O(n²) in the number of samples, and even scikit-learn's default Barnes-Hut approximation (O(n log n)) becomes slow beyond a few thousand points. For datasets over 10,000 samples, use these optimization strategies:
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import time
# Create large dataset
X_large, y_large = make_blobs(n_samples=5000, n_features=100,
                              centers=10, random_state=42)
# Method 1: Direct t-SNE (slow)
start = time.time()
tsne_direct = TSNE(n_components=2, random_state=42)
X_direct = tsne_direct.fit_transform(X_large)
print(f"Direct t-SNE: {time.time() - start:.2f}s")
# Method 2: PCA + t-SNE (faster, recommended)
start = time.time()
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_large)
tsne_pca = TSNE(n_components=2, random_state=42)
X_pca_tsne = tsne_pca.fit_transform(X_pca)
print(f"PCA + t-SNE: {time.time() - start:.2f}s")
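Before trusting the PCA step, it is worth checking how much variance those 50 components actually retain. A quick check on the same synthetic blobs:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Same synthetic data as the timing comparison above
X_large, _ = make_blobs(n_samples=5000, n_features=100,
                        centers=10, random_state=42)

pca = PCA(n_components=50).fit(X_large)
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 50 components: {retained:.1%}")
```

If the retained fraction is low for your data, raise n_components before handing the result to t-SNE.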
For even larger datasets, use optimized implementations:
# Install: pip install openTSNE
from openTSNE import TSNE as OpenTSNE
# OpenTSNE is significantly faster for large datasets
tsne_fast = OpenTSNE(
    n_components=2,
    perplexity=30,
    n_jobs=-1,  # Use all CPU cores
    random_state=42
)
X_fast = tsne_fast.fit(X_large)
Common Pitfalls and Best Practices
Pitfall 1: Non-reproducibility. Always set random_state:
# Bad: different results each run
tsne = TSNE(n_components=2)
# Good: reproducible results
tsne = TSNE(n_components=2, random_state=42)
Pitfall 2: Misinterpreting cluster distances. The distance between clusters in t-SNE plots is meaningless. Only use t-SNE to identify that clusters exist, not to measure relationships between them.
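You can demonstrate this yourself: embedding the same data with two different seeds yields different inter-cluster distances (a small sketch on synthetic blobs).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, y = make_blobs(n_samples=300, n_features=20, centers=3, random_state=0)

def centroid_gaps(seed):
    """Embed X with the given seed and return the sorted pairwise
    distances between the three cluster centroids."""
    emb = TSNE(n_components=2, random_state=seed).fit_transform(X)
    cents = np.array([emb[y == k].mean(axis=0) for k in range(3)])
    d = [np.linalg.norm(cents[i] - cents[j])
         for i in range(3) for j in range(i + 1, 3)]
    return np.sort(d)

gaps_a = centroid_gaps(0)
gaps_b = centroid_gaps(1)
print(gaps_a)
print(gaps_b)  # same data, different seed: different gaps
```

The cluster memberships stay stable across runs, but the gaps between clusters do not, which is exactly why those gaps should never be interpreted.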
Pitfall 3: Using t-SNE for feature engineering. Never use t-SNE components as features for machine learning models. t-SNE is non-deterministic and cannot transform new data.
Best Practice: Interactive visualization for better exploration:
import plotly.express as px
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
digits = load_digits()
X, y = digits.data, digits.target
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
# Create interactive plot
fig = px.scatter(
    x=X_tsne[:, 0],
    y=X_tsne[:, 1],
    color=y.astype(str),
    labels={'x': 't-SNE 1', 'y': 't-SNE 2', 'color': 'Digit'},
    title='Interactive t-SNE Visualization',
    hover_data={'x': ':.2f', 'y': ':.2f'}
)
fig.show()
When NOT to use t-SNE:
- Dataset has three or fewer dimensions (just plot it directly)
- You need to transform test data (use PCA or UMAP instead)
- You need interpretable components (use PCA)
- You have more than 50,000 samples without access to optimized implementations
t-SNE is a powerful visualization tool when used correctly. Combine it with domain knowledge, proper preprocessing, and parameter tuning to unlock insights hidden in your high-dimensional data.