How to Create a Cluster Map in Seaborn
Key Insights
- Cluster maps combine hierarchical clustering dendrograms with heatmaps, automatically grouping similar rows and columns to reveal patterns that standard heatmaps miss
- The sns.clustermap() function provides extensive control over clustering algorithms (ward, average, complete linkage) and distance metrics (euclidean, correlation, cosine) to match your data’s characteristics
- Standardizing your data with z-score normalization before clustering prevents features with larger scales from dominating the distance calculations and producing misleading groupings
Introduction to Cluster Maps
Cluster maps are one of the most powerful visualization tools for exploring multidimensional data. They combine two analytical techniques: hierarchical clustering and heatmaps. While a standard heatmap shows you the raw values, a cluster map goes further by automatically reordering rows and columns based on similarity, placing related items next to each other.
This reordering is crucial. Without it, you might miss patterns in your data simply because similar items happen to be far apart in your original dataset. The dendrograms (tree-like diagrams) on the sides of a cluster map show you exactly how the algorithm grouped your data, providing insight into the hierarchical relationships.
Use cluster maps when you need to identify natural groupings in your data, discover which features behave similarly, or find outliers that don’t cluster with anything else. They’re particularly valuable in genomics (gene expression patterns), marketing (customer segmentation), and any domain where you have many observations across many variables.
Setting Up Your Environment
You’ll need seaborn, pandas, numpy, and matplotlib. Seaborn handles the heavy lifting for clustering and visualization, while the other libraries help with data manipulation and display customization.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Set the style for better-looking plots
sns.set_theme(style="white")
# Load a sample dataset - we'll use the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(
    data=iris.data,
    columns=iris.feature_names
)
# Add species information for later use
iris_df['species'] = pd.Categorical.from_codes(
    iris.target,
    iris.target_names
)
print(iris_df.head())
print(f"Dataset shape: {iris_df.shape}")
The iris dataset contains 150 samples with 4 measurements each, making it perfect for demonstrating cluster maps without overwhelming visual complexity.
Creating a Basic Cluster Map
Creating your first cluster map requires just one function call. Seaborn handles the clustering algorithm, distance calculations, and layout automatically.
# Select only numeric columns for clustering
numeric_data = iris_df.iloc[:, :4]
# Create a basic cluster map
sns.clustermap(numeric_data)
plt.show()
This produces a cluster map with dendrograms on the left (rows) and top (columns). The heatmap in the center uses color intensity to represent values. Notice how the rows have been reordered - samples with similar measurements are now adjacent, even though they were scattered in the original dataset.
The dendrograms show the clustering hierarchy. Items that merge lower (closer to the heatmap) are more similar to each other. Items that only connect near the top are quite different.
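You can also inspect the hierarchy programmatically: sns.clustermap() returns a ClusterGrid object whose dendrogram_row attribute exposes the row ordering and the underlying scipy linkage matrix. A minimal sketch, repeating the setup so it runs on its own:

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()
numeric_data = pd.DataFrame(iris.data, columns=iris.feature_names)

g = sns.clustermap(numeric_data)

# reordered_ind gives the row order used in the heatmap;
# linkage is the scipy linkage matrix (n-1 merges, merge height in column 2)
row_order = g.dendrogram_row.reordered_ind
linkage = g.dendrogram_row.linkage
print(len(row_order))   # 150 rows, reordered
print(linkage.shape)    # (149, 4)
```

The linkage matrix is what you would pass to scipy functions such as fcluster if you later want to cut the tree into flat clusters.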
Customizing Cluster Map Appearance
Default settings rarely produce publication-ready visualizations. Here’s how to customize appearance for clarity and impact.
# Create a customized cluster map
sns.clustermap(
    numeric_data,
    cmap="RdYlBu_r",                    # Red-Yellow-Blue reversed colormap
    figsize=(10, 8),                    # Larger figure size
    dendrogram_ratio=0.15,              # Smaller dendrograms
    cbar_pos=(0.02, 0.8, 0.03, 0.15),   # Colorbar position (x, y, width, height)
    linewidths=0.5,                     # Add gridlines between cells
    linecolor='gray',
    xticklabels=True,                   # Show all column labels
    yticklabels=False                   # Hide row labels (too many)
)
plt.show()
The cmap parameter accepts any matplotlib colormap. Diverging colormaps (like “RdYlBu_r”) work well when your data has a meaningful center point. Sequential colormaps (like “viridis”) work better for data that ranges from low to high without a natural midpoint.
The dendrogram_ratio controls how much space dendrograms occupy. Smaller values (0.1-0.2) give more room to the heatmap itself. The cbar_pos parameter takes a tuple of (x, y, width, height) in figure coordinates to position the colorbar precisely.
Controlling Clustering Behavior
The clustering algorithm and distance metric dramatically affect your results. Different combinations reveal different patterns.
# Compare different linkage methods
# Note: clustermap always draws its own figure, so it cannot be placed
# inside plt.subplots() axes - create one figure per method instead
methods = ['ward', 'average', 'complete']
for method in methods:
    g = sns.clustermap(
        numeric_data,
        method=method,
        metric='euclidean',
        cmap='viridis',
        figsize=(5, 5)
    )
    g.fig.suptitle(f'Linkage: {method}')
plt.show()
# Try different distance metrics
sns.clustermap(
    numeric_data,
    method='ward',
    metric='correlation',  # Use correlation distance
    cmap='coolwarm',
    figsize=(10, 8)
)
plt.show()
Ward linkage minimizes variance within clusters and typically produces the most balanced dendrograms. Average linkage uses the mean distance between all pairs of points in two clusters. Complete linkage uses the maximum distance between any two points, making it sensitive to outliers.
For distance metrics, euclidean works well for continuous data where absolute differences matter. Correlation distance (1 - correlation coefficient) is better when you care about patterns rather than magnitudes - two features with identical shapes but different scales will be close together.
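You can verify this behavior with scipy's pdist (the same distance routine the clustering relies on). In this toy check, two rows with identical shape but a 10x scale difference have a correlation distance of zero, while a reversed row sits at the maximum distance of 2:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Two rows with the same shape but different scales, plus one reversed row
x = np.array([
    [1.0, 2.0, 3.0, 4.0],      # rising
    [10.0, 20.0, 30.0, 40.0],  # same shape, 10x scale
    [4.0, 3.0, 2.0, 1.0],      # falling
])

print(pdist(x, metric="euclidean"))
print(pdist(x, metric="correlation"))
# Euclidean: rows 0 and 1 are far apart because of the scale gap.
# Correlation (1 - r): rows 0 and 1 are at distance 0 (identical shape),
# rows 0 and 2 at distance 2 (perfectly anticorrelated).
```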
Set row_cluster=False or col_cluster=False to disable clustering on specific axes when you want to preserve the original ordering.
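For example, to cluster rows while preserving the original column order (a minimal sketch repeating the iris setup; data2d holds the reordered matrix actually drawn in the heatmap):

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()
numeric_data = pd.DataFrame(iris.data, columns=iris.feature_names)

# Cluster rows only; columns keep their original order
g = sns.clustermap(numeric_data, col_cluster=False)

# With col_cluster=False the heatmap's column order matches the input
print(list(g.data2d.columns) == list(numeric_data.columns))  # True
```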
Advanced Techniques
Real-world applications often require additional context and data preprocessing. Here’s how to add categorical annotations and normalize your data.
# Standardize the data (z-score normalization)
from scipy.stats import zscore
numeric_standardized = numeric_data.apply(zscore)
# Create color mapping for species
species_colors = {
    'setosa': '#e74c3c',
    'versicolor': '#3498db',
    'virginica': '#2ecc71'
}
# Map species to colors for row annotations
row_colors = iris_df['species'].map(species_colors)
# Create cluster map with annotations
g = sns.clustermap(
    numeric_standardized,
    method='ward',
    metric='euclidean',
    cmap='RdBu_r',
    center=0,  # Center colormap at 0 for standardized data
    figsize=(10, 10),
    row_colors=row_colors,
    dendrogram_ratio=0.15,
    cbar_kws={'label': 'Z-score'},
    xticklabels=True,
    yticklabels=False
)
# Add a legend for species colors
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor=color, label=species)
    for species, color in species_colors.items()
]
g.ax_heatmap.legend(
    handles=legend_elements,
    loc='upper left',
    bbox_to_anchor=(1.2, 1),
    frameon=True
)
plt.show()
Standardization is critical when your features have different scales. Without it, features with larger ranges dominate the distance calculations. The z-score transformation ensures each feature contributes equally to clustering.
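A tiny standalone check makes the scale problem visible. In this toy matrix (an illustrative example, not from the iris data), one feature is in the thousands and one is between 0 and 1; before z-scoring, the distances are effectively determined by the large-scale column alone:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import zscore

# Two features: one in the thousands, one between 0 and 1
X = np.array([
    [1000.0, 0.1],
    [1010.0, 0.9],
    [2000.0, 0.1],
])

print(pdist(X))      # dominated by the first column
Xz = zscore(X, axis=0)
print(pdist(Xz))     # after z-scoring, both features contribute comparably
```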
The row_colors parameter adds a colored bar along the side showing categorical information. This lets you see whether your clustering successfully separated known groups. In this example, you’ll notice the three iris species cluster separately, validating that the measurements contain species-specific information.
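Beyond eyeballing the colored bar, you can quantify the agreement by cutting the row dendrogram into three flat clusters with scipy's fcluster and cross-tabulating against the known species. A sketch, repeating the standardization step so it runs on its own:

```python
import pandas as pd
import seaborn as sns
from scipy.stats import zscore
from scipy.cluster.hierarchy import fcluster
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

standardized = df.iloc[:, :4].apply(zscore)
g = sns.clustermap(standardized, method='ward', metric='euclidean')

# Cut the row dendrogram into 3 flat clusters and compare with species
labels = fcluster(g.dendrogram_row.linkage, t=3, criterion='maxclust')
print(pd.crosstab(df['species'], labels))
```

A mostly diagonal cross-tabulation indicates the clustering recovered the species structure.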
Practical Use Cases and Best Practices
Let’s examine a realistic scenario: analyzing customer purchase behavior across product categories.
# Create a synthetic customer-product dataset
np.random.seed(42)
customers = [f'Customer_{i}' for i in range(50)]
products = ['Electronics', 'Clothing', 'Food', 'Books', 'Sports']
# Generate purchase amounts with some patterns
data = np.random.gamma(2, 2, size=(50, 5)) * 100
# Add patterns: some customers prefer certain categories
data[0:15, 0] *= 3 # Tech enthusiasts
data[15:30, 1] *= 3 # Fashion buyers
data[30:45, 2:4] *= 2 # Food & book lovers
purchase_df = pd.DataFrame(data, index=customers, columns=products)
# Create cluster map with proper styling
g = sns.clustermap(
    purchase_df,
    method='ward',
    metric='euclidean',
    cmap='YlOrRd',
    figsize=(10, 12),
    dendrogram_ratio=0.1,
    linewidths=0.5,
    linecolor='white',
    cbar_kws={'label': 'Purchase Amount ($)'},
    xticklabels=True,
    yticklabels=True
)
g.ax_heatmap.set_xlabel('Product Category', fontsize=12)
g.ax_heatmap.set_ylabel('Customer', fontsize=12)
plt.show()
This visualization immediately reveals customer segments. You’ll see clusters of customers with similar purchasing patterns, helping you target marketing campaigns or recommend products.
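To act on those segments, you can extract flat cluster labels from the row dendrogram and summarize spend per segment. A sketch, repeating the data generation so it runs on its own (the segment count of 4 is an illustrative assumption):

```python
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.cluster.hierarchy import fcluster

np.random.seed(42)
customers = [f'Customer_{i}' for i in range(50)]
products = ['Electronics', 'Clothing', 'Food', 'Books', 'Sports']
data = np.random.gamma(2, 2, size=(50, 5)) * 100
data[0:15, 0] *= 3    # Tech enthusiasts
data[15:30, 1] *= 3   # Fashion buyers
data[30:45, 2:4] *= 2 # Food & book lovers
purchase_df = pd.DataFrame(data, index=customers, columns=products)

g = sns.clustermap(purchase_df, method='ward', metric='euclidean')

# Cut the row dendrogram into 4 flat segments and compare average spend
segments = fcluster(g.dendrogram_row.linkage, t=4, criterion='maxclust')
print(purchase_df.groupby(segments).mean().round(0))
```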
Best practices for large datasets:
- Subsample intelligently: For datasets with thousands of rows, cluster on a representative sample first, then assign remaining points to clusters.
- Choose metrics carefully: Correlation distance works well for time series and expression data. Euclidean distance suits spatial or measurement data.
- Standardize when mixing scales: Always normalize when combining features with different units (dollars and percentages, for example).
- Interpret dendrograms: The height where branches merge indicates dissimilarity. Large jumps suggest natural cluster boundaries.
- Validate results: Use domain knowledge to verify that clusters make sense. If they don’t, try different distance metrics or linkage methods.
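The subsampling tip can be sketched as: cluster a manageable sample, then assign every remaining row to the nearest sample-derived centroid. This nearest-centroid assignment is one simple scheme among several, and the dataset size and cluster count below are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 4))  # stand-in for a large dataset

# 1. Hierarchically cluster a random sample
sample_idx = rng.choice(len(data), size=500, replace=False)
sample = data[sample_idx]
Z = linkage(sample, method='ward')
sample_labels = fcluster(Z, t=5, criterion='maxclust')

# 2. Compute cluster centroids from the sample
centroids = np.vstack([
    sample[sample_labels == k].mean(axis=0)
    for k in np.unique(sample_labels)
])

# 3. Assign every point to its nearest centroid
all_labels = cdist(data, centroids).argmin(axis=1) + 1
print(all_labels.shape)  # (5000,)
```

This keeps the expensive O(n^2) linkage step on the sample while still labeling the full dataset.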
Cluster maps excel at exploratory analysis. They help you generate hypotheses about your data structure, identify outliers, and communicate complex patterns visually. Master this tool, and you’ll find applications across virtually every data analysis project.