How to Implement Hierarchical Clustering in Python

Key Insights

  • Hierarchical clustering builds a tree of clusters without requiring you to specify the number of clusters upfront, making it ideal for exploratory data analysis where cluster count is unknown
  • The choice of linkage method (Ward, complete, average, single) dramatically affects cluster shape and quality; Ward’s method generally produces the most balanced clusters and is a sensible default
  • While hierarchical clustering scales poorly (O(n²) memory, and O(n³) time for the naive algorithm), it provides interpretable dendrograms that reveal data structure at multiple granularities simultaneously

Introduction to Hierarchical Clustering

Hierarchical clustering builds a tree-like structure of nested clusters, offering a significant advantage over K-means: you don’t need to specify the number of clusters beforehand. Instead, you get a complete hierarchy that you can cut at any level to obtain your desired number of clusters.

There are two approaches: agglomerative (bottom-up) starts with each point as its own cluster and merges similar ones, while divisive (top-down) starts with one cluster and splits it. Agglomerative is far more common and what we’ll focus on.
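The agglomerative loop is simple enough to sketch directly. Below is a minimal pure-NumPy version using single linkage; it is naive O(n³) and for illustration only (real libraries such as SciPy use much faster algorithms):

```python
import numpy as np

def naive_agglomerative(X, n_clusters):
    """Illustrative bottom-up merging with single linkage (naive O(n^3))."""
    X = np.asarray(X)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of points
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters.pop(b)  # merge the two closest clusters
    return clusters
```

Each iteration scans every pair of clusters, merges the closest two, and repeats until the requested number of clusters remains.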

This technique excels in customer segmentation where you want to understand relationships between groups, document clustering where hierarchical topic structures matter, and bioinformatics for building taxonomies from gene expression data.

Let’s compare hierarchical clustering with K-means on the same dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

# Generate sample data with clear clusters
X, y_true = make_blobs(n_samples=300, centers=4, n_features=2, 
                       cluster_std=0.60, random_state=42)

# K-means clustering (n_init set explicitly for stable results across sklearn versions)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=4, linkage='ward')
y_hierarchical = hierarchical.fit_predict(X)

# Visualize both
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis')
axes[0].set_title('K-Means Clustering')
axes[1].scatter(X[:, 0], X[:, 1], c=y_hierarchical, cmap='viridis')
axes[1].set_title('Hierarchical Clustering')
plt.tight_layout()
plt.show()

Understanding Distance Metrics and Linkage Methods

The distance metric determines how similarity between points is calculated. Euclidean distance works well for most numerical data, Manhattan distance is less sensitive to outliers, and cosine distance suits high-dimensional data such as text, where direction matters more than magnitude.
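A quick way to see the difference between these metrics is to compute them on a tiny example (note that SciPy calls Manhattan distance 'cityblock'); the first and third points below share a direction but differ greatly in magnitude, so their cosine distance is zero while their Euclidean distance is large:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Three illustrative points: rows 0 and 2 point the same way, row 1 is orthogonal
points = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 0.0]])

# pdist returns condensed pairwise distances in order (0,1), (0,2), (1,2)
for metric in ('euclidean', 'cityblock', 'cosine'):
    print(metric, pdist(points, metric=metric).round(3))
```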

More critical is the linkage method, which defines how distance between clusters (not just points) is calculated:

  • Single linkage: Distance between closest points in clusters (creates long chains)
  • Complete linkage: Distance between farthest points (creates compact clusters)
  • Average linkage: Average distance between all point pairs (balanced approach)
  • Ward’s method: Minimizes within-cluster variance (produces even-sized clusters)
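The first three definitions can be checked by hand on two tiny clusters (an illustrative sketch; Ward's criterion is variance-based rather than a simple pairwise distance, so it isn't shown here):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny, made-up clusters of 2D points
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

D = cdist(A, B)  # all pairwise distances between points of A and points of B
print("single  :", D.min())   # closest pair
print("complete:", D.max())   # farthest pair
print("average :", D.mean())  # mean over all pairs
```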

Ward’s method is the default choice for most applications because it creates well-balanced, spherical clusters. Here’s how different linkage methods affect the same dataset:

from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Create data with elongated clusters
np.random.seed(42)
X = np.concatenate([
    np.random.randn(50, 2) * [2, 0.5] + [0, 0],
    np.random.randn(50, 2) * [2, 0.5] + [6, 3]
])

linkage_methods = ['single', 'complete', 'average', 'ward']
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

for idx, method in enumerate(linkage_methods):
    Z = linkage(X, method=method)
    
    # Create dendrogram
    ax = axes[idx // 2, idx % 2]
    dendrogram(Z, ax=ax, no_labels=True)
    ax.set_title(f'{method.capitalize()} Linkage')
    ax.set_xlabel('Sample Index')
    ax.set_ylabel('Distance')

plt.tight_layout()
plt.show()

Implementing Hierarchical Clustering with SciPy

SciPy provides low-level control over hierarchical clustering through scipy.cluster.hierarchy. The workflow involves computing a linkage matrix that encodes the hierarchical structure, then visualizing it with a dendrogram.

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import pandas as pd

# Load and prepare data
iris = load_iris()
X = iris.data
feature_names = iris.feature_names

# Always scale your features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compute linkage matrix using Ward's method
Z = linkage(X_scaled, method='ward', metric='euclidean')

# Create dendrogram
plt.figure(figsize=(12, 6))
dendrogram(Z, 
           labels=iris.target,
           leaf_font_size=8,
           color_threshold=7.5)
plt.title('Hierarchical Clustering Dendrogram (Iris Dataset)')
plt.xlabel('True class label (leaf)')
plt.ylabel('Distance')
plt.axhline(y=7.5, color='r', linestyle='--', label='Cut threshold')
plt.legend()
plt.show()

# The linkage matrix Z has shape (n-1, 4) where each row is:
# [cluster_1, cluster_2, distance, sample_count]
print(f"Linkage matrix shape: {Z.shape}")
print(f"Last 5 merges:\n{Z[-5:]}")
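Besides cutting at a distance threshold, fcluster can also be asked for an exact number of flat clusters via criterion='maxclust'. A self-contained sketch on synthetic data (two well-separated blobs, assumed for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data: two clearly separated blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 0.3, (10, 2)),
                    rng.normal(3, 0.3, (10, 2))])

Z_demo = linkage(X_demo, method='ward')

# criterion='maxclust' cuts the tree to produce exactly t flat clusters
demo_labels = fcluster(Z_demo, t=2, criterion='maxclust')
print(demo_labels)
```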

Using Scikit-learn’s AgglomerativeClustering

Scikit-learn’s AgglomerativeClustering provides a cleaner API that integrates with the sklearn ecosystem. Use it when you want a specific number of clusters and a pipeline-friendly interface; note that it has no predict method for unseen data, since the hierarchy is built only over the points it was fit on.

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Fit hierarchical clustering
n_clusters = 3
agg_clustering = AgglomerativeClustering(
    n_clusters=n_clusters,
    linkage='ward'  # Ward linkage always uses Euclidean distance
)

labels = agg_clustering.fit_predict(X_scaled)

# Evaluate clustering quality
silhouette_avg = silhouette_score(X_scaled, labels)
davies_bouldin = davies_bouldin_score(X_scaled, labels)

print(f"Silhouette Score: {silhouette_avg:.3f}")
print(f"Davies-Bouldin Index: {davies_bouldin:.3f}")

# Visualize results (using PCA for 2D projection)
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.colorbar(scatter)
plt.title(f'Hierarchical Clustering Results (k={n_clusters})')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
plt.show()

Determining Optimal Number of Clusters

Unlike K-means where the elbow method is standard, hierarchical clustering offers visual guidance through dendrograms. Look for large vertical gaps—these indicate natural cutting points.

Quantitatively, use silhouette analysis or the cophenetic correlation coefficient, which measures how faithfully the dendrogram preserves pairwise distances:

from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# Compute the cophenetic correlation coefficient
original_distances = pdist(X_scaled)
Z = linkage(X_scaled, method='ward')
coph_corr, coph_dists = cophenet(Z, original_distances)
print(f"Cophenetic correlation: {coph_corr:.3f}")

# Silhouette analysis for different cluster counts
silhouette_scores = []
cluster_range = range(2, 11)

for n_clusters in cluster_range:
    clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
    labels = clusterer.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores.append(score)

# Plot silhouette scores
plt.figure(figsize=(10, 5))
plt.plot(cluster_range, silhouette_scores, 'bo-')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis for Optimal Clusters')
plt.grid(True)
plt.show()

# Cut dendrogram at specific distance
distance_threshold = 7.5
clusters = fcluster(Z, distance_threshold, criterion='distance')
print(f"Cutting at distance {distance_threshold} produces {len(np.unique(clusters))} clusters")

Real-World Application: Customer Segmentation

Let’s build a complete customer segmentation pipeline. We’ll create synthetic customer data and cluster based on purchasing behavior:

# Generate synthetic customer data
np.random.seed(42)
n_customers = 500

customer_data = pd.DataFrame({
    'annual_income': np.random.normal(50000, 20000, n_customers),
    'spending_score': np.random.randint(1, 100, n_customers),
    'age': np.random.randint(18, 70, n_customers),
    'purchase_frequency': np.random.poisson(10, n_customers),
    'avg_transaction': np.random.gamma(2, 50, n_customers)
})

# Feature engineering
customer_data['income_per_age'] = customer_data['annual_income'] / customer_data['age']
customer_data['total_annual_spend'] = (customer_data['purchase_frequency'] * 
                                       customer_data['avg_transaction'])

# Prepare features
features = ['annual_income', 'spending_score', 'age', 
            'purchase_frequency', 'total_annual_spend']
X = customer_data[features].values

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform hierarchical clustering
Z = linkage(X_scaled, method='ward')

# Determine optimal clusters using dendrogram
plt.figure(figsize=(12, 6))
dendrogram(Z, no_labels=True, color_threshold=15)
plt.axhline(y=15, color='r', linestyle='--')
plt.title('Customer Segmentation Dendrogram')
plt.show()

# Apply clustering
n_clusters = 4
clustering = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
customer_data['cluster'] = clustering.fit_predict(X_scaled)

# Analyze clusters
cluster_summary = customer_data.groupby('cluster')[features].mean()
print("\nCluster Characteristics:")
print(cluster_summary.round(2))

# Visualize with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(10, 6))
for cluster in range(n_clusters):
    mask = customer_data['cluster'] == cluster
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], 
                label=f'Cluster {cluster}', alpha=0.6)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
plt.title('Customer Segments')
plt.legend()
plt.show()

Performance Considerations and Best Practices

Hierarchical clustering has O(n²) space complexity and O(n³) time complexity for the naive implementation, making it impractical for datasets beyond 10,000 samples. SciPy uses optimized algorithms that reduce time to O(n²), but memory remains a constraint.

For large datasets, consider these strategies:

  1. Sample your data: Cluster a representative sample, then assign remaining points
  2. Use mini-batch approaches: Divide data into batches, cluster separately, then merge
  3. Switch algorithms: For >10k samples, consider DBSCAN or HDBSCAN instead

import time

# Benchmark different dataset sizes
sizes = [100, 500, 1000, 2000]
times = []

for size in sizes:
    X_test = np.random.randn(size, 10)
    
    start = time.time()
    Z = linkage(X_test, method='ward')
    elapsed = time.time() - start
    times.append(elapsed)
    
    print(f"Size {size}: {elapsed:.3f} seconds")

# Plot scaling behavior
plt.figure(figsize=(8, 5))
plt.plot(sizes, times, 'ro-')
plt.xlabel('Dataset Size')
plt.ylabel('Time (seconds)')
plt.title('Hierarchical Clustering Scaling')
plt.grid(True)
plt.show()
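Strategy 1 above (sample, cluster, then assign the rest) can be sketched as follows. NearestCentroid is used here for the assignment step, which is one reasonable choice among several; the data and sample size are illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import NearestCentroid

# Synthetic "large" dataset: three well-separated blobs
rng = np.random.default_rng(42)
X_large = np.vstack([rng.normal(c, 0.5, (5000, 2))
                     for c in ([0, 0], [5, 5], [0, 5])])

# 1) Hierarchically cluster a manageable random sample
idx = rng.choice(len(X_large), size=1000, replace=False)
sample_labels = AgglomerativeClustering(
    n_clusters=3, linkage='ward').fit_predict(X_large[idx])

# 2) Assign every point (including unsampled ones) to the
#    nearest centroid of the sample clusters
nc = NearestCentroid().fit(X_large[idx], sample_labels)
full_labels = nc.predict(X_large)
```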

Best practices summary:

  • Always standardize your features before clustering
  • Use Ward’s linkage for general-purpose clustering
  • Validate results with silhouette scores and domain knowledge
  • Keep datasets under 5,000 samples for interactive analysis
  • Save your linkage matrix—it’s expensive to recompute
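Saving the linkage matrix is straightforward, since it is a plain NumPy array (a minimal sketch on synthetic data; the file path is illustrative):

```python
import os
import tempfile
import numpy as np
from scipy.cluster.hierarchy import linkage

X_small = np.random.default_rng(0).normal(size=(100, 4))
Z_demo = linkage(X_small, method='ward')

# Persist the linkage matrix (the expensive step) for later reuse
path = os.path.join(tempfile.gettempdir(), 'linkage_ward.npy')
np.save(path, Z_demo)
Z_reloaded = np.load(path)
print(np.array_equal(Z_demo, Z_reloaded))  # True
```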

Hierarchical clustering shines when you need interpretable cluster relationships and don’t know the optimal cluster count upfront. The dendrogram provides intuition about data structure that flat clustering methods simply can’t match.
