How to Implement Agglomerative Clustering in Python

Key Insights

  • Agglomerative clustering builds hierarchies bottom-up, making it ideal when you need to explore data at multiple granularity levels or don’t know the optimal cluster count beforehand
  • Linkage method choice dramatically affects results: Ward minimizes variance (compact clusters), complete linkage avoids chaining, while single linkage finds elongated clusters
  • The algorithm’s O(n³) complexity makes it impractical for datasets beyond 10,000 samples without connectivity constraints or approximate methods

Understanding Agglomerative Clustering

Agglomerative clustering takes a bottom-up approach to hierarchical clustering. It starts by treating each data point as its own cluster, then iteratively merges the closest pairs until all points belong to a single cluster. This creates a tree structure (dendrogram) that you can cut at any height to obtain different numbers of clusters.

Use agglomerative clustering when you need flexibility in cluster count, want to visualize hierarchical relationships in your data, or have small to medium datasets (under 10,000 samples). Unlike K-means, which requires specifying cluster count upfront and assumes spherical clusters, agglomerative clustering discovers hierarchical structure and handles arbitrary cluster shapes better.

The tradeoff is computational cost. While K-means runs in O(n·k·d) per iteration, agglomerative clustering requires O(n³) time and O(n²) memory in the naive implementation, making it unsuitable for large datasets without modifications.
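To make the bottom-up mechanics concrete, here is a deliberately naive sketch of my own (illustrative only; scipy and scikit-learn use much faster nearest-neighbor-chain algorithms internally). It merges the closest pair of clusters under single linkage until the target count is reached, and the nested pair scan repeated on every iteration is exactly where the cubic cost comes from:

```python
import numpy as np

def naive_single_linkage(X, n_clusters):
    """Naive O(n^3) agglomerative clustering with single linkage."""
    clusters = [[i] for i in range(len(X))]        # every point starts alone
    # all pairwise point distances, computed once up front
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 0)
        # scan every pair of current clusters; this nested scan,
        # repeated on each of ~n merges, drives the cubic runtime
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: minimum point-to-point distance
                dist = d[np.ix_(clusters[a], clusters[b])].min()
                if dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))        # merge the closest pair
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(naive_single_linkage(X, 2))  # the two tight pairs merge first: [0 0 1 1]
```

Swapping the `.min()` for `.max()` or a mean turns this into complete or average linkage, which is the entire difference between those methods.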

Linkage Methods: The Core Decision

Linkage criteria determine how to measure distance between clusters. This choice fundamentally shapes your results.

Single linkage measures distance as the minimum distance between any two points in different clusters. It tends to create elongated, chain-like clusters and is sensitive to outliers.

Complete linkage uses the maximum distance between points in different clusters. It produces compact, roughly spherical clusters but can break large clusters prematurely.

Average linkage takes the mean distance between all pairs of points across clusters. It balances the extremes of single and complete linkage.

Ward’s method minimizes the within-cluster variance when merging. It strongly prefers creating equal-sized, compact clusters and is the most commonly used method in practice.

Here’s how different linkage methods affect the same dataset:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Create sample data with clear structure
X, _ = make_blobs(n_samples=50, centers=3, n_features=2, 
                  cluster_std=0.5, random_state=42)

linkage_methods = ['single', 'complete', 'average', 'ward']
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

for idx, method in enumerate(linkage_methods):
    ax = axes[idx // 2, idx % 2]
    Z = linkage(X, method=method)
    dendrogram(Z, ax=ax)
    ax.set_title(f'{method.capitalize()} Linkage', fontsize=14)
    ax.set_xlabel('Sample Index')
    ax.set_ylabel('Distance')

plt.tight_layout()
plt.savefig('linkage_comparison.png', dpi=300)
plt.show()

Notice how single linkage creates long chains in the dendrogram, while Ward produces more balanced trees. This visual difference translates directly to cluster shape.
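One way to see this difference on actual cluster assignments (a small experiment of my own, not taken from the dendrograms above) is the classic two-moons dataset: single linkage chains along each crescent and recovers it, while Ward's preference for compact clusters tends to cut across the moons:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

# Two interleaved half-circles: elongated, non-spherical clusters
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

for method in ['single', 'ward']:
    labels = AgglomerativeClustering(n_clusters=2, linkage=method).fit_predict(X)
    # agreement with the true moons, up to an arbitrary label swap
    agreement = max(np.mean(labels == y), np.mean(labels != y))
    print(f"{method}: {agreement:.2f} agreement with true moons")
```

On low-noise data like this, single linkage typically recovers the moons almost perfectly while Ward lands well below; with noisier data, single linkage's chaining becomes a liability instead.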

Basic Implementation with Scikit-learn

Scikit-learn’s AgglomerativeClustering provides a clean interface. Here’s a complete implementation using the Iris dataset:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

# Load and prepare data
iris = load_iris()
X = iris.data
y_true = iris.target

# Scale features - critical for distance-based algorithms
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit agglomerative clustering
agg_clustering = AgglomerativeClustering(
    n_clusters=3,
    linkage='ward',
    metric='euclidean'
)
y_pred = agg_clustering.fit_predict(X_scaled)

# Visualize results (using first two features)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# True labels
scatter1 = ax1.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', 
                       edgecolors='black', s=100)
ax1.set_title('True Labels', fontsize=14)
ax1.set_xlabel('Sepal Length')
ax1.set_ylabel('Sepal Width')

# Predicted clusters
scatter2 = ax2.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis',
                       edgecolors='black', s=100)
ax2.set_title('Agglomerative Clustering', fontsize=14)
ax2.set_xlabel('Sepal Length')
ax2.set_ylabel('Sepal Width')

plt.tight_layout()
plt.savefig('iris_clustering.png', dpi=300)
plt.show()

print(f"Cluster sizes: {np.bincount(y_pred)}")

Always scale your features before clustering. Distance-based algorithms are sensitive to feature magnitude, and unscaled data will weight high-variance features disproportionately.
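As a quick illustration of why (synthetic data of my own, separate from the Iris example above): here the real group structure lives in a tiny-scale feature, and an unscaled high-variance noise feature swamps it completely:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# true groups differ only in the small-scale feature (means 0 vs 5,
# std 0.1); the second feature is pure large-magnitude noise (std 1000)
group_a = np.column_stack([rng.normal(0, 0.1, 50), rng.normal(0, 1000, 50)])
group_b = np.column_stack([rng.normal(5, 0.1, 50), rng.normal(0, 1000, 50)])
X = np.vstack([group_a, group_b])
y = np.array([0] * 50 + [1] * 50)

for name, data in [('raw', X), ('scaled', StandardScaler().fit_transform(X))]:
    labels = AgglomerativeClustering(n_clusters=2, linkage='ward').fit_predict(data)
    # agreement with the true groups, up to an arbitrary label swap
    agreement = max(np.mean(labels == y), np.mean(labels != y))
    print(f"{name}: {agreement:.2f} agreement with true groups")
```

On the raw data the clustering tracks the noise feature and agreement hovers near chance; after standardization the groups separate cleanly.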

Dendrograms: Visualizing Hierarchy

Dendrograms show the complete merge history. The y-axis represents the distance at which clusters merge; large vertical gaps indicate natural cluster boundaries.

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt

# Create linkage matrix
Z = linkage(X_scaled, method='ward')

# Plot dendrogram
plt.figure(figsize=(12, 6))
dendrogram(Z, truncate_mode='lastp', p=30, show_leaf_counts=True)
plt.title('Dendrogram - Ward Linkage', fontsize=16)
plt.xlabel('Cluster Size', fontsize=12)
plt.ylabel('Distance', fontsize=12)

# Add horizontal line to show cut point
plt.axhline(y=7, color='r', linestyle='--', label='Cut at distance=7')
plt.legend()
plt.savefig('dendrogram_cut.png', dpi=300)
plt.show()

# Extract clusters at specific distance
clusters_at_7 = fcluster(Z, t=7, criterion='distance')
print(f"Number of clusters at distance 7: {len(np.unique(clusters_at_7))}")

# Or extract specific number of clusters
clusters_3 = fcluster(Z, t=3, criterion='maxclust')
print(f"Cluster distribution: {np.bincount(clusters_3)}")

Look for the longest vertical lines in the dendrogram; they mark the most significant cluster separations. Place a horizontal cut through these long lines, inside the largest gap between successive merge distances, to read off a natural cluster count.
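That eyeballing step can also be automated from the linkage matrix itself: merge heights live in the third column of Z, and the largest gap between successive merges suggests where to cut. A small sketch of my own on synthetic blobs (the roughly equidistant centers are a deliberate assumption to keep the answer unambiguous):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

# Three well-separated, roughly equidistant blobs
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 0], [5, 8.66]],
                  cluster_std=0.5, random_state=42)
Z = linkage(X, method='ward')

merge_dists = Z[:, 2]            # merge heights, in ascending order
gaps = np.diff(merge_dists)      # gap between consecutive merges
biggest = int(np.argmax(gaps))   # cut inside the largest gap
# cutting there means merges 0..biggest have happened (biggest + 1 merges),
# leaving n - (biggest + 1) clusters
suggested_k = len(X) - (biggest + 1)
print(f"Suggested number of clusters: {suggested_k}")
```

This is only a heuristic: real data often has several comparable gaps, in which case the dendrogram plot remains the better guide.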

Distance Metrics Matter

The metric parameter (named affinity before scikit-learn 1.2; the old name was removed in 1.4) controls distance calculation. Different metrics suit different data types:

from sklearn.metrics import silhouette_score
from sklearn.feature_extraction.text import TfidfVectorizer

# Simulate text clustering scenario
documents = [
    "machine learning algorithms",
    "deep learning neural networks",
    "supervised learning methods",
    "database management systems",
    "sql query optimization",
    "relational database design"
]

# Convert to TF-IDF features
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(documents).toarray()

metrics = ['euclidean', 'manhattan', 'cosine']
results = {}

for metric in metrics:
    # Ward linkage only works with Euclidean distance, so pair each
    # metric with a compatible linkage (scikit-learn >= 1.2 calls this
    # parameter `metric`; the old name `affinity` was removed in 1.4)
    linkage_for = {'euclidean': 'ward', 'manhattan': 'complete',
                   'cosine': 'average'}
    model = AgglomerativeClustering(
        n_clusters=2,
        metric=metric,
        linkage=linkage_for[metric]
    )
    
    labels = model.fit_predict(X_text)
    score = silhouette_score(X_text, labels)
    results[metric] = score
    print(f"{metric.capitalize()}: Silhouette Score = {score:.3f}")
    print(f"  Cluster 0: {np.where(labels == 0)[0]}")
    print(f"  Cluster 1: {np.where(labels == 1)[0]}\n")

Euclidean works for continuous numerical features. Manhattan is robust to outliers. Cosine is essential for high-dimensional sparse data like text, where angle matters more than magnitude.
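A three-line check makes the cosine point tangible (toy vectors of my own): scaling a vector changes its Euclidean distance to the original but leaves its cosine distance untouched, because only the direction matters:

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

# same "topic mix", one document simply three times longer
short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 3 * short_doc
other_topic = np.array([0.0, 0.0, 5.0])

print(euclidean(short_doc, long_doc))   # nonzero: magnitudes differ
print(cosine(short_doc, long_doc))      # ~0: identical direction
print(cosine(short_doc, other_topic))   # 1.0: orthogonal, maximally dissimilar
```

This is why two documents about the same topic cluster together under cosine even when one is much longer than the other.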

Real-World Application: Customer Segmentation

Here’s a complete pipeline for segmenting customers based on purchase behavior:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# Simulate customer data
np.random.seed(42)
n_customers = 200

customer_data = pd.DataFrame({
    'customer_id': range(n_customers),
    'total_purchases': np.random.randint(1, 50, n_customers),
    'avg_order_value': np.random.uniform(20, 200, n_customers),
    'days_since_last_purchase': np.random.randint(1, 365, n_customers),
    'returns': np.random.randint(0, 10, n_customers)
})

# Prepare features
feature_cols = ['total_purchases', 'avg_order_value', 
                'days_since_last_purchase', 'returns']
X = customer_data[feature_cols].values

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Cluster customers
agg_model = AgglomerativeClustering(n_clusters=4, linkage='ward')
customer_data['segment'] = agg_model.fit_predict(X_scaled)

# Analyze segments
segment_summary = customer_data.groupby('segment')[feature_cols].agg([
    'mean', 'median', 'count'
])

print("Customer Segment Characteristics:")
print(segment_summary.round(2))

# Create business-friendly labels
def label_segment(row):
    if row['total_purchases'] > 30 and row['days_since_last_purchase'] < 60:
        return 'High Value Active'
    elif row['total_purchases'] < 10 and row['days_since_last_purchase'] > 180:
        return 'At Risk'
    elif row['avg_order_value'] > 150:
        return 'Premium'
    else:
        return 'Standard'

segment_profiles = customer_data.groupby('segment').apply(
    lambda x: label_segment(x[feature_cols].mean())
)
print("\nSegment Labels:")
print(segment_profiles)

This approach lets you identify distinct customer groups for targeted marketing. Always interpret clusters in business context—statistical separation doesn’t guarantee actionable insights.

Performance Considerations

Agglomerative clustering doesn’t scale well. Here’s the reality:

import time
from sklearn.datasets import make_blobs

sample_sizes = [100, 500, 1000, 2000]
times = []

for n in sample_sizes:
    X, _ = make_blobs(n_samples=n, n_features=10, random_state=42)
    
    start = time.time()
    AgglomerativeClustering(n_clusters=5).fit(X)
    elapsed = time.time() - start
    
    times.append(elapsed)
    print(f"n={n}: {elapsed:.3f} seconds")

# Runtime grows roughly cubically with sample size in the worst case;
# exact timings depend on hardware, but expect minutes rather than
# seconds by the time n reaches the tens of thousands

For large datasets, use connectivity constraints to impose structure:

from sklearn.neighbors import kneighbors_graph

# Limit merges to spatially nearby points
X, _ = make_blobs(n_samples=1000, n_features=2, random_state=42)

# Create connectivity graph (only nearby points can merge)
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

model = AgglomerativeClustering(
    n_clusters=3,
    connectivity=connectivity,
    linkage='ward'
)
labels = model.fit_predict(X)

Connectivity constraints reduce complexity and enforce spatial coherence. Use them for image segmentation or geographic clustering where nearby points should cluster together.

Best practices: Scale features, start with Ward linkage, use dendrograms to determine cluster count, validate with silhouette scores, and switch to K-means or HDBSCAN for datasets over 10,000 samples. Agglomerative clustering excels at revealing hierarchical structure in moderately sized datasets where interpretability matters.
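To close the loop on validation, here is a silhouette sweep over candidate cluster counts (my own synthetic example; with real data the peak is rarely this clean). The score rewards samples that sit tight within their own cluster relative to the nearest other cluster:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# four well-separated blobs, so the "right" answer is unambiguous
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                  cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 8):
    labels = AgglomerativeClustering(n_clusters=k, linkage='ward').fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

Combine this sweep with the dendrogram: when both point at the same k, you can trust the result far more than either signal alone.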
