How to Implement Hierarchical Clustering in R
Key Insights
- Hierarchical clustering builds tree-like structures (dendrograms) that reveal data relationships at multiple granularities, making it superior to k-means when you need to understand nested groupings or don’t know the cluster count beforehand.
- The choice of linkage method (Ward’s, complete, average, single) dramatically affects results—Ward’s minimizes within-cluster variance and typically produces the most balanced clusters for general use.
- Always scale your features before clustering and validate results with silhouette scores; hierarchical clustering is computationally expensive (O(n³)), so consider alternatives for datasets with more than 10,000 observations.
Introduction to Hierarchical Clustering
Hierarchical clustering creates a tree of clusters rather than forcing you to specify the number of groups upfront. Unlike k-means, which requires you to choose k beforehand and can get stuck in local optima, hierarchical methods produce a dendrogram showing relationships at every level of granularity.
There are two approaches: agglomerative (bottom-up) starts with each observation as its own cluster and progressively merges similar ones, while divisive (top-down) begins with everything in one cluster and recursively splits. Agglomerative is far more common and what R’s built-in functions implement.
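Although base R's hclust() only does agglomerative clustering, the divisive approach is available through the cluster package's diana() function. A minimal sketch (the dataset and scaling match the workflow used later in this article):

```r
library(cluster)

# Divisive (top-down) clustering with DIANA on the scaled USArrests data
data("USArrests")
dv <- diana(scale(USArrests), metric = "euclidean")

# Divisive coefficient: values closer to 1 suggest stronger cluster structure
dv$dc

# which.plots = 2 draws the dendrogram (1 would draw the banner plot)
plot(dv, which.plots = 2, main = "DIANA Dendrogram")
```

In practice the agglomerative results shown below are usually sufficient, but diana() is useful as a cross-check when you suspect the top-level splits matter most.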
Use hierarchical clustering when you need to understand the nested structure of your data, when the number of clusters isn’t obvious, or when you’re working with smaller datasets (under 10,000 rows). For large datasets or when you know exactly how many clusters you want, k-means is faster. For non-spherical clusters, consider DBSCAN instead.
Preparing Your Data
Data preprocessing determines whether your clustering results will be meaningful or garbage. The two critical steps are handling missing values and scaling features.
# Load necessary libraries
library(cluster)
library(factoextra)
# Load the USArrests dataset
data("USArrests")
df <- USArrests
# Check the structure and summary
str(df)
summary(df)
# Check for missing values
sum(is.na(df))
If you have missing values, either remove those rows with na.omit() or impute them using the mean, median, or more sophisticated methods like k-NN imputation. For clustering, I prefer removing incomplete cases unless you’re losing too much data.
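Both options can be sketched in a few lines. USArrests has no missing values, so this example introduces one artificially just to illustrate the mechanics:

```r
data("USArrests")
df_demo <- USArrests
df_demo[1, "Murder"] <- NA  # introduce a missing value for illustration only

# Option 1: drop incomplete rows
df_complete <- na.omit(df_demo)

# Option 2: median imputation (more robust to outliers than the mean)
df_imputed <- df_demo
df_imputed$Murder[is.na(df_imputed$Murder)] <-
  median(df_imputed$Murder, na.rm = TRUE)
```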
Scaling is non-negotiable. Features with larger ranges will dominate distance calculations. If one variable ranges from 0-1000 and another from 0-1, the algorithm will essentially ignore the smaller variable.
# Scale the data - critical for clustering
df_scaled <- scale(df)
# Verify scaling worked (mean ≈ 0, sd ≈ 1)
colMeans(df_scaled)
apply(df_scaled, 2, sd)
# Convert back to data frame for easier handling
df_scaled <- as.data.frame(df_scaled)
The scale() function standardizes each column to have mean 0 and standard deviation 1. This puts all variables on equal footing.
Computing Distance Matrices
Hierarchical clustering requires a distance matrix showing how dissimilar each pair of observations is. The distance metric you choose affects your results significantly.
Euclidean distance works well for continuous variables when the straight-line distance makes sense. Manhattan distance (sum of absolute differences) is more robust to outliers. Correlation-based distance works when you care about the shape of profiles rather than their magnitude—common in gene expression analysis.
# Calculate distance matrices with different metrics
dist_euclidean <- dist(df_scaled, method = "euclidean")
dist_manhattan <- dist(df_scaled, method = "manhattan")
# For correlation-based distance
dist_correlation <- as.dist(1 - cor(t(df_scaled)))
# Examine the distance matrix structure
as.matrix(dist_euclidean)[1:5, 1:5]
# Visualize the distance matrix as a heatmap
heatmap(as.matrix(dist_euclidean),
        main = "Distance Matrix Heatmap",
        symm = TRUE)
The heatmap visualization helps you spot natural groupings before clustering. Blocks of similar colors suggest observations that cluster together.
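If you have factoextra loaded (as in the setup above), its fviz_dist() function produces a similar heatmap with observations reordered so that blocks of similar distances stand out more clearly:

```r
library(factoextra)

# Ordered distance-matrix heatmap; the gradient colors are a stylistic choice
fviz_dist(dist_euclidean,
          gradient = list(low = "blue", mid = "white", high = "red"))
```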
Building the Hierarchical Cluster
The hclust() function performs agglomerative clustering. The linkage method determines how cluster distances are calculated when merging groups.
- Single linkage: Distance between closest points (tends to create long chains)
- Complete linkage: Distance between farthest points (creates compact clusters)
- Average linkage: Average distance between all pairs
- Ward’s method: Minimizes within-cluster variance (usually best for balanced clusters)
# Apply different linkage methods
hc_complete <- hclust(dist_euclidean, method = "complete")
hc_average <- hclust(dist_euclidean, method = "average")
hc_single <- hclust(dist_euclidean, method = "single")
hc_ward <- hclust(dist_euclidean, method = "ward.D2")
# Compare cluster heights (larger jumps indicate natural breaks)
tail(hc_ward$height)
Ward’s method (ward.D2) is my default choice. It produces the most interpretable, balanced clusters for most applications. Single linkage tends to create straggly, unbalanced clusters. Complete and average are reasonable alternatives.
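One way to compare linkage methods quantitatively is the agglomerative coefficient from the cluster package's agnes() function, which measures how strongly clustered the data are under each method (values closer to 1 indicate stronger structure). A quick sketch using the scaled data from above:

```r
library(cluster)

# Agglomerative coefficient for each linkage method; note agnes() uses
# "ward" rather than hclust()'s "ward.D2" spelling
methods <- c("single", "complete", "average", "ward")
sapply(methods, function(m) agnes(df_scaled, method = m)$ac)
```

Ward's method typically scores highest on this metric, which is consistent with the recommendation above, but the coefficient is a heuristic, not a substitute for inspecting the dendrogram.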
Visualizing Dendrograms
Dendrograms are the primary tool for interpreting hierarchical clustering results. The height of each branch point shows the distance at which clusters merged.
# Basic dendrogram
plot(hc_ward,
     main = "Dendrogram - Ward's Method",
     xlab = "States",
     ylab = "Height",
     cex = 0.7)
# Add rectangles around clusters
rect.hclust(hc_ward, k = 4, border = 2:5)
# Using dendextend for better visualizations
library(dendextend)
dend <- as.dendrogram(hc_ward)
dend <- color_branches(dend, k = 4)
dend <- color_labels(dend, k = 4)
plot(dend, main = "Colored Dendrogram (4 Clusters)")
The dendextend package provides powerful customization options. Coloring branches by cluster assignment makes the structure immediately clear. Look for large vertical jumps in the dendrogram—these indicate natural separations in your data.
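dendextend can also quantify how much two linkage methods agree. The cophenetic correlation between two dendrograms (1 = identical merge structure) is a quick sanity check, sketched here with the Ward and complete-linkage trees built earlier:

```r
library(dendextend)

# Cophenetic correlation between the Ward and complete-linkage trees;
# values near 1 mean the two methods tell broadly the same story
dend_ward <- as.dendrogram(hc_ward)
dend_complete <- as.dendrogram(hc_complete)
cor_cophenetic(dend_ward, dend_complete)
```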
Determining Optimal Clusters and Cutting the Tree
Deciding where to cut the dendrogram is part art, part science. Three approaches work well in practice.
First, visual inspection of the dendrogram. Look for the largest vertical distances without horizontal lines crossing them. This suggests natural groupings.
Second, the elbow method plots within-cluster sum of squares against the number of clusters. Choose the point where adding more clusters shows diminishing returns.
Third, silhouette analysis measures how well each observation fits its assigned cluster versus other clusters. Higher average silhouette scores indicate better clustering.
# Cut tree to create cluster assignments
clusters_4 <- cutree(hc_ward, k = 4)
clusters_3 <- cutree(hc_ward, k = 3)
# Add cluster assignments to original data
df$cluster_4 <- clusters_4
# Silhouette analysis
sil_4 <- silhouette(clusters_4, dist_euclidean)
sil_3 <- silhouette(clusters_3, dist_euclidean)
# Plot silhouette scores
fviz_silhouette(sil_4)
# Average silhouette width
mean(sil_4[, 3])
mean(sil_3[, 3])
# Elbow method using within-cluster sum of squares
wss <- sapply(2:10, function(k) {
  clusters <- cutree(hc_ward, k = k)
  sum(sapply(unique(clusters), function(cl) {
    members <- df_scaled[clusters == cl, , drop = FALSE]
    # Within-cluster sum of squared deviations from the cluster centroid
    sum(scale(members, center = TRUE, scale = FALSE)^2)
  }))
})
plot(2:10, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within-Cluster Sum of Squares",
     main = "Elbow Method")
Silhouette scores range from -1 to 1. Values above 0.5 indicate good clustering, while negative values suggest observations are in the wrong cluster. If your average silhouette is below 0.25, reconsider whether hierarchical clustering is appropriate for your data.
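If you prefer not to compute these diagnostics by hand, factoextra's fviz_nbclust() automates both the silhouette and elbow comparisons for hierarchical clustering via its hcut() wrapper:

```r
library(factoextra)

# Average silhouette width for k = 1..10 (peak suggests the best k)
fviz_nbclust(df_scaled, hcut, method = "silhouette", k.max = 10)

# Within-cluster sum of squares (look for the elbow)
fviz_nbclust(df_scaled, hcut, method = "wss", k.max = 10)
```

Note that hcut() defaults may differ slightly from the manual workflow above (it controls its own linkage and distance settings), so treat the plots as a guide rather than an exact replication.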
Practical Application and Validation
Let’s apply everything to the USArrests dataset, which contains crime statistics for US states. We’ll identify groups of states with similar crime profiles.
# Complete workflow
data("USArrests")
df <- USArrests
# Preprocess
df_scaled <- scale(df)
# Compute distance and cluster
dist_matrix <- dist(df_scaled, method = "euclidean")
hc <- hclust(dist_matrix, method = "ward.D2")
# Determine optimal clusters (let's use 4 based on dendrogram)
clusters <- cutree(hc, k = 4)
# Validate with silhouette
sil <- silhouette(clusters, dist_matrix)
cat("Average silhouette width:", mean(sil[, 3]), "\n")
# Add clusters to original data
df$cluster <- clusters
# Examine cluster characteristics
aggregate(df[, 1:4], by = list(Cluster = df$cluster), mean)
# Visualize with PCA
pca <- prcomp(df_scaled)
pca_df <- as.data.frame(pca$x[, 1:2])
pca_df$cluster <- as.factor(clusters)
library(ggplot2)
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster, label = rownames(pca_df))) +
  geom_point(size = 3) +
  geom_text(vjust = -0.5, size = 3) +
  theme_minimal() +
  labs(title = "Hierarchical Clustering Results (PCA Projection)",
       color = "Cluster")
# List the states in each cluster (use the cluster means above to
# decide which label, e.g. "low crime" or "high crime", each deserves)
rownames(df)[df$cluster == 1]
rownames(df)[df$cluster == 4]
The cluster means reveal distinct profiles. One cluster might represent low-crime rural states, another high-crime urban states, and others falling in between. The PCA visualization shows separation between clusters in two-dimensional space—good separation validates that your clusters are meaningful.
Always interpret clusters in the context of your domain knowledge. Statistical validity doesn’t guarantee practical utility. Ask whether the groupings make sense and provide actionable insights for your specific problem.
For production applications, document your preprocessing steps, distance metric, linkage method, and the rationale for your chosen number of clusters. Hierarchical clustering involves many decisions, and reproducibility requires recording them all.
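One lightweight way to record those decisions is to bundle them with the fitted model object. This is just a sketch of one possible convention; the list fields and file name are illustrative, not a standard:

```r
# Sketch: save the clustering decisions alongside the model for reproducibility
cluster_run <- list(
  preprocessing = "scale() to mean 0, sd 1",
  distance      = "euclidean",
  linkage       = "ward.D2",
  k             = 4,
  model         = hc,
  session       = sessionInfo()  # records R and package versions
)
saveRDS(cluster_run, "hclust_usarrests.rds")

# Later: restore the run and re-derive the cluster assignments
run <- readRDS("hclust_usarrests.rds")
cutree(run$model, k = run$k)
```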