K-Means Clustering: Complete Guide with Examples

Key Insights

  • K-Means iteratively assigns data points to the nearest centroid and recalculates centroids until convergence, making it fast and scalable with O(n·k·i·d) complexity where n is samples, k is clusters, i is iterations, and d is dimensions
  • The algorithm’s main weakness is sensitivity to initialization and assumption of spherical clusters—use K-Means++ initialization and scale your features to avoid poor results
  • Choosing k requires the Elbow Method or Silhouette analysis; there’s no single “correct” answer, and domain knowledge should guide your final decision

Introduction to K-Means Clustering

K-Means is the workhorse of unsupervised learning. It’s simple, fast, and effective for partitioning data into distinct groups without labeled training data. Unlike classification algorithms that learn from examples, K-Means discovers hidden patterns by grouping similar data points together.

The algorithm shines in three main areas: customer segmentation (grouping users by behavior), image compression (reducing colors while maintaining visual quality), and anomaly detection (identifying outliers that don’t fit any cluster). E-commerce companies use it to create targeted marketing campaigns, while image processing applications leverage it for efficient storage.

Choose K-Means when you have numerical data, roughly spherical clusters, and need fast results. Skip it when clusters form nested hierarchies (use hierarchical clustering) or have vastly different densities or irregular shapes (use DBSCAN). K-Means assumes all clusters are similarly sized and similarly dense, a critical limitation.

How K-Means Works: The Algorithm Explained

K-Means follows four steps repeatedly until convergence:

  1. Initialization: Select k random points as initial centroids
  2. Assignment: Assign each data point to its nearest centroid
  3. Update: Recalculate centroids as the mean of assigned points
  4. Convergence: Repeat steps 2-3 until centroids stop moving significantly

The centroid is literally the center of mass—the average position of all points in a cluster. When you reassign points based on new centroids, some points switch clusters, which shifts centroids again. This continues until the system stabilizes.

Time complexity is O(n·k·i·d) where n is the number of samples, k is clusters, i is iterations, and d is dimensions. This scales linearly with data size, making K-Means suitable for large datasets. Most implementations converge in fewer than 100 iterations.

import numpy as np
import matplotlib.pyplot as plt

# Generate sample data
np.random.seed(42)
X = np.vstack([
    np.random.randn(100, 2) + [2, 2],
    np.random.randn(100, 2) + [-2, -2],
    np.random.randn(100, 2) + [2, -2]
])

# Visualize K-Means iterations
def plot_iteration(X, centroids, labels, iteration):
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
    plt.scatter(centroids[:, 0], centroids[:, 1], 
                c='red', marker='X', s=200, edgecolors='black')
    plt.title(f'Iteration {iteration}')
    plt.show()

# Initialize centroids randomly
k = 3
centroids = X[np.random.choice(X.shape[0], k, replace=False)]

# Show first iteration
distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
labels = np.argmin(distances, axis=0)
plot_iteration(X, centroids, labels, 0)

Implementing K-Means from Scratch

Building K-Means yourself reveals how simple yet powerful the algorithm is. Here’s a complete implementation:

import numpy as np

class KMeans:
    def __init__(self, n_clusters=3, max_iters=100, tol=1e-4):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.tol = tol
        self.centroids = None
        
    def fit(self, X):
        # Initialize centroids randomly
        idx = np.random.choice(len(X), self.n_clusters, replace=False)
        self.centroids = X[idx].copy()
        
        for i in range(self.max_iters):
            # Assign points to nearest centroid
            labels = self._assign_clusters(X)
            
            # Calculate new centroids; if a cluster ends up empty,
            # keep its old centroid (the mean of zero points is undefined)
            new_centroids = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k)
                else self.centroids[k]
                for k in range(self.n_clusters)
            ])
            
            # Check convergence
            if np.all(np.abs(new_centroids - self.centroids) < self.tol):
                break
                
            self.centroids = new_centroids
            
        return self
    
    def _assign_clusters(self, X):
        # Euclidean distance to each centroid
        distances = np.sqrt(((X - self.centroids[:, np.newaxis])**2).sum(axis=2))
        return np.argmin(distances, axis=0)
    
    def predict(self, X):
        return self._assign_clusters(X)
    
    def fit_predict(self, X):
        self.fit(X)
        return self.predict(X)

# Test implementation
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X)
print(f"Final centroids:\n{kmeans.centroids}")

This implementation uses Euclidean distance, the standard choice for K-Means. Swapping in Manhattan distance (sum of absolute differences) effectively gives you K-Medians, which updates each centroid to the per-dimension median instead of the mean. The convergence criterion checks whether every centroid moves less than a tolerance threshold.
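
To make the Manhattan-distance variant concrete, here is a minimal K-Medians iteration (a hypothetical sketch, not part of the implementation above): the squared-difference distance is swapped for absolute differences, and the mean update for a median.

```python
import numpy as np

def kmedians_step(X, centroids):
    """One K-Medians iteration: Manhattan-distance assignment, median update."""
    # Manhattan distance from every centroid to every point: shape (k, n)
    distances = np.abs(X - centroids[:, np.newaxis]).sum(axis=2)
    labels = np.argmin(distances, axis=0)
    # Per-dimension median of each cluster; keep the old centroid if empty
    new_centroids = np.array([
        np.median(X[labels == k], axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return new_centroids, labels

# Toy data: two well-separated blobs
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids = np.vstack([X_toy[0], X_toy[50]])  # one seed point from each blob
for _ in range(10):
    centroids, labels = kmedians_step(X_toy, centroids)
print(centroids)
```

Because the median minimizes the sum of absolute deviations, this variant is also less sensitive to outliers than the mean-based update.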

K-Means with Scikit-Learn

Production code should use scikit-learn’s optimized implementation:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import pandas as pd

# Sample customer dataset
data = {
    'annual_income': [15000, 16000, 17000, 80000, 85000, 90000, 120000, 125000],
    'spending_score': [39, 81, 6, 77, 40, 76, 6, 94],
    'age': [19, 21, 20, 49, 48, 50, 60, 58]
}
df = pd.DataFrame(data)

# Scale features (critical for K-Means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Fit K-Means
kmeans = KMeans(
    n_clusters=3,
    init='k-means++',  # Smart initialization
    n_init=10,         # Run 10 times, keep best
    max_iter=300,
    random_state=42
)
labels = kmeans.fit_predict(X_scaled)

# Evaluate
print(f"Inertia (within-cluster sum of squares): {kmeans.inertia_:.2f}")
print(f"Silhouette score: {silhouette_score(X_scaled, labels):.3f}")

# Add clusters to dataframe
df['cluster'] = labels
print(df.groupby('cluster').mean())

Key parameters: n_init runs the algorithm multiple times with different initializations and keeps the best result. init='k-means++' uses a smart initialization strategy that spreads initial centroids apart. Inertia measures how tight clusters are (lower is better), while silhouette score ranges from -1 to 1 (higher is better, above 0.5 is good).
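
The core of k-means++ can be sketched in a few lines: pick the first centroid uniformly at random, then pick each subsequent one with probability proportional to its squared distance from the nearest centroid chosen so far. This is a simplified illustration; scikit-learn's built-in version adds refinements such as trying several candidates per step.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Simplified k-means++ seeding: D^2-weighted sampling of initial centroids."""
    if rng is None:
        rng = np.random.default_rng()
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniformly at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen centroid
        d2 = ((X - np.array(centroids)[:, np.newaxis]) ** 2).sum(axis=2).min(axis=0)
        # Far-away points are proportionally more likely to be picked next
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Toy data: three well-separated blobs
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(m, 0.3, (50, 2)) for m in ([0, 0], [4, 0], [0, 4])])
init = kmeans_pp_init(X_toy, 3, rng)
print(init)
```

Note that an already-chosen point has distance zero to itself, so it can never be sampled twice; the seeding naturally spreads centroids apart.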

Choosing the Right Number of Clusters

The hardest part of K-Means is choosing k. The Elbow Method plots inertia against k values and looks for an “elbow” where improvement slows:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Elbow Method
inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, labels))

# Plot Elbow curve
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(K_range, inertias, 'bo-')
ax1.set_xlabel('Number of clusters (k)')
ax1.set_ylabel('Inertia')
ax1.set_title('Elbow Method')
ax1.grid(True)

ax2.plot(K_range, silhouette_scores, 'ro-')
ax2.set_xlabel('Number of clusters (k)')
ax2.set_ylabel('Silhouette Score')
ax2.set_title('Silhouette Analysis')
ax2.grid(True)

plt.tight_layout()
plt.show()

Look for the k where inertia stops dropping dramatically. Silhouette scores should be maximized. When these methods disagree, use domain knowledge. For customer segmentation, business constraints might dictate 3-5 segments regardless of mathematical optimality.
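
To turn the curves into a concrete suggestion programmatically, you can take the k that maximizes the silhouette score. A sketch on synthetic blob data (the centers below are made up for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs
X_demo, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [5, 0], [0, 5], [5, 5]],
    cluster_std=0.6,
    random_state=42,
)

# Score each candidate k and keep the best
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

Treat the result as a suggestion, not a verdict: on messy real data the maximum is often shallow, and business constraints may still override it.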

Common Pitfalls and Best Practices

Always scale your features. K-Means uses Euclidean distance, so features with larger ranges dominate. A salary in thousands will overwhelm an age in decades.
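
The effect is easy to demonstrate with made-up numbers: two customers who differ by 40 years of age but only a few hundred dollars of salary. The salary gap contributes over 99% of the raw distance, while the age gap is nearly invisible until you standardize:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up customers: [salary, age]. The first two differ by 40 years
# of age but only ~1.7% in salary.
X_toy = np.array([
    [30000.0, 25.0],
    [30500.0, 65.0],
    [90000.0, 40.0],
    [15000.0, 50.0],
])

# Raw Euclidean distance between the first two is dominated by salary
raw = np.linalg.norm(X_toy[0] - X_toy[1])

# After standardization, the large age gap dominates instead
X_scaled = StandardScaler().fit_transform(X_toy)
scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(f"raw: {raw:.1f}, scaled: {scaled:.2f}")
```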

Random initialization causes inconsistent results. K-Means++ solves this by choosing initial centroids that are far apart:

# Compare random vs K-Means++ initialization
np.random.seed(42)
X_difficult = np.vstack([
    np.random.randn(50, 2) * 0.5 + [0, 0],
    np.random.randn(150, 2) * 0.3 + [3, 3]
])

# Random initialization (bad)
kmeans_random = KMeans(n_clusters=2, init='random', n_init=1, random_state=42)
labels_random = kmeans_random.fit_predict(X_difficult)

# K-Means++ initialization (good)
kmeans_pp = KMeans(n_clusters=2, init='k-means++', n_init=1, random_state=42)
labels_pp = kmeans_pp.fit_predict(X_difficult)

print(f"Random init inertia: {kmeans_random.inertia_:.2f}")
print(f"K-Means++ inertia: {kmeans_pp.inertia_:.2f}")

Outliers skew centroids. Remove obvious outliers before clustering or use DBSCAN, which handles outliers naturally.
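
The skew is easy to quantify on toy data: because the centroid is a mean, a single extreme point drags it away from the cluster, while a median barely moves (one reason K-Medians and K-Medoids are more robust):

```python
import numpy as np

# 100 points around the origin, plus one extreme outlier
cluster = np.random.default_rng(0).normal(0, 1, (100, 2))
with_outlier = np.vstack([cluster, [[50.0, 50.0]]])

# The mean (what K-Means uses) shifts noticeably toward the outlier;
# the median barely moves
print("mean without outlier:", cluster.mean(axis=0))
print("mean with outlier:   ", with_outlier.mean(axis=0))
print("median with outlier: ", np.median(with_outlier, axis=0))
```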

K-Means assumes spherical clusters. It fails on elongated, ring-shaped, or irregularly shaped clusters. Use DBSCAN or Gaussian Mixture Models for complex shapes.
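
A quick way to see the failure is clustering two concentric rings, where the true groups are not blob-shaped at all (a sketch using scikit-learn's make_circles):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric rings: the spherical-cluster assumption is clearly violated
X_rings, y_true = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=42)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_rings)

# K-Means slices the plane in half, so agreement with the true rings
# is near chance level
print(f"ARI: {adjusted_rand_score(y_true, labels):.3f}")
```

Running DBSCAN on the same data recovers the rings, because it groups by density rather than distance to a center.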

Advanced Applications and Variations

Mini-Batch K-Means processes random subsets of data per iteration, enabling clustering on millions of points:

from sklearn.cluster import MiniBatchKMeans

# For large datasets
mbk = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=42)
labels = mbk.fit_predict(X_scaled)
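
MiniBatchKMeans also supports incremental training through partial_fit, which is useful when data arrives in chunks or doesn't fit in memory. A sketch with synthetic streamed batches (the blob locations are made up):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(42)
mbk = MiniBatchKMeans(n_clusters=3, random_state=42)

# Simulate streaming: ten chunks of 300 points drawn from three blobs
for _ in range(10):
    chunk = np.vstack([
        rng.normal(m, 0.5, (100, 2)) for m in ([0, 0], [5, 0], [0, 5])
    ])
    mbk.partial_fit(chunk)

print(mbk.cluster_centers_)
```

The first partial_fit call initializes the centroids; each later call refines them from the new chunk only.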

Image compression reduces colors by clustering pixel values:

from sklearn.cluster import KMeans
import numpy as np
from PIL import Image

# Load image
img = np.array(Image.open('photo.jpg'))
h, w, c = img.shape

# Reshape to 2D array of pixels
pixels = img.reshape(-1, 3)

# Cluster colors
kmeans = KMeans(n_clusters=16, random_state=42)
labels = kmeans.fit_predict(pixels)

# Replace each pixel with its cluster center
compressed = kmeans.cluster_centers_[labels].reshape(h, w, c).astype(np.uint8)

# Storage: 24 bits per pixel originally vs a 4-bit palette index per
# pixel (plus a 16-entry color palette) after clustering
original_bits = h * w * 24
compressed_bits = h * w * 4 + 16 * 24
print(f"Compression ratio: {original_bits / compressed_bits:.1f}x")

This reduces an image with up to 16 million possible colors to just 16 while maintaining recognizable structure. With a simple palette encoding, the compression ratio is roughly 24/log2(k), about 6x for k=16; larger k preserves more detail at a lower ratio.

K-Means also works for feature engineering—cluster assignments become categorical features for supervised learning. Customer clusters might predict churn better than raw features.
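
A sketch of that idea: KMeans.transform returns each point's distance to every centroid, so both the hard cluster label and the distances can be appended as features for a downstream model (toy blob data for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# Hard labels as a categorical feature, distances as continuous features
cluster_id = kmeans.predict(X_demo).reshape(-1, 1)
distances = kmeans.transform(X_demo)  # shape (n_samples, n_clusters)
X_augmented = np.hstack([X_demo, cluster_id, distances])
print(X_augmented.shape)
```

The distance columns often carry more signal than the label alone, since they encode how close a point is to every cluster, not just the nearest one.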

The algorithm’s simplicity is its strength. Master K-Means initialization, feature scaling, and cluster evaluation, and you’ll solve 80% of clustering problems with 20% of the complexity of advanced methods.
