How to Implement K-Means Clustering in Python

Key Insights

  • K-Means clustering partitions data into K distinct groups by iteratively assigning points to the nearest centroid and updating centroids based on cluster means—simple yet powerful for many real-world applications.
  • Implementing K-Means from scratch reveals the algorithm’s mechanics, but scikit-learn’s optimized implementation with K-Means++ initialization should be your production choice.
  • The algorithm assumes spherical clusters of similar size and density; use the Elbow Method or Silhouette Score to determine optimal K, and consider alternatives like DBSCAN for complex cluster shapes.

Introduction to K-Means Clustering

K-Means clustering is an unsupervised learning algorithm that partitions data into K distinct, non-overlapping groups. Each data point belongs to the cluster with the nearest mean (centroid), making it one of the most straightforward and widely used clustering techniques in machine learning.

The algorithm works through an iterative process: it starts with K randomly initialized centroids, assigns each data point to its nearest centroid, recalculates centroids based on the mean of assigned points, and repeats until convergence. This simplicity makes K-Means computationally efficient with O(n·K·i·d) complexity, where n is the number of points, K is the number of clusters, i is the number of iterations, and d is the number of dimensions.

Common use cases include customer segmentation for targeted marketing, image compression by reducing color palettes, document clustering for organizing large text corpora, and anomaly detection by identifying points far from any cluster center. The algorithm’s speed and scalability make it ideal for exploratory data analysis on large datasets.
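To make the image-compression use case concrete, here is a minimal sketch that quantizes an RGB palette down to 8 colors. It uses synthetic pixel values as a stand-in; a real application would load actual image pixels:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic "image": 1,000 random RGB pixels (a real image would be
# reshaped from (height, width, 3) to (n_pixels, 3))
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(1000, 3)).astype(float)

# Cluster the colors, then replace each pixel with its cluster's centroid,
# reducing the palette to at most 8 distinct colors
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)
compressed = kmeans.cluster_centers_[kmeans.labels_]

print(compressed.shape)                     # (1000, 3): same pixels
print(len(np.unique(compressed, axis=0)))   # at most 8 distinct colors
```

Storing 8 centroid colors plus one small label per pixel is what makes this a compression scheme.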

Understanding the Algorithm Steps

K-Means follows four core steps that repeat until convergence:

Initialization: Select K initial centroids. Random selection works but can lead to poor results. K-Means++ (the default in scikit-learn) chooses initial centroids that are far apart, improving convergence and final cluster quality.
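The K-Means++ idea can be sketched in a few lines of NumPy. The function name here is illustrative, not scikit-learn's API; scikit-learn handles this internally when init='k-means++':

```python
import numpy as np

def kmeans_plus_plus_init(X, k, rng=None):
    """Pick k initial centroids: the first uniformly at random, each
    subsequent one sampled with probability proportional to its squared
    distance from the nearest centroid chosen so far."""
    rng = np.random.default_rng(rng)
    centroids = [X[rng.integers(X.shape[0])]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Far-away points are proportionally more likely to be picked
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(X.shape[0], p=probs)])
    return np.array(centroids)

X = np.random.default_rng(42).normal(size=(200, 2))
print(kmeans_plus_plus_init(X, 3, rng=0).shape)  # (3, 2)
```

The distance weighting is what spreads the initial centroids apart and reduces the chance of two starting inside the same true cluster.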

Assignment: Assign each data point to the nearest centroid using a distance metric (typically Euclidean distance). This creates K clusters.

Update: Calculate new centroids by taking the mean of all points assigned to each cluster.

Convergence Check: If centroids haven’t moved significantly or maximum iterations are reached, stop. Otherwise, repeat assignment and update steps.

Here’s the conceptual flow:

# Pseudocode for K-Means
initialize K centroids randomly
while not converged:
    # Assignment step
    for each data point:
        assign to nearest centroid
    
    # Update step
    for each cluster:
        centroid = mean of all points in cluster
    
    # Check convergence
    if centroids haven't changed:
        break

The algorithm minimizes within-cluster sum of squares (inertia), making it a form of variance minimization. However, it’s not guaranteed to find the global optimum—results depend on initialization.
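Inertia is easy to verify by hand: it is the sum of squared distances from each point to its assigned centroid, and computing it directly should match scikit-learn's reported value. A quick check on synthetic blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in ([0, 0], [5, 5])])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inertia = sum of squared distances of points to their assigned centroid
manual = ((X - kmeans.cluster_centers_[kmeans.labels_]) ** 2).sum()
print(np.isclose(manual, kmeans.inertia_))  # True
```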

K-Means from Scratch with NumPy

Building K-Means from scratch solidifies understanding of the mechanics. Here’s a complete implementation:

import numpy as np

def euclidean_distance(x1, x2):
    """Euclidean distance from each row of x1 to the single point x2."""
    return np.sqrt(np.sum((x1 - x2) ** 2, axis=1))

def initialize_centroids(X, k):
    """Randomly select k data points as initial centroids."""
    indices = np.random.choice(X.shape[0], k, replace=False)
    return X[indices]

def assign_clusters(X, centroids):
    """Assign each point to the nearest centroid."""
    distances = np.array([euclidean_distance(X, centroid) for centroid in centroids])
    return np.argmin(distances, axis=0)

def update_centroids(X, labels, k):
    """Calculate new centroids as the mean of assigned points."""
    centroids = np.zeros((k, X.shape[1]))
    for i in range(k):
        cluster_points = X[labels == i]
        if len(cluster_points) > 0:
            centroids[i] = cluster_points.mean(axis=0)
    return centroids

def kmeans_from_scratch(X, k, max_iters=100, tolerance=1e-4):
    """Complete K-Means implementation."""
    centroids = initialize_centroids(X, k)
    
    for iteration in range(max_iters):
        # Assignment step
        labels = assign_clusters(X, centroids)
        
        # Update step
        new_centroids = update_centroids(X, labels, k)
        
        # Check convergence
        if np.all(np.abs(new_centroids - centroids) < tolerance):
            print(f"Converged after {iteration + 1} iterations")
            break
            
        centroids = new_centroids
    
    return labels, centroids

# Test on synthetic data
np.random.seed(42)
X = np.vstack([
    np.random.randn(100, 2) + [2, 2],
    np.random.randn(100, 2) + [-2, -2],
    np.random.randn(100, 2) + [2, -2]
])

labels, centroids = kmeans_from_scratch(X, k=3)
print(f"Final centroids:\n{centroids}")

This implementation demonstrates the core logic but lacks optimizations present in production libraries.

Using Scikit-learn’s KMeans

For real projects, use scikit-learn’s optimized implementation:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Initialize and fit K-Means
kmeans = KMeans(
    n_clusters=3,           # Number of clusters
    init='k-means++',       # Smart initialization (default)
    max_iter=300,           # Maximum iterations
    n_init=10,              # Run algorithm 10 times, pick best
    random_state=42         # Reproducibility
)

# Fit and predict
kmeans.fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Predict on new data
new_samples = [[5.0, 3.5, 1.5, 0.5], [6.5, 3.0, 5.5, 2.0]]
predictions = kmeans.predict(new_samples)

print(f"Cluster labels: {labels[:10]}")
print(f"Inertia (WCSS): {kmeans.inertia_:.2f}")
print(f"Predictions for new samples: {predictions}")

Key parameters to understand:

  • n_clusters: The K value—must be specified upfront
  • init: ‘k-means++’ is superior to ‘random’
  • n_init: Runs the algorithm multiple times with different initializations
  • max_iter: Prevents infinite loops on non-converging data
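To see why n_init matters, compare a single random initialization against the best of 20 restarts on the same data. Because the restarts include that first initialization, the best-of-many run can never end with higher inertia:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Several blobs, where one unlucky random start can hit a poor local optimum
X_demo, _ = make_blobs(n_samples=300, centers=6, random_state=0)

single = KMeans(n_clusters=6, init='random', n_init=1, random_state=1).fit(X_demo)
multi = KMeans(n_clusters=6, init='random', n_init=20, random_state=1).fit(X_demo)

print(f"1 init:   inertia = {single.inertia_:.1f}")
print(f"20 inits: inertia = {multi.inertia_:.1f}")
print(multi.inertia_ <= single.inertia_)  # True
```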

Determining Optimal K

Choosing K is critical but not always obvious. Two primary methods help:

Elbow Method: Plot inertia (within-cluster sum of squares) against K values. Look for the “elbow” where adding clusters yields diminishing returns.

Silhouette Score: Measures how similar points are to their own cluster versus other clusters. Ranges from -1 to 1, with higher values indicating better-defined clusters.

from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Elbow Method
inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))

# Plotting
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(K_range, inertias, 'bo-')
ax1.set_xlabel('Number of Clusters (K)')
ax1.set_ylabel('Inertia')
ax1.set_title('Elbow Method')
ax1.grid(True)

ax2.plot(K_range, silhouette_scores, 'ro-')
ax2.set_xlabel('Number of Clusters (K)')
ax2.set_ylabel('Silhouette Score')
ax2.set_title('Silhouette Analysis')
ax2.grid(True)

plt.tight_layout()
plt.show()

print("Silhouette scores:", [f"{s:.3f}" for s in silhouette_scores])

For the Iris dataset, you’ll see an elbow around K=3 (matching the actual species count). Note, however, that the silhouette score actually peaks at K=2, because two of the three species overlap heavily in feature space. These heuristics can disagree, so weigh them against domain knowledge rather than following either one blindly.

Real-World Application: Customer Segmentation

Let’s build a complete customer segmentation pipeline:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import seaborn as sns
import matplotlib.pyplot as plt

# Create synthetic customer data
np.random.seed(42)
n_customers = 500

data = pd.DataFrame({
    'annual_income': np.random.normal(60000, 20000, n_customers),
    'spending_score': np.random.randint(1, 100, n_customers),
    'age': np.random.randint(18, 70, n_customers),
    'purchase_frequency': np.random.poisson(10, n_customers)
})

# Feature scaling (critical for K-Means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data)

# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
data['cluster'] = kmeans.fit_predict(X_scaled)

# Analyze clusters
cluster_summary = data.groupby('cluster').mean()
print("Cluster Summary:")
print(cluster_summary)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Income vs Spending Score
sns.scatterplot(data=data, x='annual_income', y='spending_score', 
                hue='cluster', palette='viridis', s=100, ax=axes[0])
axes[0].set_title('Customer Segments: Income vs Spending')

# Age vs Purchase Frequency
sns.scatterplot(data=data, x='age', y='purchase_frequency', 
                hue='cluster', palette='viridis', s=100, ax=axes[1])
axes[1].set_title('Customer Segments: Age vs Frequency')

plt.tight_layout()
plt.show()

This pipeline demonstrates feature scaling (essential because K-Means uses distance metrics), clustering, and interpretation through visualization and summary statistics. Each cluster represents a customer segment with distinct characteristics.
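A two-point toy example shows why scaling is essential: on raw features, the income axis (measured in thousands of dollars) swamps the spending score entirely, so Euclidean distance barely notices an 85-point behavioral difference:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two customers: similar incomes, very different spending scores
a = np.array([60000.0, 90.0])   # [annual_income, spending_score]
b = np.array([61000.0, 5.0])

# Raw distance is dominated almost entirely by the income gap
raw = np.linalg.norm(a - b)
print(f"{raw:.1f}")  # ~1003.6; the spending gap barely registers

# After standardization, both features contribute equally
scaler = StandardScaler().fit(np.vstack([a, b]))
a_s, b_s = scaler.transform(np.vstack([a, b]))
print(f"{np.linalg.norm(a_s - b_s):.3f}")  # 2.828
```

Fitting the scaler on just two points is degenerate (each feature becomes exactly ±1), but it makes the distortion easy to see; in practice you fit the scaler on the full dataset, as the pipeline above does.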

Limitations and Best Practices

K-Means has important limitations:

Assumes Spherical Clusters: Struggles with elongated or irregular cluster shapes.

Sensitive to Scale: Always standardize features before clustering.

Requires Pre-specified K: You must decide the number of clusters upfront.

Sensitive to Outliers: Outliers can significantly skew centroids.
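The outlier sensitivity is easy to demonstrate. With K=1 the centroid is simply the mean of the data, so a single extreme point drags it visibly:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_clean = rng.normal(0, 1, size=(100, 2))
X_outlier = np.vstack([X_clean, [[50.0, 50.0]]])  # one extreme point

# With k=1 the fitted centroid is just the mean of the points
c_clean = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X_clean).cluster_centers_[0]
c_out = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X_outlier).cluster_centers_[0]

print(c_clean)  # near the origin
print(c_out)    # pulled noticeably toward the outlier
```

Robust preprocessing (clipping, removing outliers, or using a robust scaler) mitigates this before clustering.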

Here’s a demonstration of K-Means failing on non-spherical data:

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Create non-spherical dataset
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means (poor performance)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_moons)

# DBSCAN (better for non-spherical clusters)
dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_moons)

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.scatter(X_moons[:, 0], X_moons[:, 1], c=kmeans_labels, cmap='viridis')
ax1.set_title('K-Means (Struggles with Shape)')

ax2.scatter(X_moons[:, 0], X_moons[:, 1], c=dbscan_labels, cmap='viridis')
ax2.set_title('DBSCAN (Handles Shape Well)')

plt.tight_layout()
plt.show()

Best Practices:

  • Always scale your features with StandardScaler or MinMaxScaler
  • Use K-Means++ initialization (scikit-learn’s default)
  • Run multiple initializations (n_init=10 or higher)
  • Validate with silhouette scores or domain knowledge
  • Consider alternatives (DBSCAN for arbitrary shapes, hierarchical clustering for dendrograms)
  • Check cluster sizes—very uneven clusters may indicate poor fit

K-Means remains a powerful first approach for clustering problems. Master its implementation, understand its assumptions, and know when to reach for more sophisticated alternatives.
