How to Implement DBSCAN in Python

Key Insights

  • DBSCAN excels at finding arbitrarily shaped clusters and flagging outliers without requiring you to specify the number of clusters upfront, which often makes it a better fit than K-means for messy real-world data
  • The two critical parameters—epsilon (neighborhood radius) and minPts (minimum neighbors)—can be systematically tuned using the k-distance graph method rather than guessing
  • While scikit-learn’s implementation is production-ready, understanding the core algorithm through a custom implementation reveals why spatial indexing matters for datasets beyond a few thousand points

Introduction to DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups points based on density rather than distance from centroids. Unlike K-means, which forces every point into a cluster and assumes spherical shapes, DBSCAN identifies clusters of arbitrary shape and explicitly labels outliers as noise.

This makes DBSCAN particularly valuable for anomaly detection, geographic data analysis, and customer segmentation where you don’t know the cluster count beforehand. If you’ve ever tried K-means on crescent-shaped data or datasets with outliers, you’ve experienced its limitations firsthand. DBSCAN solves these problems elegantly.

How DBSCAN Works

DBSCAN operates on two parameters: epsilon (ε), the radius for neighborhood searches, and minPts, the minimum number of points required to form a dense region.

The algorithm classifies points into three categories:

  • Core points: Points with at least minPts neighbors within epsilon radius
  • Border points: Points within epsilon of a core point but with fewer than minPts neighbors
  • Noise points: Points that are neither core nor border points
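
The three categories can be computed directly from pairwise distances. This is an illustrative sketch (the `classify_points` helper and the toy points are mine, not from any library); note that neighbor counts include the point itself, matching scikit-learn's convention:

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' (illustrative only)."""
    # Pairwise Euclidean distances -- O(n^2), fine for a small demo
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Neighbor counts include the point itself, matching scikit-learn
    is_core = (dists <= eps).sum(axis=1) >= min_pts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append('core')
        elif (is_core & (dists[i] <= eps)).any():
            labels.append('border')
        else:
            labels.append('noise')
    return labels

# A tight square of four points, one nearby point, and one far outlier
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [0.3, 0.0], [5.0, 5.0]])
print(classify_points(X, eps=0.2, min_pts=4))
# -> ['core', 'core', 'core', 'core', 'border', 'noise']
```

The square points each have four neighbors (counting themselves), so they are core; the fifth point has too few neighbors but sits within epsilon of a core point, making it border; the outlier is noise.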

The algorithm works by:

  1. Selecting an arbitrary unvisited point
  2. Finding all points within epsilon distance
  3. If the neighborhood contains minPts points, starting a new cluster and expanding it
  4. Recursively adding all density-reachable points to the cluster
  5. Marking isolated points as noise

Here’s a visualization showing how points are classified:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# Generate sample data
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# Manually demonstrate epsilon neighborhoods
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
ax.scatter(X[:, 0], X[:, 1], c='lightblue', s=50, edgecolors='black')

# Show epsilon radius around a sample point
sample_point = X[50]
circle = plt.Circle(sample_point, 0.3, color='red', fill=False, linewidth=2)
ax.add_patch(circle)
ax.scatter(*sample_point, c='red', s=200, marker='*', edgecolors='black', 
           label=f'Sample point (ε=0.3)')

# Highlight neighbors
distances = np.sqrt(np.sum((X - sample_point)**2, axis=1))
neighbors = X[distances <= 0.3]
ax.scatter(neighbors[:, 0], neighbors[:, 1], c='orange', s=100, 
           edgecolors='black', label=f'Neighbors ({len(neighbors)} points)')

ax.set_title('DBSCAN: Epsilon Neighborhood Visualization')
ax.legend()
plt.tight_layout()
plt.show()

Basic Implementation with scikit-learn

The fastest way to use DBSCAN is through scikit-learn. Here’s a complete example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Generate synthetic data
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

# Scale features (important for DBSCAN)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Original data
ax1.scatter(X[:, 0], X[:, 1], c='gray', s=50)
ax1.set_title('Original Data')

# Clustered data
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))

for label, color in zip(unique_labels, colors):
    if label == -1:
        # Noise points in black
        color = 'black'
        marker = 'x'
    else:
        marker = 'o'
    
    mask = labels == label
    ax2.scatter(X[mask, 0], X[mask, 1], c=[color], s=50, 
                marker=marker, label=f'Cluster {label}')

ax2.set_title(f'DBSCAN Clustering (eps=0.3, min_samples=5)\n'
              f'Clusters: {len(set(labels)) - (1 if -1 in labels else 0)}')
ax2.legend()
plt.tight_layout()
plt.show()

print(f"Number of clusters: {len(set(labels)) - (1 if -1 in labels else 0)}")
print(f"Number of noise points: {list(labels).count(-1)}")

Notice that scaling is crucial. DBSCAN relies on distance calculations, so a feature with a much larger numeric range will dominate the distances and effectively decide the epsilon neighborhood on its own.
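
To see this concretely, here is a sketch with made-up customer-style features (the age/income columns and all numbers are illustrative, not from any real dataset). With raw units, income differences of thousands of dollars swamp age differences of a few years, so no point ever accumulates enough epsilon-neighbors:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical customer data: age in years, income in dollars
age = rng.normal(40, 5, 200)
income = rng.normal(50_000, 8_000, 200)
X = np.column_stack([age, income])

# Unscaled: income gaps of thousands dwarf eps=0.5, so nothing clusters
labels_raw = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Scaled: both features contribute comparably to the distance
labels_scaled = DBSCAN(eps=0.5, min_samples=5).fit_predict(
    StandardScaler().fit_transform(X))

print("noise points (unscaled):", list(labels_raw).count(-1))
print("noise points (scaled):  ", list(labels_scaled).count(-1))
```

On the unscaled data every point ends up as noise, because no point has five neighbors within a 0.5-unit radius measured in dollars; after scaling, clusters form normally with the same parameters.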

Choosing Optimal Parameters

The k-distance graph is the most systematic approach for selecting epsilon. Plot the distance to the k-th nearest neighbor (where k = minPts) for all points, sorted in ascending order. The optimal epsilon is at the “elbow” where the curve sharply increases.

from sklearn.neighbors import NearestNeighbors

def find_optimal_epsilon(X, min_samples=5):
    """Find optimal epsilon using k-distance graph."""
    neighbors = NearestNeighbors(n_neighbors=min_samples)
    neighbors_fit = neighbors.fit(X)
    distances, indices = neighbors_fit.kneighbors(X)
    
    # kneighbors includes each query point itself at distance 0, so the last
    # column holds the distance to the (min_samples - 1)-th other point
    distances = np.sort(distances[:, min_samples - 1], axis=0)
    
    # Plot k-distance graph
    plt.figure(figsize=(10, 6))
    plt.plot(distances)
    plt.ylabel(f'Distance to {min_samples}-th Nearest Neighbor')
    plt.xlabel('Points sorted by distance')
    plt.title('K-distance Graph for Epsilon Selection')
    plt.grid(True)
    plt.show()
    
    return distances

# Use with our data
X_scaled = StandardScaler().fit_transform(X)
distances = find_optimal_epsilon(X_scaled, min_samples=5)

# Look for the elbow: the index where the sorted distances start rising sharply
# For this dataset, an epsilon around 0.3 works well

For minPts, a common heuristic is minPts = 2 * dimensions for datasets with dimensions > 2, or simply 4-5 for 2D data.
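
Wrapped as a tiny helper (the name `default_min_samples` is mine, and the heuristic is a rule of thumb, not a guarantee):

```python
def default_min_samples(n_features):
    """Heuristic starting point: 2 * dimensions, floored at 4 for low-D data."""
    return max(4, 2 * n_features)

print(default_min_samples(2))   # 2D data -> 4
print(default_min_samples(8))   # 8D data -> 16
```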

Real-World Application

Let’s apply DBSCAN to customer segmentation using the Iris dataset as a proxy for customer features:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.cluster import KMeans

# Load data
iris = load_iris()
X = iris.data

# Preprocessing
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)

# Compare with K-means
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Evaluation (excluding noise points for fair comparison)
mask = dbscan_labels != -1
if len(set(dbscan_labels[mask])) > 1:
    dbscan_silhouette = silhouette_score(X_scaled[mask], dbscan_labels[mask])
    dbscan_davies = davies_bouldin_score(X_scaled[mask], dbscan_labels[mask])
else:
    dbscan_silhouette = dbscan_davies = None

kmeans_silhouette = silhouette_score(X_scaled, kmeans_labels)
kmeans_davies = davies_bouldin_score(X_scaled, kmeans_labels)

print("DBSCAN Results:")
print(f"  Clusters: {len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)}")
print(f"  Noise points: {list(dbscan_labels).count(-1)}")
print(f"  Silhouette Score: {dbscan_silhouette:.3f}" if dbscan_silhouette is not None else "  Silhouette Score: N/A")
print(f"  Davies-Bouldin Index: {dbscan_davies:.3f}" if dbscan_davies is not None else "  Davies-Bouldin Index: N/A")

print("\nK-means Results:")
print(f"  Silhouette Score: {kmeans_silhouette:.3f}")
print(f"  Davies-Bouldin Index: {kmeans_davies:.3f}")

Custom Implementation from Scratch

Understanding the internals helps debug issues and optimize for specific use cases:

class SimpleDBSCAN:
    def __init__(self, eps=0.5, min_samples=5):
        self.eps = eps
        self.min_samples = min_samples
        self.labels_ = None
    
    def fit(self, X):
        n_samples = X.shape[0]
        self.labels_ = np.full(n_samples, -1)  # -1 marks unvisited / noise
        cluster_id = 0
        
        for i in range(n_samples):
            # Skip points already claimed by an earlier cluster
            if self.labels_[i] != -1:
                continue
            
            # Find neighbors
            neighbors = self._find_neighbors(X, i)
            
            # Check if core point
            if len(neighbors) < self.min_samples:
                continue  # Leave as noise for now
            
            # Start new cluster
            self.labels_[i] = cluster_id
            
            # Expand cluster
            self._expand_cluster(X, neighbors, cluster_id)
            cluster_id += 1
        
        return self
    
    def _find_neighbors(self, X, point_idx):
        """Find all points within eps distance."""
        distances = np.sqrt(np.sum((X - X[point_idx])**2, axis=1))
        return np.where(distances <= self.eps)[0]
    
    def _expand_cluster(self, X, neighbors, cluster_id):
        """Iteratively add all density-reachable points to the cluster."""
        i = 0
        while i < len(neighbors):
            neighbor_idx = neighbors[i]
            
            # Unassigned points (still -1) join the current cluster
            if self.labels_[neighbor_idx] == -1:
                self.labels_[neighbor_idx] = cluster_id
                
                # Find neighbors of this neighbor
                new_neighbors = self._find_neighbors(X, neighbor_idx)
                
                # If it is a core point, its neighbors become candidates too;
                # otherwise it stays a border point of this cluster
                if len(new_neighbors) >= self.min_samples:
                    neighbors = np.concatenate([neighbors, new_neighbors])
            
            i += 1

# Test custom implementation
custom_dbscan = SimpleDBSCAN(eps=0.3, min_samples=5)
custom_labels = custom_dbscan.fit(X_scaled).labels_

print(f"Custom DBSCAN found {len(set(custom_labels)) - (1 if -1 in custom_labels else 0)} clusters")

Performance Optimization and Best Practices

DBSCAN has O(n²) time complexity with naive pairwise distance calculations, but this improves to roughly O(n log n) on average with spatial indexing. scikit-learn's default algorithm='auto' already picks a suitable index, but for datasets beyond about 10,000 points it is worth benchmarking the options:

# Performance comparison
import time

X_large, _ = make_moons(n_samples=10000, noise=0.05, random_state=42)
X_large_scaled = StandardScaler().fit_transform(X_large)

algorithms = ['auto', 'ball_tree', 'kd_tree', 'brute']
times = {}

for algo in algorithms:
    start = time.time()
    DBSCAN(eps=0.3, min_samples=5, algorithm=algo).fit(X_large_scaled)
    times[algo] = time.time() - start
    print(f"{algo}: {times[algo]:.4f} seconds")

Key best practices:

  1. Always scale your features before applying DBSCAN
  2. Use the k-distance graph to select epsilon systematically
  3. Start with minPts = 2 * dimensions as a baseline
  4. For high-dimensional data (>10 dimensions), consider dimensionality reduction first
  5. Use spatial indexing (ball_tree or kd_tree) for large datasets
  6. Validate results by examining noise points—too many suggests poor parameters
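
The last practice is easy to automate as a sanity check. This sketch sweeps epsilon on the moons data from earlier and flags runs where the noise fraction exceeds 30% (that threshold is a rule of thumb I chose, not a standard):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

noise_fracs = {}
for eps in (0.05, 0.3, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_fracs[eps] = np.mean(labels == -1)
    flag = "  <- suspicious" if noise_fracs[eps] > 0.30 else ""
    print(f"eps={eps}: {n_clusters} clusters, {noise_fracs[eps]:.0%} noise{flag}")
```

Because the set of core points only grows as epsilon increases, the noise fraction can only shrink; a very small epsilon marks most points as noise, while a very large one merges everything into a single cluster.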

DBSCAN shines when cluster shapes are non-convex or when outlier detection matters. For spherical clusters with known counts, K-means is faster. Choose the right tool for your data’s geometry.
