Database Sharding: Consistent Hashing vs. Range-Based Strategies
Key Insights
- Range-based sharding offers simple implementation and efficient range queries but suffers from hotspots and requires manual rebalancing when shards become uneven
- Consistent hashing distributes data uniformly across shards and minimizes rebalancing when adding or removing nodes, making it ideal for dynamic clusters
- The choice between strategies depends on your query patterns: use range-based for sequential access and analytics, consistent hashing for distributed key-value workloads
Understanding Database Sharding
Sharding is horizontal partitioning at the database level—splitting your data across multiple physical databases based on a shard key. When your database hits millions of rows and query performance degrades despite proper indexing, or when a single server can’t handle the write throughput, sharding becomes necessary.
The core concept is simple: instead of one massive database, you distribute data across multiple smaller databases (shards). Each shard contains a subset of the total data, determined by the sharding strategy. The challenge lies in deciding how to distribute this data and route queries to the correct shard.
```python
# Monolithic approach
class MonolithicDB:
    def get_user(self, user_id):
        return self.db.query("SELECT * FROM users WHERE id = ?", user_id)

# Sharded approach
class ShardedDB:
    def __init__(self, shards):
        self.shards = shards  # List of database connections

    def get_user(self, user_id):
        shard = self.route_to_shard(user_id)
        return shard.query("SELECT * FROM users WHERE id = ?", user_id)

    def route_to_shard(self, user_id):
        # Strategy determines which shard to use
        pass
```
Range-Based Sharding: Simple but Problematic
Range-based sharding divides data into contiguous ranges based on the shard key. If you're sharding users by ID with a range size of 1,000,000, shard 0 holds IDs 0-999,999, shard 1 holds 1,000,000-1,999,999, and so on.
The implementation is straightforward:
```python
class RangeBasedSharding:
    def __init__(self, shards, range_size=1_000_000):
        self.shards = shards
        self.range_size = range_size

    def get_shard(self, user_id):
        shard_index = user_id // self.range_size
        if shard_index >= len(self.shards):
            raise ValueError(f"No shard for user_id {user_id}")
        return self.shards[shard_index]

    def get_users_in_range(self, start_id, end_id):
        # Efficient range queries: only the shards covering the range are hit
        start_shard = start_id // self.range_size
        end_shard = end_id // self.range_size
        results = []
        for shard_idx in range(start_shard, end_shard + 1):
            shard = self.shards[shard_idx]
            results.extend(shard.query(
                "SELECT * FROM users WHERE id BETWEEN ? AND ?",
                max(start_id, shard_idx * self.range_size),
                min(end_id, (shard_idx + 1) * self.range_size - 1)
            ))
        return results
```
Range-based sharding excels at range queries and sequential access patterns. If you need to fetch all users created in a specific time period, you can query specific shards rather than all of them.
However, the problems are significant:
Hotspots: If your application creates users sequentially, all writes go to the latest shard while older shards sit idle. The newest shard becomes a bottleneck.
Uneven distribution: Not all ranges contain equal amounts of data. User IDs 1-1,000,000 might represent inactive legacy accounts while 9,000,001-10,000,000 are active users generating constant queries.
Manual rebalancing: When a shard fills up, you need to manually split it and redistribute data. This is operationally complex and often requires downtime.
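The hotspot failure mode is easy to reproduce. A minimal standalone sketch (the shard labels are illustrative, not real connections): with sequential IDs, every new write routes to the highest-numbered shard while the rest sit idle.

```python
from collections import Counter

def route(user_id, range_size=1_000_000):
    """Range-based routing: integer-divide the key by the range size."""
    return f"shard_{user_id // range_size}"

# Simulate 1,000 sequential signups starting at ID 9,000,001:
# every single write lands on the newest shard.
writes = Counter(route(uid) for uid in range(9_000_001, 9_001_001))
print(writes)  # Counter({'shard_9': 1000})
```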
```sql
-- Querying a specific shard is straightforward
-- Connect to shard_2.db
SELECT * FROM users WHERE id BETWEEN 2000001 AND 3000000;

-- But cross-shard queries require application-level aggregation
-- Must query multiple shards and merge results
```
Consistent Hashing: The Dynamic Solution
Consistent hashing solves the rebalancing problem by mapping both data keys and shard nodes onto a fixed hash ring (typically 0 to 2^32-1). When you hash a key, you move clockwise around the ring until you find a node—that’s where the data lives.
The breakthrough is what happens when you add or remove nodes: only a small portion of keys need to move, proportional to 1/N where N is the number of nodes. In range-based sharding, adding a node could require redistributing half your data.
```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    def __init__(self, nodes=None, virtual_nodes=150):
        self.virtual_nodes = virtual_nodes
        self.ring = {}          # hash position -> physical node
        self.sorted_keys = []   # sorted hash positions
        if nodes:
            for node in nodes:
                self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        """Add a node with virtual nodes for better distribution"""
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:{i}"
            hash_key = self._hash(virtual_key)
            self.ring[hash_key] = node
            self.sorted_keys.append(hash_key)
        self.sorted_keys.sort()

    def remove_node(self, node):
        """Remove a node and its virtual nodes"""
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:{i}"
            hash_key = self._hash(virtual_key)
            del self.ring[hash_key]
            self.sorted_keys.remove(hash_key)

    def get_node(self, key):
        """Find which node should store this key"""
        if not self.ring:
            return None
        hash_key = self._hash(str(key))
        # Find the first node clockwise from the hash
        index = bisect_right(self.sorted_keys, hash_key)
        # Wrap around if we're past the last node
        if index == len(self.sorted_keys):
            index = 0
        return self.ring[self.sorted_keys[index]]
```
Virtual nodes are critical. With only one point per physical node, the arcs between adjacent points vary wildly in size, so some nodes own far more of the key space than others regardless of hash quality. By placing 150 virtual nodes per physical node, each node appears at many points on the ring, and the law of large numbers smooths out the distribution.
```python
# Demonstrating distribution quality
ring = ConsistentHashRing(['shard_0', 'shard_1', 'shard_2'], virtual_nodes=150)

# Simulate distributing 10,000 keys
distribution = {}
for i in range(10000):
    node = ring.get_node(f"user_{i}")
    distribution[node] = distribution.get(node, 0) + 1

print(distribution)
# Output: {'shard_0': 3331, 'shard_1': 3342, 'shard_2': 3327}
# Nearly perfect distribution

# Now add a fourth shard
ring.add_node('shard_3')
new_distribution = {}
for i in range(10000):
    node = ring.get_node(f"user_{i}")
    new_distribution[node] = new_distribution.get(node, 0) + 1

print(new_distribution)
# Output: {'shard_0': 2489, 'shard_1': 2512, 'shard_2': 2501, 'shard_3': 2498}
# Only ~25% of keys moved (all to the new shard) -- a naive hash % N
# scheme would have relocated roughly 75% of them
```
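The virtual-node count itself is worth measuring. This standalone sketch rebuilds a minimal ring (same MD5-onto-a-ring idea, with hypothetical node names `a`, `b`, `c`) and compares the busiest and idlest node at different virtual-node counts:

```python
import hashlib
from bisect import bisect_right

def build_ring(nodes, vnodes):
    """Place every virtual node on the ring; return sorted positions and owners."""
    points = sorted(
        (int(hashlib.md5(f"{node}:{i}".encode()).hexdigest(), 16), node)
        for node in nodes
        for i in range(vnodes)
    )
    hashes = [h for h, _ in points]
    owners = [n for _, n in points]
    return hashes, owners

def lookup(hashes, owners, key):
    """First owner clockwise from the key's hash, wrapping at the end."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return owners[bisect_right(hashes, h) % len(hashes)]

def spread(vnodes, num_keys=10_000):
    """Return (min, max) keys-per-node for a three-node ring."""
    hashes, owners = build_ring(['a', 'b', 'c'], vnodes)
    counts = {}
    for i in range(num_keys):
        node = lookup(hashes, owners, f"user_{i}")
        counts[node] = counts.get(node, 0) + 1
    return min(counts.values()), max(counts.values())

print(spread(1))    # wide gap between busiest and idlest node
print(spread(150))  # counts cluster tightly around 3,333
```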
Comparing the Approaches
| Aspect | Range-Based | Consistent Hashing |
|---|---|---|
| Implementation | Simple | Moderate complexity |
| Range queries | Efficient | Requires querying all shards |
| Hotspots | Common (sequential writes) | Rare (uniform distribution) |
| Rebalancing | Manual, expensive | Automatic, minimal data movement |
| Adding nodes | Requires resharding | Add to ring, ~1/N keys move |
| Use cases | Time-series, sequential IDs | User data, sessions, caching |
Here’s a simulation showing the rebalancing difference:
```python
def simulate_rebalancing(strategy, num_keys=10000, old_shards=3, new_shards=4):
    """Compare data movement when adding a shard"""
    if strategy == "range":
        # Range-based: adding a shard shrinks every range,
        # so keys shift across almost every shard boundary
        old_range_size = num_keys // old_shards
        new_range_size = num_keys // new_shards
        moved_keys = 0
        for key in range(num_keys):
            old_shard = min(key // old_range_size, old_shards - 1)
            new_shard = min(key // new_range_size, new_shards - 1)
            if old_shard != new_shard:
                moved_keys += 1
        return moved_keys
    else:  # consistent hashing
        old_ring = ConsistentHashRing([f"shard_{i}" for i in range(old_shards)])
        new_ring = ConsistentHashRing([f"shard_{i}" for i in range(new_shards)])
        moved_keys = 0
        for key in range(num_keys):
            if old_ring.get_node(str(key)) != new_ring.get_node(str(key)):
                moved_keys += 1
        return moved_keys

print(f"Range-based moved: {simulate_rebalancing('range')}")     # ~5000 keys (~50%)
print(f"Consistent hash moved: {simulate_rebalancing('hash')}")  # ~2500 keys (~25%)

# The operational difference is just as important:
# Range: must manually split shards, update routing logic, coordinate migration
# Consistent: add node to ring, let it pull its data, update ring config
```
Real-World Considerations
Shard key selection is critical. Choose a key with high cardinality that distributes evenly. User ID works well; country code does not (unless you want geographic sharding).
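To make the cardinality point concrete, here is a sketch comparing the two keys under simple hash routing; the shard count and traffic mix are made-up numbers for illustration:

```python
import hashlib
from collections import Counter

def shard_for(key, num_shards=8):
    """Hash-route a key to one of num_shards buckets."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_shards

# High-cardinality key: 100,000 distinct user IDs reach every shard.
by_user = Counter(shard_for(uid) for uid in range(100_000))

# Low-cardinality key: three country codes can reach at most three shards,
# and traffic concentrates wherever the big countries happen to land.
by_country = Counter(
    shard_for(c) for c in ['US'] * 80_000 + ['DE'] * 15_000 + ['NZ'] * 5_000
)

print(len(by_user), len(by_country))  # 8 shards used vs. at most 3
```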
Cross-shard queries are expensive. If you need to join users with orders and they’re on different shards, you’ll need application-level joins. Design your schema to minimize this.
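An application-level join is typically scatter-gather: query every shard, then merge in code. A sketch with shards mocked as in-memory dicts (the `AppLevelJoin` class and its sample rows are hypothetical):

```python
class AppLevelJoin:
    """Scatter-gather join: fetch matching rows from every shard,
    then join users to orders in application code."""

    def __init__(self, shards):
        self.shards = shards  # each shard mocked as {table_name: [rows]}

    def join_users_orders(self, user_ids):
        # Scatter: any shard may hold some of the users or orders.
        users, orders = {}, []
        for shard in self.shards:
            for row in shard['users']:
                if row['id'] in user_ids:
                    users[row['id']] = row
            for row in shard['orders']:
                if row['user_id'] in user_ids:
                    orders.append(row)
        # Gather: merge in the application -- work a single-node
        # database would have done for free.
        return [
            {**users[o['user_id']], 'order_id': o['id']}
            for o in orders if o['user_id'] in users
        ]

shards = [
    {'users': [{'id': 1, 'name': 'Ada'}], 'orders': [{'id': 10, 'user_id': 2}]},
    {'users': [{'id': 2, 'name': 'Lin'}], 'orders': [{'id': 11, 'user_id': 1}]},
]
print(AppLevelJoin(shards).join_users_orders({1, 2}))
```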
Monitor shard health continuously:
```python
class ShardMonitor:
    def check_distribution(self, sharding_strategy, sample_size=1000):
        """Detect imbalanced shards"""
        # In practice, sample real keys from production traffic;
        # a sequential range can mask (or exaggerate) hotspots
        distribution = {}
        for i in range(sample_size):
            shard = sharding_strategy.get_shard(i)
            shard_id = id(shard)
            distribution[shard_id] = distribution.get(shard_id, 0) + 1
        avg = sample_size / len(distribution)
        for shard_id, count in distribution.items():
            deviation = abs(count - avg) / avg
            if deviation > 0.2:  # More than 20% deviation
                print(f"Warning: shard {shard_id} has {deviation*100:.1f}% deviation")
        return distribution
```
Migration strategies: When rebalancing, run a dual-write period in which every write goes to both the old and new locations, then backfill historical data, then switch reads over, and only then delete the old copy. Never do big-bang migrations.
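That phased migration can be sketched as a small router. Here `old` and `new` stand in for real shard connections (plain dicts), and the `DualWriteRouter` class is a hypothetical illustration of the sequencing, not a production implementation:

```python
class DualWriteRouter:
    """Phased migration: dual-write, backfill, flip reads, retire old data."""

    def __init__(self, old, new):
        self.old, self.new = old, new
        self.read_from_new = False  # flipped only after backfill is verified

    def write(self, key, value):
        # Phase 1: every new write goes to both locations.
        self.old[key] = value
        self.new[key] = value

    def backfill(self):
        # Phase 2: copy historical rows the dual-write never touched.
        for key, value in self.old.items():
            self.new.setdefault(key, value)

    def read(self, key):
        # Phase 3: switch reads once old and new agree.
        store = self.new if self.read_from_new else self.old
        return store.get(key)

old = {'u1': 'alice'}          # pre-existing data on the old shard
router = DualWriteRouter(old, {})
router.write('u2', 'bob')      # dual-written to both locations
router.backfill()              # 'u1' copied to the new shard
router.read_from_new = True    # flip reads; old shard can now be retired
print(router.read('u1'), router.read('u2'))  # alice bob
```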
Choosing Your Strategy
Use range-based sharding when:
- You have naturally sequential data (time-series, logs)
- Range queries are common
- Your data growth is predictable
- You can tolerate manual rebalancing
Use consistent hashing when:
- You need dynamic scaling
- Data access is random (key-value lookups)
- You want to minimize operational overhead
- Uniform distribution is critical
Many production systems use hybrid approaches: range-based sharding by date for time-series data, with hash-based distribution within each range for user data. Instagram's ID generation, for example, embeds a timestamp alongside a logical shard number derived from the user ID, giving time-ordered IDs while still spreading users evenly across shards.
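One way to sketch such a hybrid router, with a made-up bucket count and simple modulo hashing standing in for a full consistent-hash ring:

```python
import hashlib
from datetime import date

def hybrid_shard(created: date, user_id: int, hash_buckets=4):
    """Route first by month (range dimension), then by hashed user ID
    within that month (hash dimension)."""
    time_bucket = created.strftime("%Y-%m")
    user_bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % hash_buckets
    return f"{time_bucket}:{user_bucket}"

# A time-range query only touches that month's buckets; within a month,
# users spread across the hash buckets.
print(hybrid_shard(date(2024, 3, 15), 42))  # '2024-03:' followed by a bucket 0-3
```

Because the time bucket comes first, a query for one month fans out to at most `hash_buckets` shards instead of all of them.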
The future is auto-sharding systems like Vitess (for MySQL) or Citus (for PostgreSQL) that handle distribution transparently. But understanding these fundamentals helps you debug issues, optimize queries, and make informed architecture decisions even when using managed solutions.