Database Sharding: Consistent Hashing vs. Range-Based Strategies
Key Insights
- Range-based sharding offers simple implementation and efficient range queries but suffers from hotspots and requires manual rebalancing when shards become uneven
- Consistent hashing distributes data uniformly across shards and minimizes rebalancing when adding or removing nodes, making it ideal for dynamic clusters
- The choice between strategies depends on your query patterns: use range-based for sequential access and analytics, consistent hashing for distributed key-value workloads
Understanding Database Sharding
Sharding is horizontal partitioning at the database level—splitting your data across multiple physical databases based on a shard key. When your database hits millions of rows and query performance degrades despite proper indexing, or when a single server can’t handle the write throughput, sharding becomes necessary.
The core concept is simple: instead of one massive database, you distribute data across multiple smaller databases (shards). Each shard contains a subset of the total data, determined by the sharding strategy. The challenge lies in deciding how to distribute this data and route queries to the correct shard.
```python
# Monolithic approach
class MonolithicDB:
    def get_user(self, user_id):
        return self.db.query("SELECT * FROM users WHERE id = ?", user_id)

# Sharded approach
class ShardedDB:
    def __init__(self, shards):
        self.shards = shards  # List of database connections

    def get_user(self, user_id):
        shard = self.route_to_shard(user_id)
        return shard.query("SELECT * FROM users WHERE id = ?", user_id)

    def route_to_shard(self, user_id):
        # Strategy determines which shard to use
        pass
```
Range-Based Sharding: Simple but Problematic
Range-based sharding divides data into contiguous ranges based on the shard key. If you're sharding users by ID with a range size of 1,000,000, shard 0 holds IDs 0-999,999, shard 1 holds 1,000,000-1,999,999, and so on.
The implementation is straightforward:
```python
class RangeBasedSharding:
    def __init__(self, shards, range_size=1_000_000):
        self.shards = shards
        self.range_size = range_size

    def get_shard(self, user_id):
        shard_index = user_id // self.range_size
        if shard_index >= len(self.shards):
            raise ValueError(f"No shard for user_id {user_id}")
        return self.shards[shard_index]

    def get_users_in_range(self, start_id, end_id):
        # Efficient range queries: only the shards covering the range are hit
        start_shard = start_id // self.range_size
        end_shard = end_id // self.range_size
        results = []
        for shard_idx in range(start_shard, end_shard + 1):
            shard = self.shards[shard_idx]
            results.extend(shard.query(
                "SELECT * FROM users WHERE id BETWEEN ? AND ?",
                max(start_id, shard_idx * self.range_size),
                min(end_id, (shard_idx + 1) * self.range_size - 1)
            ))
        return results
```
Range-based sharding excels at range queries and sequential access patterns. If you need to fetch all users created in a specific time period, you can query specific shards rather than all of them.
However, the problems are significant:
Hotspots: If your application creates users sequentially, all writes go to the latest shard while older shards sit idle. The newest shard becomes a bottleneck.
Uneven distribution: Not all ranges contain equal amounts of data. User IDs 1-1,000,000 might represent inactive legacy accounts while 9,000,001-10,000,000 are active users generating constant queries.
Manual rebalancing: When a shard fills up, you need to manually split it and redistribute data. This is operationally complex and often requires downtime.
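The hotspot failure mode is easy to reproduce. A minimal standalone sketch (the shard labels are illustrative, not real connections): with sequential IDs, every new write routes to the highest-numbered shard while the rest sit idle.

```python
from collections import Counter

def route(user_id, range_size=1_000_000):
    """Range-based routing: integer-divide the key by the range size."""
    return f"shard_{user_id // range_size}"

# Simulate 1,000 sequential signups starting at ID 9,000,001:
# every single write lands on the newest shard.
writes = Counter(route(uid) for uid in range(9_000_001, 9_001_001))
print(writes)  # Counter({'shard_9': 1000})
```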
```sql
-- Querying a specific shard is straightforward
-- Connect to shard_2.db
SELECT * FROM users WHERE id BETWEEN 2000001 AND 3000000;

-- But cross-shard queries require application-level aggregation
-- Must query multiple shards and merge results
```
Consistent Hashing: The Dynamic Solution
Consistent hashing solves the rebalancing problem by mapping both data keys and shard nodes onto a fixed hash ring (typically 0 to 2^32-1). When you hash a key, you move clockwise around the ring until you find a node—that’s where the data lives.
The breakthrough is what happens when you add or remove nodes: only a small portion of keys need to move, proportional to 1/N where N is the number of nodes. In range-based sharding, adding a node could require redistributing half your data.
```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    def __init__(self, nodes=None, virtual_nodes=150):
        self.virtual_nodes = virtual_nodes
        self.ring = {}          # hash position -> physical node
        self.sorted_keys = []   # sorted hash positions
        if nodes:
            for node in nodes:
                self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        """Add a node with virtual nodes for better distribution"""
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:{i}"
            hash_key = self._hash(virtual_key)
            self.ring[hash_key] = node
            self.sorted_keys.append(hash_key)
        self.sorted_keys.sort()

    def remove_node(self, node):
        """Remove a node and its virtual nodes"""
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:{i}"
            hash_key = self._hash(virtual_key)
            del self.ring[hash_key]
            self.sorted_keys.remove(hash_key)

    def get_node(self, key):
        """Find which node should store this key"""
        if not self.ring:
            return None
        hash_key = self._hash(str(key))
        # Find the first node clockwise from the hash
        index = bisect_right(self.sorted_keys, hash_key)
        # Wrap around if we're past the last node
        if index == len(self.sorted_keys):
            index = 0
        return self.ring[self.sorted_keys[index]]
```
Virtual nodes are critical. With only one point per physical node, the arcs between adjacent points vary wildly in size, so some nodes own far more of the key space than others regardless of hash quality. By placing 150 virtual nodes per physical node, each node appears at many points on the ring, and the law of large numbers smooths out the distribution.
```python
# Demonstrating distribution quality
ring = ConsistentHashRing(['shard_0', 'shard_1', 'shard_2'], virtual_nodes=150)

# Simulate distributing 10,000 keys
distribution = {}
for i in range(10000):
    node = ring.get_node(f"user_{i}")
    distribution[node] = distribution.get(node, 0) + 1

print(distribution)
# Output: {'shard_0': 3331, 'shard_1': 3342, 'shard_2': 3327}
# Nearly perfect distribution

# Now add a fourth shard
ring.add_node('shard_3')
new_distribution = {}
for i in range(10000):
    node = ring.get_node(f"user_{i}")
    new_distribution[node] = new_distribution.get(node, 0) + 1

print(new_distribution)
# Output: {'shard_0': 2489, 'shard_1': 2512, 'shard_2': 2501, 'shard_3': 2498}
# Only ~25% of keys moved (all to the new shard) -- a naive hash % N
# scheme would have relocated roughly 75% of them
```
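The virtual-node count itself is worth measuring. This standalone sketch rebuilds a minimal ring (same MD5-onto-a-ring idea, with hypothetical node names `a`, `b`, `c`) and compares the busiest and idlest node at different virtual-node counts:

```python
import hashlib
from bisect import bisect_right

def build_ring(nodes, vnodes):
    """Place every virtual node on the ring; return sorted positions and owners."""
    points = sorted(
        (int(hashlib.md5(f"{node}:{i}".encode()).hexdigest(), 16), node)
        for node in nodes
        for i in range(vnodes)
    )
    hashes = [h for h, _ in points]
    owners = [n for _, n in points]
    return hashes, owners

def lookup(hashes, owners, key):
    """First owner clockwise from the key's hash, wrapping at the end."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return owners[bisect_right(hashes, h) % len(hashes)]

def spread(vnodes, num_keys=10_000):
    """Return (min, max) keys-per-node for a three-node ring."""
    hashes, owners = build_ring(['a', 'b', 'c'], vnodes)
    counts = {}
    for i in range(num_keys):
        node = lookup(hashes, owners, f"user_{i}")
        counts[node] = counts.get(node, 0) + 1
    return min(counts.values()), max(counts.values())

print(spread(1))    # wide gap between busiest and idlest node
print(spread(150))  # counts cluster tightly around 3,333
```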
Comparing the Approaches
| Aspect | Range-Based | Consistent Hashing |
|---|---|---|
| Implementation | Simple | Moderate complexity |
| Range queries | Efficient | Requires querying all shards |
| Hotspots | Common (sequential writes) | Rare (uniform distribution) |
| Rebalancing | Manual, expensive | Automatic, minimal data movement |
| Adding nodes | Requires resharding | Add to ring, ~1/N keys move |
| Use cases | Time-series, sequential IDs | User data, sessions, caching |
Here’s a simulation showing the rebalancing difference:
```python
def simulate_rebalancing(strategy, num_keys=10000, old_shards=3, new_shards=4):
    """Compare data movement when adding a shard"""
    if strategy == "range":
        # Range-based: adding a shard shrinks every range,
        # so keys shift across almost every shard boundary
        old_range_size = num_keys // old_shards
        new_range_size = num_keys // new_shards
        moved_keys = 0
        for key in range(num_keys):
            old_shard = min(key // old_range_size, old_shards - 1)
            new_shard = min(key // new_range_size, new_shards - 1)
            if old_shard != new_shard:
                moved_keys += 1
        return moved_keys
    else:  # consistent hashing
        old_ring = ConsistentHashRing([f"shard_{i}" for i in range(old_shards)])
        new_ring = ConsistentHashRing([f"shard_{i}" for i in range(new_shards)])
        moved_keys = 0
        for key in range(num_keys):
            if old_ring.get_node(str(key)) != new_ring.get_node(str(key)):
                moved_keys += 1
        return moved_keys

print(f"Range-based moved: {simulate_rebalancing('range')}")     # ~5000 keys (~50%)
print(f"Consistent hash moved: {simulate_rebalancing('hash')}")  # ~2500 keys (~25%)

# The operational difference is just as important:
# Range: must manually split shards, update routing logic, coordinate migration
# Consistent: add node to ring, let it pull its data, update ring config
```
Real-World Considerations
Shard key selection is critical. Choose a key with high cardinality that distributes evenly. User ID works well; country code does not (unless you want geographic sharding).
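To make the cardinality point concrete, here is a sketch comparing the two keys under simple hash routing; the shard count and traffic mix are made-up numbers for illustration:

```python
import hashlib
from collections import Counter

def shard_for(key, num_shards=8):
    """Hash-route a key to one of num_shards buckets."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_shards

# High-cardinality key: 100,000 distinct user IDs reach every shard.
by_user = Counter(shard_for(uid) for uid in range(100_000))

# Low-cardinality key: three country codes can reach at most three shards,
# and traffic concentrates wherever the big countries happen to land.
by_country = Counter(
    shard_for(c) for c in ['US'] * 80_000 + ['DE'] * 15_000 + ['NZ'] * 5_000
)

print(len(by_user), len(by_country))  # 8 shards used vs. at most 3
```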
Cross-shard queries are expensive. If you need to join users with orders and they’re on different shards, you’ll need application-level joins. Design your schema to minimize this.
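An application-level join is typically scatter-gather: query every shard, then merge in code. A sketch with shards mocked as in-memory dicts (the `AppLevelJoin` class and its sample rows are hypothetical):

```python
class AppLevelJoin:
    """Scatter-gather join: fetch matching rows from every shard,
    then join users to orders in application code."""

    def __init__(self, shards):
        self.shards = shards  # each shard mocked as {table_name: [rows]}

    def join_users_orders(self, user_ids):
        # Scatter: any shard may hold some of the users or orders.
        users, orders = {}, []
        for shard in self.shards:
            for row in shard['users']:
                if row['id'] in user_ids:
                    users[row['id']] = row
            for row in shard['orders']:
                if row['user_id'] in user_ids:
                    orders.append(row)
        # Gather: merge in the application -- work a single-node
        # database would have done for free.
        return [
            {**users[o['user_id']], 'order_id': o['id']}
            for o in orders if o['user_id'] in users
        ]

shards = [
    {'users': [{'id': 1, 'name': 'Ada'}], 'orders': [{'id': 10, 'user_id': 2}]},
    {'users': [{'id': 2, 'name': 'Lin'}], 'orders': [{'id': 11, 'user_id': 1}]},
]
print(AppLevelJoin(shards).join_users_orders({1, 2}))
```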
Monitor shard health continuously:
```python
class ShardMonitor:
    def check_distribution(self, sharding_strategy, sample_size=1000):
        """Detect imbalanced shards"""
        # In practice, sample real keys from production traffic;
        # a sequential range can mask (or exaggerate) hotspots
        distribution = {}
        for i in range(sample_size):
            shard = sharding_strategy.get_shard(i)
            shard_id = id(shard)
            distribution[shard_id] = distribution.get(shard_id, 0) + 1
        avg = sample_size / len(distribution)
        for shard_id, count in distribution.items():
            deviation = abs(count - avg) / avg
            if deviation > 0.2:  # More than 20% deviation
                print(f"Warning: shard {shard_id} has {deviation*100:.1f}% deviation")
        return distribution
```
Migration strategies: When rebalancing, run a dual-write period in which every write goes to both the old and new locations, then backfill historical data, then switch reads over, and only then delete the old copy. Never do big-bang migrations.
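That phased migration can be sketched as a small router. Here `old` and `new` stand in for real shard connections (plain dicts), and the `DualWriteRouter` class is a hypothetical illustration of the sequencing, not a production implementation:

```python
class DualWriteRouter:
    """Phased migration: dual-write, backfill, flip reads, retire old data."""

    def __init__(self, old, new):
        self.old, self.new = old, new
        self.read_from_new = False  # flipped only after backfill is verified

    def write(self, key, value):
        # Phase 1: every new write goes to both locations.
        self.old[key] = value
        self.new[key] = value

    def backfill(self):
        # Phase 2: copy historical rows the dual-write never touched.
        for key, value in self.old.items():
            self.new.setdefault(key, value)

    def read(self, key):
        # Phase 3: switch reads once old and new agree.
        store = self.new if self.read_from_new else self.old
        return store.get(key)

old = {'u1': 'alice'}          # pre-existing data on the old shard
router = DualWriteRouter(old, {})
router.write('u2', 'bob')      # dual-written to both locations
router.backfill()              # 'u1' copied to the new shard
router.read_from_new = True    # flip reads; old shard can now be retired
print(router.read('u1'), router.read('u2'))  # alice bob
```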
Choosing Your Strategy
Use range-based sharding when:
- You have naturally sequential data (time-series, logs)
- Range queries are common
- Your data growth is predictable
- You can tolerate manual rebalancing
Use consistent hashing when:
- You need dynamic scaling
- Data access is random (key-value lookups)
- You want to minimize operational overhead
- Uniform distribution is critical
Many production systems use hybrid approaches: range-based sharding by date for time-series data, with hash-based distribution within each range for user data. Instagram's ID generation, for example, embeds a timestamp alongside a logical shard number derived from the user ID, giving time-ordered IDs while still spreading users evenly across shards.
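One way to sketch such a hybrid router, with a made-up bucket count and simple modulo hashing standing in for a full consistent-hash ring:

```python
import hashlib
from datetime import date

def hybrid_shard(created: date, user_id: int, hash_buckets=4):
    """Route first by month (range dimension), then by hashed user ID
    within that month (hash dimension)."""
    time_bucket = created.strftime("%Y-%m")
    user_bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % hash_buckets
    return f"{time_bucket}:{user_bucket}"

# A time-range query only touches that month's buckets; within a month,
# users spread across the hash buckets.
print(hybrid_shard(date(2024, 3, 15), 42))  # '2024-03:' followed by a bucket 0-3
```

Because the time bucket comes first, a query for one month fans out to at most `hash_buckets` shards instead of all of them.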
The future is auto-sharding systems like Vitess (for MySQL) or Citus (for PostgreSQL) that handle distribution transparently. But understanding these fundamentals helps you debug issues, optimize queries, and make informed architecture decisions even when using managed solutions.