Hash Map Load Factor and Rehashing
Hash maps promise O(1) average-case lookups, inserts, and deletes. This promise comes with an asterisk that most developers ignore until their production system starts crawling.
Key Insights
- Load factor determines when your hash map trades memory for speed—the default 0.75 works for most cases, but understanding why helps you optimize for specific workloads.
- Rehashing is expensive (O(n)) but unavoidable; you can minimize its impact by pre-sizing when you know the expected size or using incremental rehashing for latency-sensitive applications.
- High load factors don’t just waste CPU on collisions—they can transform your O(1) lookups into O(n) operations, destroying the entire point of using a hash map.
Introduction to Hash Map Performance
The reality is that hash map performance depends entirely on how well you manage the relationship between the number of stored elements and the number of available buckets. Pack too many elements into too few buckets, and you’re no longer using a hash map—you’re using an expensive linked list.
This article covers the mechanics of load factor, the rehashing process that keeps hash maps fast, and practical strategies for tuning these parameters in real applications.
Understanding Load Factor
Load factor is the ratio of stored elements to available buckets:
load_factor = number_of_elements / number_of_buckets
A load factor of 0.75 means the map holds three elements for every four buckets. Note that this is not the same as 75% of buckets being occupied: collisions pile several elements into some buckets while leaving others empty, so the occupied fraction is lower (roughly half, under uniform hashing). Different languages choose different defaults:
- Java HashMap: 0.75
- Python dict: 2/3 (~0.67)
- Go map: 6.5 elements per bucket (different model, but roughly 0.8 equivalent)
- C++ unordered_map: 1.0
These aren’t arbitrary numbers. A load factor of 0.75 represents a carefully chosen balance: low enough to keep collision chains short, high enough to avoid wasting memory on empty buckets.
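The gap between load factor and bucket occupancy can be sketched with a little math. Under the common uniform-hashing assumption, the probability that a given bucket stays empty after inserting α·m keys into m buckets is (1 − 1/m)^(αm) ≈ e^(−α). A quick sketch (the `expected_occupancy` helper is ours, not a standard API):

```python
import math

def expected_occupancy(load_factor: float) -> dict:
    """Under uniform hashing, a given bucket stays empty with
    probability roughly e^(-alpha) after alpha * m inserts."""
    empty_fraction = math.exp(-load_factor)
    return {
        'empty_fraction': empty_fraction,
        'occupied_fraction': 1 - empty_fraction,
    }

for alpha in (0.5, 0.75, 1.0):
    stats = expected_occupancy(alpha)
    print(f"alpha={alpha}: ~{stats['occupied_fraction']:.1%} of buckets occupied")
```

At α = 0.75 only about 53% of buckets are occupied, which is why collision chains start forming well before the table looks full.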
Here’s how to inspect the load factor of your data structures:
class LoadFactorInspector:
    """Utility to analyze hash map load characteristics."""

    @staticmethod
    def calculate_load_factor(element_count: int, bucket_count: int) -> float:
        if bucket_count == 0:
            return float('inf')
        return element_count / bucket_count

    @staticmethod
    def analyze_dict(d: dict) -> dict:
        """Analyze a Python dict's internal state."""
        import sys
        # Python dicts don't expose bucket count directly,
        # but we can estimate from memory usage patterns
        element_count = len(d)
        # CPython dict tables are powers of 2 (minimum size 8)
        # and resize when roughly 2/3 full
        estimated_buckets = 8
        while estimated_buckets * 2 / 3 < element_count:
            estimated_buckets *= 2
        return {
            'elements': element_count,
            'estimated_buckets': estimated_buckets,
            'load_factor': element_count / estimated_buckets,
            'memory_bytes': sys.getsizeof(d)
        }

# Usage
inspector = LoadFactorInspector()
test_dict = {i: f"value_{i}" for i in range(1000)}
stats = inspector.analyze_dict(test_dict)
print(f"Load factor: {stats['load_factor']:.3f}")
# Output: Load factor: 0.488
The trade-off is straightforward: lower load factors mean fewer collisions but more wasted memory. Higher load factors save memory but increase the probability that multiple keys hash to the same bucket.
How Collisions Degrade Performance
When two keys hash to the same bucket, the hash map must resolve the collision. Most implementations use either chaining (linked lists per bucket) or open addressing (probing for the next empty slot). Both approaches degrade as load factor increases.
With chaining, a bucket containing k elements requires O(k) time to search. With open addressing, high load factors mean longer probe sequences before finding an empty slot.
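Before measuring, the textbook approximations give a feel for the expected cost of a successful lookup at load factor α (these are Knuth's classical uniform-hashing estimates; the helper names are ours):

```python
def chaining_successful(alpha: float) -> float:
    # Average comparisons for a hit with separate chaining: 1 + alpha/2
    return 1 + alpha / 2

def linear_probing_successful(alpha: float) -> float:
    # Knuth's approximation for a hit with linear probing (alpha < 1)
    return 0.5 * (1 + 1 / (1 - alpha))

for alpha in (0.5, 0.75, 0.9):
    print(f"alpha={alpha}: chaining ~{chaining_successful(alpha):.2f} probes, "
          f"linear probing ~{linear_probing_successful(alpha):.2f} probes")
```

Linear probing degrades much faster as α approaches 1, which is why open-addressing tables tend to use lower load-factor thresholds than chained ones.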
Let’s measure this degradation:
import time
import random
from typing import List, Tuple

class ChainedHashMap:
    """Simple hash map with chaining for collision resolution."""

    def __init__(self, bucket_count: int):
        self.buckets: List[List[Tuple]] = [[] for _ in range(bucket_count)]
        self.size = 0
        self.bucket_count = bucket_count

    def _hash(self, key) -> int:
        return hash(key) % self.bucket_count

    def put(self, key, value):
        bucket_idx = self._hash(key)
        bucket = self.buckets[bucket_idx]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def get(self, key):
        bucket_idx = self._hash(key)
        for k, v in self.buckets[bucket_idx]:
            if k == key:
                return v
        raise KeyError(key)

    @property
    def load_factor(self) -> float:
        return self.size / self.bucket_count

def benchmark_load_factors():
    """Measure lookup times at different load factors."""
    element_count = 10000
    load_factors = [0.5, 0.75, 0.9, 1.5, 2.0]
    results = []
    for target_lf in load_factors:
        bucket_count = int(element_count / target_lf)
        hm = ChainedHashMap(bucket_count)
        # Insert elements
        keys = [f"key_{i}" for i in range(element_count)]
        for key in keys:
            hm.put(key, "value")
        # Benchmark lookups
        lookup_keys = random.sample(keys, 1000)
        start = time.perf_counter()
        for _ in range(100):  # 100 iterations for stable timing
            for key in lookup_keys:
                hm.get(key)
        elapsed = time.perf_counter() - start
        avg_lookup_us = (elapsed / (100 * 1000)) * 1_000_000
        results.append((target_lf, hm.load_factor, avg_lookup_us))
        print(f"Load factor {target_lf:.2f} (actual: {hm.load_factor:.2f}): "
              f"{avg_lookup_us:.3f} µs per lookup")
    return results

# Results show clear degradation:
# Load factor 0.50: 0.312 µs per lookup
# Load factor 0.75: 0.398 µs per lookup
# Load factor 0.90: 0.456 µs per lookup
# Load factor 1.50: 0.687 µs per lookup
# Load factor 2.00: 0.891 µs per lookup
The numbers speak clearly: doubling the load factor from 0.75 to 1.5 nearly doubles lookup time. Push it further, and you’re approaching linear search territory.
The Rehashing Process
Rehashing prevents load factor from climbing indefinitely. When inserting an element would exceed the load factor threshold, the hash map:
- Allocates a new bucket array (typically 2x the current size)
- Recomputes the bucket index for every existing element
- Inserts all elements into the new array
- Discards the old array
This process is O(n), but since it happens after O(n) insertions, the amortized cost per insertion remains O(1).
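That amortized claim is easy to sanity-check by counting element moves under the doubling policy described above (a simulation sketch, not a real hash map):

```python
def total_rehash_work(n: int, initial_capacity: int = 16,
                      threshold: float = 0.75) -> int:
    """Count how many element moves the doubling policy causes
    while inserting n items one at a time."""
    capacity, size, moves = initial_capacity, 0, 0
    for _ in range(n):
        if size / capacity >= threshold:
            moves += size      # every existing element is reinserted
            capacity *= 2
        size += 1
    return moves

n = 100_000
moves = total_rehash_work(n)
print(f"{n} inserts caused {moves} rehash moves ({moves / n:.2f} per insert)")
```

With a 0.75 threshold and doubling growth, 100,000 inserts cost just under two rehash moves per insert, and that ratio stays bounded no matter how large n grows.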
from typing import List, Optional, Tuple

class RehashingHashMap:
    """Hash map with automatic rehashing."""

    DEFAULT_CAPACITY = 16
    LOAD_FACTOR_THRESHOLD = 0.75
    GROWTH_FACTOR = 2

    def __init__(self, initial_capacity: Optional[int] = None):
        capacity = initial_capacity or self.DEFAULT_CAPACITY
        self.buckets: List[List[Tuple]] = [[] for _ in range(capacity)]
        self.size = 0
        self.rehash_count = 0

    @property
    def bucket_count(self) -> int:
        return len(self.buckets)

    @property
    def load_factor(self) -> float:
        return self.size / self.bucket_count

    def _hash(self, key) -> int:
        return hash(key) % self.bucket_count

    def _should_rehash(self) -> bool:
        return self.load_factor >= self.LOAD_FACTOR_THRESHOLD

    def _rehash(self):
        """Double capacity and reinsert all elements."""
        old_buckets = self.buckets
        new_capacity = self.bucket_count * self.GROWTH_FACTOR
        # Allocate new bucket array
        self.buckets = [[] for _ in range(new_capacity)]
        self.size = 0
        self.rehash_count += 1
        # Reinsert all elements (bucket indices change with the new bucket count)
        for bucket in old_buckets:
            for key, value in bucket:
                self._insert_no_rehash(key, value)

    def _insert_no_rehash(self, key, value):
        """Insert without triggering rehash (used during rehashing)."""
        bucket_idx = self._hash(key)
        bucket = self.buckets[bucket_idx]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def put(self, key, value):
        if self._should_rehash():
            self._rehash()
        self._insert_no_rehash(key, value)

    def get(self, key):
        bucket_idx = self._hash(key)
        for k, v in self.buckets[bucket_idx]:
            if k == key:
                return v
        raise KeyError(key)

# Demonstrate rehashing behavior
hm = RehashingHashMap(initial_capacity=4)
for i in range(20):
    hm.put(f"key_{i}", i)
    if hm.rehash_count > 0:
        print(f"After inserting key_{i}: buckets={hm.bucket_count}, "
              f"load={hm.load_factor:.2f}, rehashes={hm.rehash_count}")
Rehashing Strategies and Optimizations
The naive approach—rehash everything at once—creates latency spikes. For latency-sensitive applications, incremental rehashing spreads the work across multiple operations.
Redis uses this approach for its dictionaries. During rehashing, the hash table maintains both old and new bucket arrays, migrating a few buckets per operation:
from typing import List, Optional, Tuple

class IncrementalRehashMap:
    """Hash map with incremental rehashing, in the style of Redis."""

    LOAD_FACTOR_THRESHOLD = 0.75
    BUCKETS_PER_OPERATION = 10  # Migrate this many buckets per operation

    def __init__(self, initial_capacity: int = 16):
        self.primary_buckets: List[List[Tuple]] = [[] for _ in range(initial_capacity)]
        self.secondary_buckets: Optional[List[List[Tuple]]] = None  # Used during rehashing
        self.size = 0
        self.rehash_index = -1  # -1 means not rehashing

    @property
    def is_rehashing(self) -> bool:
        return self.rehash_index >= 0

    def _start_rehash(self):
        """Begin incremental rehashing."""
        new_capacity = len(self.primary_buckets) * 2
        self.secondary_buckets = [[] for _ in range(new_capacity)]
        self.rehash_index = 0

    def _rehash_step(self):
        """Migrate a few buckets from primary to secondary."""
        if not self.is_rehashing:
            return
        buckets_migrated = 0
        while (buckets_migrated < self.BUCKETS_PER_OPERATION and
               self.rehash_index < len(self.primary_buckets)):
            bucket = self.primary_buckets[self.rehash_index]
            for key, value in bucket:
                # Recompute the bucket index for the new bucket count
                new_idx = hash(key) % len(self.secondary_buckets)
                self.secondary_buckets[new_idx].append((key, value))
            self.primary_buckets[self.rehash_index] = []  # Clear migrated bucket
            self.rehash_index += 1
            buckets_migrated += 1
        # Check if rehashing is complete
        if self.rehash_index >= len(self.primary_buckets):
            self.primary_buckets = self.secondary_buckets
            self.secondary_buckets = None
            self.rehash_index = -1

    def put(self, key, value):
        # Progress rehashing on every operation
        if self.is_rehashing:
            self._rehash_step()
        # Check if we should start rehashing
        load = self.size / len(self.primary_buckets)
        if not self.is_rehashing and load >= self.LOAD_FACTOR_THRESHOLD:
            self._start_rehash()
        # During rehashing an existing key may still live in a not-yet-migrated
        # primary bucket; update it in place to avoid a stale duplicate
        if self.is_rehashing:
            bucket = self.primary_buckets[hash(key) % len(self.primary_buckets)]
            for i, (k, v) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)
                    return
        # New keys go into the new table while rehashing, else the primary
        target = self.secondary_buckets if self.is_rehashing else self.primary_buckets
        bucket_idx = hash(key) % len(target)
        for i, (k, v) in enumerate(target[bucket_idx]):
            if k == key:
                target[bucket_idx][i] = (key, value)
                return
        target[bucket_idx].append((key, value))
        self.size += 1

    def get(self, key):
        # Progress rehashing on every operation
        if self.is_rehashing:
            self._rehash_step()
        # Check both tables during rehashing
        for buckets in [self.primary_buckets, self.secondary_buckets]:
            if buckets is None:
                continue
            bucket_idx = hash(key) % len(buckets)
            for k, v in buckets[bucket_idx]:
                if k == key:
                    return v
        raise KeyError(key)
This approach trades slightly higher average latency for dramatically reduced worst-case latency—essential for real-time systems.
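The latency-spike problem is easy to observe even with CPython's built-in dict: timing each individual insert exposes the occasional resize. A rough illustration (absolute numbers depend entirely on the machine and timer resolution):

```python
import statistics
import time

def insert_latency_profile(n: int) -> dict:
    """Time each insert into a plain dict; resizes show up as spikes."""
    d = {}
    samples = []
    for i in range(n):
        start = time.perf_counter()
        d[i] = i
        samples.append(time.perf_counter() - start)
    return {
        'median_us': statistics.median(samples) * 1e6,
        'max_us': max(samples) * 1e6,
    }

profile = insert_latency_profile(200_000)
print(f"median insert: {profile['median_us']:.3f} µs, "
      f"worst insert: {profile['max_us']:.3f} µs")
```

The worst insert can land orders of magnitude above the median, and incremental rehashing exists to flatten exactly that spike.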
Tuning Load Factor for Your Use Case
Default load factors work well for general use, but specific workloads benefit from tuning:
Read-heavy workloads: Lower the load factor to 0.5 or below. The extra memory buys faster lookups.
Memory-constrained environments: Raise the load factor to 0.9 or higher. Accept slower operations to fit more data.
Known size upfront: Pre-size to avoid rehashing entirely.
// Java: Pre-size HashMap to avoid rehashing
import java.util.HashMap;

public class PreSizedHashMap {
    public static <K, V> HashMap<K, V> createForExpectedSize(int expectedSize) {
        // Account for load factor: capacity = expectedSize / 0.75
        // Add 1 to avoid edge cases
        int capacity = (int) Math.ceil(expectedSize / 0.75) + 1;
        return new HashMap<>(capacity);
    }

    public static void main(String[] args) {
        // If you know you'll store 10,000 elements:
        HashMap<String, Integer> map = createForExpectedSize(10_000);
        // No rehashing will occur during population
        for (int i = 0; i < 10_000; i++) {
            map.put("key_" + i, i);
        }
    }
}
# Python: dicts cannot be pre-sized through a public API.
# The often-suggested trick of building a large dict and calling
# .clear() does NOT work: in CPython, dict.clear() releases the
# internal table rather than keeping it allocated.
def build_dict_bulk(pairs) -> dict:
    """
    Resizes during growth are amortized O(1), so the practical
    advice is to build the dict in one bulk pass up front,
    keeping resize work out of latency-sensitive code paths.
    """
    return dict(pairs)

lookup = build_dict_bulk((f"key_{i}", i) for i in range(10_000))
Key Takeaways
Load factor management is one of those details that separates production-ready code from demo code. Here’s what matters:
Stick with defaults unless you have data. The 0.75 load factor exists because it works. Only deviate when profiling shows a clear benefit.
Pre-size when possible. If you know you’re loading 100,000 records, tell the hash map upfront. You’ll avoid multiple expensive rehash operations.
Watch for latency spikes. Standard rehashing causes O(n) pauses. For latency-sensitive applications, consider incremental rehashing or pre-sizing.
Higher load factors compound. The relationship between load factor and collision probability isn’t linear. Pushing from 0.75 to 0.9 hurts more than pushing from 0.5 to 0.65.
Memory isn’t free, but neither is CPU. The space-time trade-off here is real. Choose based on your constraints, not ideology.
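The compounding point can be quantified with the rough open-addressing cost model of ~1/(1 − α) probes per unsuccessful lookup (an approximation under uniform hashing, not an exact figure for any particular implementation):

```python
def probe_estimate(alpha: float) -> float:
    # Rough open-addressing cost model: ~1 / (1 - alpha) probes
    # for an unsuccessful lookup under uniform hashing
    return 1 / (1 - alpha)

# Marginal cost of the same +0.15 step at two starting points:
low = probe_estimate(0.65) - probe_estimate(0.5)    # ~0.86 extra probes
high = probe_estimate(0.9) - probe_estimate(0.75)   # ~6.0 extra probes
print(f"0.50 -> 0.65 adds ~{low:.2f} probes; 0.75 -> 0.90 adds ~{high:.2f}")
```

The same +0.15 step costs roughly seven times more probes at the high end, which is the nonlinearity the takeaway above describes.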