Hash Map Collision Resolution: Chaining vs Open Addressing
Key Insights
- Chaining offers simpler implementation and handles high load factors gracefully, but suffers from poor cache locality due to pointer chasing across scattered memory locations.
- Open addressing provides better cache performance and lower memory overhead, but requires careful handling of deletions and degrades rapidly as load factor approaches 1.0.
- Modern language implementations often use hybrid approaches—Java’s HashMap converts long chains to trees past a threshold, Python uses open addressing with pseudo-random probing, and newer designs like Swiss Tables optimize for SIMD operations.
Why Collisions Are Inevitable
Every hash map implementation faces an uncomfortable mathematical reality: the pigeonhole principle guarantees collisions. If you’re mapping a potentially infinite key space into a finite array of buckets, multiple keys will eventually hash to the same index. There’s no escaping this.
A hash function transforms your key into an array index:
def simple_hash(key: str, table_size: int) -> int:
    return sum(ord(c) for c in key) % table_size

# Demonstration of inevitable collision
table_size = 10
print(f"'abc' hashes to: {simple_hash('abc', table_size)}")  # 294 % 10 = 4
print(f"'cba' hashes to: {simple_hash('cba', table_size)}")  # 294 % 10 = 4
print(f"'bac' hashes to: {simple_hash('bac', table_size)}")  # 294 % 10 = 4
All three strings produce identical hashes. Your collision resolution strategy determines whether your hash map degrades gracefully or collapses into O(n) performance. The two principal approaches—chaining and open addressing—make fundamentally different tradeoffs that affect everything from memory usage to cache performance.
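Collisions also arrive far sooner than intuition suggests. By the birthday bound, the probability that n uniformly distributed keys collide in m buckets passes 50% when n is only about √(2m·ln 2). A quick calculation makes this concrete:

```python
def collision_probability(n: int, m: int) -> float:
    """P(at least one collision) among n uniformly hashed keys in m buckets."""
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (m - i) / m  # i-th key must miss all i occupied buckets
    return 1 - p_no_collision

# With 1,000 buckets, just 38 keys already collide more often than not
print(f"{collision_probability(38, 1000):.3f}")  # > 0.5
```

Even a perfectly uniform hash function cannot avoid this; resolution strategy, not hash quality alone, determines behavior.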
Chaining (Separate Chaining)
Chaining is conceptually straightforward: each bucket contains a secondary data structure (typically a linked list) that holds all key-value pairs mapping to that index. When a collision occurs, you simply append to the chain.
class ChainedHashMap:
    def __init__(self, capacity: int = 16):
        self.capacity = capacity
        self.size = 0
        self.buckets: list[list[tuple]] = [[] for _ in range(capacity)]

    def _hash(self, key) -> int:
        return hash(key) % self.capacity

    def put(self, key, value) -> None:
        index = self._hash(key)
        bucket = self.buckets[index]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def get(self, key):
        index = self._hash(key)
        for k, v in self.buckets[index]:
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key) -> None:
        index = self._hash(key)
        bucket = self.buckets[index]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                self.size -= 1
                return
        raise KeyError(key)
Advantages of chaining:
- Simple implementation with no complex probing logic
- Load factor can exceed 1.0 without catastrophic failure
- Deletion is trivial—just remove the node
- No clustering effects that plague open addressing
Disadvantages:
- Memory overhead from list/node structures (pointers, allocation metadata)
- Cache-unfriendly: following pointers scatters memory access patterns
- Each node typically requires a separate heap allocation (avoidable with memory pools)
The cache penalty is often underestimated. Modern CPUs fetch memory in cache lines (typically 64 bytes). Chasing pointers to scattered heap locations defeats prefetching and causes cache misses on nearly every comparison in a chain.
Open Addressing Fundamentals
Open addressing eliminates the secondary data structure entirely. All entries live directly in the hash table array. When a collision occurs, you probe for the next available slot according to a deterministic sequence.
Linear probing checks consecutive slots: h(k), h(k)+1, h(k)+2, ...
Quadratic probing uses quadratic increments: h(k), h(k)+1², h(k)+2², ...
Double hashing uses a second hash function: h(k), h(k)+h₂(k), h(k)+2*h₂(k), ...
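The three sequences differ only in their increment. A small side-by-side sketch, using made-up hash values (h = 3, h₂ = 4) and a prime table size so that double hashing visits distinct slots:

```python
def linear(h, i, m):
    return (h + i) % m

def quadratic(h, i, m):
    return (h + i * i) % m

def double_hash(h, h2, i, m):
    return (h + i * h2) % m

m = 11        # prime table size keeps double hashing's cycle full-length
h, h2 = 3, 4  # example primary and secondary hash values for one key

print([linear(h, i, m) for i in range(5)])           # [3, 4, 5, 6, 7]
print([quadratic(h, i, m) for i in range(5)])        # [3, 4, 7, 1, 8]
print([double_hash(h, h2, i, m) for i in range(5)])  # [3, 7, 0, 4, 8]
```

Note how linear probing walks a consecutive run of slots (the root of primary clustering), while the other two scatter their probes.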
class OpenAddressedHashMap:
    EMPTY = object()
    DELETED = object()  # Tombstone marker

    def __init__(self, capacity: int = 16):
        self.capacity = capacity
        self.size = 0
        self.keys = [self.EMPTY] * capacity
        self.values = [None] * capacity

    def _hash(self, key) -> int:
        return hash(key) % self.capacity

    def _probe(self, key):
        """Linear probing generator."""
        index = self._hash(key)
        for i in range(self.capacity):
            yield (index + i) % self.capacity

    def put(self, key, value) -> None:
        if self.size >= self.capacity * 0.7:
            self._resize()
        # Scan the full probe chain before inserting: writing into the first
        # tombstone blindly could leave a duplicate of a key that is stored
        # later in the chain.
        first_tombstone = None
        for index in self._probe(key):
            slot = self.keys[index]
            if slot is self.EMPTY:
                # Key is absent; reuse the earliest tombstone if we saw one
                target = index if first_tombstone is None else first_tombstone
                self.keys[target] = key
                self.values[target] = value
                self.size += 1
                return
            if slot is self.DELETED:
                if first_tombstone is None:
                    first_tombstone = index
            elif slot == key:
                self.values[index] = value
                return
        # Probe chain exhausted (table clogged with tombstones): rebuild
        self._resize()
        self.put(key, value)

    def get(self, key):
        for index in self._probe(key):
            if self.keys[index] is self.EMPTY:
                raise KeyError(key)
            if self.keys[index] == key:
                return self.values[index]
        raise KeyError(key)

    def delete(self, key) -> None:
        for index in self._probe(key):
            if self.keys[index] is self.EMPTY:
                raise KeyError(key)
            if self.keys[index] == key:
                self.keys[index] = self.DELETED  # Tombstone
                self.values[index] = None
                self.size -= 1
                return
        raise KeyError(key)

    def _resize(self):
        old_keys, old_values = self.keys, self.values
        self.capacity *= 2
        self.keys = [self.EMPTY] * self.capacity
        self.values = [None] * self.capacity
        self.size = 0
        for key, value in zip(old_keys, old_values):
            if key is not self.EMPTY and key is not self.DELETED:
                self.put(key, value)
The tombstone mechanism deserves attention. You can’t simply mark a slot as empty on deletion—doing so would break probe sequences for keys inserted after the deleted one. Tombstones preserve probe chain integrity but accumulate over time, degrading performance. Periodic rehashing clears them.
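The hazard is easy to reproduce with a toy linear-probing table (integer keys hashed by value, a simplification of the class above): deleting a key that sits on another key's probe path must leave a tombstone, or the later key becomes unreachable.

```python
EMPTY, DELETED = object(), object()
table = [EMPTY] * 8

def put(k):
    i = k % 8
    while table[i] is not EMPTY and table[i] is not DELETED:
        i = (i + 1) % 8
    table[i] = k

def get(k):
    i = k % 8
    while table[i] is not EMPTY:   # a tombstone does NOT stop the probe
        if table[i] == k:
            return i
        i = (i + 1) % 8
    raise KeyError(k)

def delete(k):
    table[get(k)] = DELETED        # leave a tombstone, not EMPTY

put(3); put(11)   # both hash to slot 3, so 11 lands in slot 4
delete(3)         # slot 3 becomes a tombstone
print(get(11))    # 4: the probe walks over the tombstone and finds 11
```

Had delete written EMPTY instead, get(11) would stop at slot 3 and raise KeyError even though 11 is still in the table.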
Linear probing suffers from primary clustering: consecutive occupied slots form clusters that grow larger over time. Quadratic probing reduces this but introduces secondary clustering where keys with the same initial hash follow identical probe sequences. Double hashing eliminates both but requires computing a second hash.
Performance Comparison
Load factor (α = n/m, where n is entries and m is capacity) dramatically affects performance differently for each strategy.
import time
import random
import string

def benchmark_load_factors():
    """Compare lookup times at various load factors."""
    results = []
    for load_factor in [0.3, 0.5, 0.7, 0.9]:
        capacity = 10000
        num_entries = int(capacity * load_factor)

        # Generate random keys
        keys = [''.join(random.choices(string.ascii_letters, k=10))
                for _ in range(num_entries)]

        # Benchmark chaining
        chained = ChainedHashMap(capacity)
        for key in keys:
            chained.put(key, key)
        start = time.perf_counter()
        for key in keys:
            chained.get(key)
        chained_time = time.perf_counter() - start

        # Benchmark open addressing
        open_addr = OpenAddressedHashMap(capacity)
        for key in keys:
            open_addr.put(key, key)
        start = time.perf_counter()
        for key in keys:
            open_addr.get(key)
        open_time = time.perf_counter() - start

        results.append((load_factor, chained_time, open_time))
        print(f"Load {load_factor}: Chaining={chained_time:.4f}s, Open={open_time:.4f}s")
    return results
Time complexity:
- Chaining: O(1 + α) average, O(n) worst case
- Open addressing: under uniform hashing, roughly 1/(1−α) expected probes for an unsuccessful search and (1/α)·ln(1/(1−α)) for a successful one; both degrade sharply as α → 1
Open addressing’s cache-friendly sequential memory access often beats chaining at low load factors despite similar theoretical complexity. However, as load factor increases, open addressing’s probe sequences lengthen dramatically while chaining’s chains grow linearly.
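The classic uniform-hashing estimates (due to Knuth; an idealization that linear probing only approximates, since clustering makes it somewhat worse) show how abruptly the degradation sets in:

```python
from math import log

# Expected probe counts under uniform hashing at various load factors
for a in (0.3, 0.5, 0.7, 0.9):
    miss = 1 / (1 - a)                # unsuccessful search (key absent)
    hit = (1 / a) * log(1 / (1 - a))  # successful search (key present)
    print(f"load {a}: hit ~{hit:.2f} probes, miss ~{miss:.2f} probes")
```

At α = 0.5 a miss costs about 2 probes; at α = 0.9 it costs about 10. This is why practical open-addressed tables resize well before the load factor nears 1.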
Real-World Implementations
Production hash maps employ sophisticated optimizations beyond basic algorithms.
Java HashMap uses chaining but converts chains to balanced trees when a bucket exceeds 8 entries (and the table has at least 64 buckets):
// From Java's HashMap source (simplified)
static final int TREEIFY_THRESHOLD = 8;
static final int UNTREEIFY_THRESHOLD = 6;
static final int MIN_TREEIFY_CAPACITY = 64;
// When adding to a bucket:
if (binCount >= TREEIFY_THRESHOLD - 1) {
    treeifyBin(tab, hash); // Convert linked list to red-black tree
}
This hybrid approach caps worst-case lookup at O(log n) even with pathological hash distributions.
Python’s dict uses open addressing with a pseudo-random probing sequence. The perturbation-based probing reduces clustering:
# Conceptual representation of CPython's probing
def python_probe(hash_value, mask):
    perturb = hash_value
    index = hash_value & mask
    while True:
        yield index
        perturb >>= 5
        index = (index * 5 + perturb + 1) & mask
Go’s map uses chaining with fixed-size buckets (8 entries each) that overflow into linked overflow buckets. This balances cache efficiency with chaining’s flexibility.
Rust’s HashMap uses Swiss Tables (hashbrown crate), which leverages SIMD instructions to probe multiple slots simultaneously using metadata bytes storing partial hashes.
Choosing the Right Strategy
Your choice should depend on concrete constraints:
Prefer chaining when:
- Deletion frequency is high (tombstone accumulation hurts open addressing)
- Load factor unpredictability is high
- Key/value pairs are large (pointer overhead becomes proportionally smaller)
- You need worst-case guarantees (with tree fallback like Java)
Prefer open addressing when:
- Memory is constrained
- Cache performance is critical
- Load factor stays below 0.7
- Deletions are rare or you can batch rehashing
Modern optimizations worth knowing:
Robin Hood hashing reduces probe length variance by “stealing” slots from keys with shorter probe distances. This bounds the maximum probe length.
Swiss Tables (used in Abseil, Rust) store metadata separately, enabling SIMD-parallel probing. A 16-byte metadata group can be searched in a single instruction.
Hopscotch hashing constrains entries to a neighborhood of their home bucket, maintaining cache locality while allowing higher load factors.
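To make the Robin Hood idea concrete, here is a minimal insertion-only sketch (assumptions: integer keys with a toy identity hash, fixed capacity, no resize or duplicate handling):

```python
def robin_hood_insert(keys, key, capacity):
    """Insert `key` into open-addressed list `keys` (None = empty),
    displacing residents that sit closer to their home slot."""
    index = key % capacity  # toy hash for integer keys
    dist = 0                # how far `key` has probed from its home slot
    while True:
        resident = keys[index]
        if resident is None:
            keys[index] = key
            return
        resident_dist = (index - resident % capacity) % capacity
        if resident_dist < dist:
            # Resident is "richer" (shorter probe distance): steal its slot
            # and continue inserting the displaced resident instead.
            keys[index], key = key, resident
            dist = resident_dist
        index = (index + 1) % capacity
        dist += 1

table = [None] * 8
for k in (0, 8, 1, 16):  # 0, 8, and 16 all hash to slot 0
    robin_hood_insert(table, k, 8)
print(table)  # [0, 8, 16, 1, None, None, None, None]
```

When 16 arrives it probes to slot 2, finds key 1 sitting only one slot from home while 16 has already traveled two, and evicts it; the displaced 1 then settles in slot 3. The effect is to equalize probe distances rather than let unlucky keys drift far from home.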
Conclusion
Chaining and open addressing represent a fundamental tradeoff between implementation simplicity and memory efficiency. Chaining handles adversarial conditions gracefully but pays for it with pointer overhead and cache misses. Open addressing squeezes more performance from cache-friendly access patterns but demands careful load factor management and complicates deletion.
The right choice depends on your specific workload. If you’re uncertain, start with your language’s built-in implementation—they’re heavily optimized and battle-tested. Only roll your own when profiling reveals the hash map as a genuine bottleneck, and even then, consider established libraries like Abseil’s Swiss Tables or hashbrown before implementing from scratch.
Profile first. The theoretical differences often matter less than you’d expect when dominated by hash function cost or key comparison overhead.