Hash Map Collision Resolution: Chaining vs Open Addressing
Key Insights
- Chaining offers simpler implementation and handles high load factors gracefully, but suffers from poor cache locality due to pointer chasing across scattered memory locations.
- Open addressing provides better cache performance and lower memory overhead, but requires careful handling of deletions and degrades rapidly as load factor approaches 1.0.
- Modern language implementations often use hybrid approaches—Java’s HashMap converts long chains to trees past a threshold, Python uses open addressing with pseudo-random probing, and newer designs like Swiss Tables optimize for SIMD operations.
Why Collisions Are Inevitable
Every hash map implementation faces an uncomfortable mathematical reality: the pigeonhole principle guarantees collisions. If you’re mapping a potentially infinite key space into a finite array of buckets, multiple keys will eventually hash to the same index. There’s no escaping this.
A hash function transforms your key into an array index:
def simple_hash(key: str, table_size: int) -> int:
    return sum(ord(c) for c in key) % table_size

# Demonstration of inevitable collision
table_size = 10
print(f"'abc' hashes to: {simple_hash('abc', table_size)}")  # 294 % 10 = 4
print(f"'cba' hashes to: {simple_hash('cba', table_size)}")  # 294 % 10 = 4
print(f"'bac' hashes to: {simple_hash('bac', table_size)}")  # 294 % 10 = 4
All three strings produce identical hashes. Your collision resolution strategy determines whether your hash map degrades gracefully or collapses into O(n) performance. The two principal approaches—chaining and open addressing—make fundamentally different tradeoffs that affect everything from memory usage to cache performance.
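Collisions also arrive far sooner than intuition suggests. By the birthday bound, the probability that n uniformly distributed keys collide in m buckets passes 50% when n is only about √(2m·ln 2). A quick calculation makes this concrete:

```python
def collision_probability(n: int, m: int) -> float:
    """P(at least one collision) among n uniformly hashed keys in m buckets."""
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (m - i) / m  # i-th key must miss all i occupied buckets
    return 1 - p_no_collision

# With 1,000 buckets, just 38 keys already collide more often than not
print(f"{collision_probability(38, 1000):.3f}")  # > 0.5
```

Even a perfectly uniform hash function cannot avoid this; resolution strategy, not hash quality alone, determines behavior.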
Chaining (Separate Chaining)
Chaining is conceptually straightforward: each bucket contains a secondary data structure (typically a linked list) that holds all key-value pairs mapping to that index. When a collision occurs, you simply append to the chain.
class ChainedHashMap:
    def __init__(self, capacity: int = 16):
        self.capacity = capacity
        self.size = 0
        self.buckets: list[list[tuple]] = [[] for _ in range(capacity)]

    def _hash(self, key) -> int:
        return hash(key) % self.capacity

    def put(self, key, value) -> None:
        index = self._hash(key)
        bucket = self.buckets[index]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def get(self, key):
        index = self._hash(key)
        for k, v in self.buckets[index]:
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key) -> None:
        index = self._hash(key)
        bucket = self.buckets[index]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                self.size -= 1
                return
        raise KeyError(key)
Advantages of chaining:
- Simple implementation with no complex probing logic
- Load factor can exceed 1.0 without catastrophic failure
- Deletion is trivial—just remove the node
- No clustering effects that plague open addressing
Disadvantages:
- Memory overhead from list/node structures (pointers, allocation metadata)
- Cache-unfriendly: following pointers scatters memory access patterns
- Each node typically requires a separate heap allocation (avoidable with memory pools)
The cache penalty is often underestimated. Modern CPUs fetch memory in cache lines (typically 64 bytes). Chasing pointers to scattered heap locations defeats prefetching and causes cache misses on nearly every comparison in a chain.
Open Addressing Fundamentals
Open addressing eliminates the secondary data structure entirely. All entries live directly in the hash table array. When a collision occurs, you probe for the next available slot according to a deterministic sequence.
Linear probing checks consecutive slots: h(k), h(k)+1, h(k)+2, ...
Quadratic probing uses quadratic increments: h(k), h(k)+1², h(k)+2², ...
Double hashing uses a second hash function: h(k), h(k)+h₂(k), h(k)+2*h₂(k), ...
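The three sequences differ only in their increment. A small side-by-side sketch, using made-up hash values (h = 3, h₂ = 4) and a prime table size so that double hashing visits distinct slots:

```python
def linear(h, i, m):
    return (h + i) % m

def quadratic(h, i, m):
    return (h + i * i) % m

def double_hash(h, h2, i, m):
    return (h + i * h2) % m

m = 11        # prime table size keeps double hashing's cycle full-length
h, h2 = 3, 4  # example primary and secondary hash values for one key

print([linear(h, i, m) for i in range(5)])           # [3, 4, 5, 6, 7]
print([quadratic(h, i, m) for i in range(5)])        # [3, 4, 7, 1, 8]
print([double_hash(h, h2, i, m) for i in range(5)])  # [3, 7, 0, 4, 8]
```

Note how linear probing walks a consecutive run of slots (the root of primary clustering), while the other two scatter their probes.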
class OpenAddressedHashMap:
    EMPTY = object()
    DELETED = object()  # Tombstone marker

    def __init__(self, capacity: int = 16):
        self.capacity = capacity
        self.size = 0
        self.keys = [self.EMPTY] * capacity
        self.values = [None] * capacity

    def _hash(self, key) -> int:
        return hash(key) % self.capacity

    def _probe(self, key):
        """Linear probing generator."""
        index = self._hash(key)
        for i in range(self.capacity):
            yield (index + i) % self.capacity

    def put(self, key, value) -> None:
        if self.size >= self.capacity * 0.7:
            self._resize()
        # Scan the full probe chain before inserting: writing into the first
        # tombstone blindly could leave a duplicate of a key that is stored
        # later in the chain.
        first_tombstone = None
        for index in self._probe(key):
            slot = self.keys[index]
            if slot is self.EMPTY:
                # Key is absent; reuse the earliest tombstone if we saw one
                target = index if first_tombstone is None else first_tombstone
                self.keys[target] = key
                self.values[target] = value
                self.size += 1
                return
            if slot is self.DELETED:
                if first_tombstone is None:
                    first_tombstone = index
            elif slot == key:
                self.values[index] = value
                return
        # Probe chain exhausted (table clogged with tombstones): rebuild
        self._resize()
        self.put(key, value)

    def get(self, key):
        for index in self._probe(key):
            if self.keys[index] is self.EMPTY:
                raise KeyError(key)
            if self.keys[index] == key:
                return self.values[index]
        raise KeyError(key)

    def delete(self, key) -> None:
        for index in self._probe(key):
            if self.keys[index] is self.EMPTY:
                raise KeyError(key)
            if self.keys[index] == key:
                self.keys[index] = self.DELETED  # Tombstone
                self.values[index] = None
                self.size -= 1
                return
        raise KeyError(key)

    def _resize(self):
        old_keys, old_values = self.keys, self.values
        self.capacity *= 2
        self.keys = [self.EMPTY] * self.capacity
        self.values = [None] * self.capacity
        self.size = 0
        for key, value in zip(old_keys, old_values):
            if key is not self.EMPTY and key is not self.DELETED:
                self.put(key, value)
The tombstone mechanism deserves attention. You can’t simply mark a slot as empty on deletion—doing so would break probe sequences for keys inserted after the deleted one. Tombstones preserve probe chain integrity but accumulate over time, degrading performance. Periodic rehashing clears them.
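The hazard is easy to reproduce with a toy linear-probing table (integer keys hashed by value, a simplification of the class above): deleting a key that sits on another key's probe path must leave a tombstone, or the later key becomes unreachable.

```python
EMPTY, DELETED = object(), object()
table = [EMPTY] * 8

def put(k):
    i = k % 8
    while table[i] is not EMPTY and table[i] is not DELETED:
        i = (i + 1) % 8
    table[i] = k

def get(k):
    i = k % 8
    while table[i] is not EMPTY:   # a tombstone does NOT stop the probe
        if table[i] == k:
            return i
        i = (i + 1) % 8
    raise KeyError(k)

def delete(k):
    table[get(k)] = DELETED        # leave a tombstone, not EMPTY

put(3); put(11)   # both hash to slot 3, so 11 lands in slot 4
delete(3)         # slot 3 becomes a tombstone
print(get(11))    # 4: the probe walks over the tombstone and finds 11
```

Had delete written EMPTY instead, get(11) would stop at slot 3 and raise KeyError even though 11 is still in the table.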
Linear probing suffers from primary clustering: consecutive occupied slots form clusters that grow larger over time. Quadratic probing reduces this but introduces secondary clustering where keys with the same initial hash follow identical probe sequences. Double hashing eliminates both but requires computing a second hash.
Performance Comparison
Load factor (α = n/m, where n is entries and m is capacity) dramatically affects performance differently for each strategy.
import time
import random
import string

def benchmark_load_factors():
    """Compare lookup times at various load factors."""
    results = []
    for load_factor in [0.3, 0.5, 0.7, 0.9]:
        capacity = 10000
        num_entries = int(capacity * load_factor)

        # Generate random keys
        keys = [''.join(random.choices(string.ascii_letters, k=10))
                for _ in range(num_entries)]

        # Benchmark chaining
        chained = ChainedHashMap(capacity)
        for key in keys:
            chained.put(key, key)
        start = time.perf_counter()
        for key in keys:
            chained.get(key)
        chained_time = time.perf_counter() - start

        # Benchmark open addressing
        open_addr = OpenAddressedHashMap(capacity)
        for key in keys:
            open_addr.put(key, key)
        start = time.perf_counter()
        for key in keys:
            open_addr.get(key)
        open_time = time.perf_counter() - start

        results.append((load_factor, chained_time, open_time))
        print(f"Load {load_factor}: Chaining={chained_time:.4f}s, Open={open_time:.4f}s")
    return results
Time complexity:
- Chaining: O(1 + α) average, O(n) worst case
- Open addressing: under uniform hashing, roughly 1/(1−α) expected probes for an unsuccessful search and (1/α)·ln(1/(1−α)) for a successful one; both degrade sharply as α → 1
Open addressing’s cache-friendly sequential memory access often beats chaining at low load factors despite similar theoretical complexity. However, as load factor increases, open addressing’s probe sequences lengthen dramatically while chaining’s chains grow linearly.
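The classic uniform-hashing estimates (due to Knuth; an idealization that linear probing only approximates, since clustering makes it somewhat worse) show how abruptly the degradation sets in:

```python
from math import log

# Expected probe counts under uniform hashing at various load factors
for a in (0.3, 0.5, 0.7, 0.9):
    miss = 1 / (1 - a)                # unsuccessful search (key absent)
    hit = (1 / a) * log(1 / (1 - a))  # successful search (key present)
    print(f"load {a}: hit ~{hit:.2f} probes, miss ~{miss:.2f} probes")
```

At α = 0.5 a miss costs about 2 probes; at α = 0.9 it costs about 10. This is why practical open-addressed tables resize well before the load factor nears 1.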
Real-World Implementations
Production hash maps employ sophisticated optimizations beyond basic algorithms.
Java HashMap uses chaining but converts chains to balanced trees when a bucket exceeds 8 entries (and the table has at least 64 buckets):
// From Java's HashMap source (simplified)
static final int TREEIFY_THRESHOLD = 8;
static final int UNTREEIFY_THRESHOLD = 6;
static final int MIN_TREEIFY_CAPACITY = 64;
// When adding to a bucket:
if (binCount >= TREEIFY_THRESHOLD - 1) {
    treeifyBin(tab, hash); // Convert linked list to red-black tree
}
This hybrid approach caps worst-case lookup at O(log n) even with pathological hash distributions.
Python’s dict uses open addressing with a pseudo-random probing sequence. The perturbation-based probing reduces clustering:
# Conceptual representation of CPython's probing
def python_probe(hash_value, mask):
    perturb = hash_value
    index = hash_value & mask
    while True:
        yield index
        perturb >>= 5
        index = (index * 5 + perturb + 1) & mask
Go’s map uses chaining with fixed-size buckets (8 entries each) that overflow into linked overflow buckets. This balances cache efficiency with chaining’s flexibility.
Rust’s HashMap uses Swiss Tables (hashbrown crate), which leverages SIMD instructions to probe multiple slots simultaneously using metadata bytes storing partial hashes.
Choosing the Right Strategy
Your choice should depend on concrete constraints:
Prefer chaining when:
- Deletion frequency is high (tombstone accumulation hurts open addressing)
- Load factor unpredictability is high
- Key/value pairs are large (pointer overhead becomes proportionally smaller)
- You need worst-case guarantees (with tree fallback like Java)
Prefer open addressing when:
- Memory is constrained
- Cache performance is critical
- Load factor stays below 0.7
- Deletions are rare or you can batch rehashing
Modern optimizations worth knowing:
Robin Hood hashing reduces probe length variance by “stealing” slots from keys with shorter probe distances. This bounds the maximum probe length.
Swiss Tables (used in Abseil, Rust) store metadata separately, enabling SIMD-parallel probing. A 16-byte metadata group can be searched in a single instruction.
Hopscotch hashing constrains entries to a neighborhood of their home bucket, maintaining cache locality while allowing higher load factors.
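To make the Robin Hood idea concrete, here is a minimal insertion-only sketch (assumptions: integer keys with a toy identity hash, fixed capacity, no resize or duplicate handling):

```python
def robin_hood_insert(keys, key, capacity):
    """Insert `key` into open-addressed list `keys` (None = empty),
    displacing residents that sit closer to their home slot."""
    index = key % capacity  # toy hash for integer keys
    dist = 0                # how far `key` has probed from its home slot
    while True:
        resident = keys[index]
        if resident is None:
            keys[index] = key
            return
        resident_dist = (index - resident % capacity) % capacity
        if resident_dist < dist:
            # Resident is "richer" (shorter probe distance): steal its slot
            # and continue inserting the displaced resident instead.
            keys[index], key = key, resident
            dist = resident_dist
        index = (index + 1) % capacity
        dist += 1

table = [None] * 8
for k in (0, 8, 1, 16):  # 0, 8, and 16 all hash to slot 0
    robin_hood_insert(table, k, 8)
print(table)  # [0, 8, 16, 1, None, None, None, None]
```

When 16 arrives it probes to slot 2, finds key 1 sitting only one slot from home while 16 has already traveled two, and evicts it; the displaced 1 then settles in slot 3. The effect is to equalize probe distances rather than let unlucky keys drift far from home.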
Conclusion
Chaining and open addressing represent a fundamental tradeoff between implementation simplicity and memory efficiency. Chaining handles adversarial conditions gracefully but pays for it with pointer overhead and cache misses. Open addressing squeezes more performance from cache-friendly access patterns but demands careful load factor management and complicates deletion.
The right choice depends on your specific workload. If you’re uncertain, start with your language’s built-in implementation—they’re heavily optimized and battle-tested. Only roll your own when profiling reveals the hash map as a genuine bottleneck, and even then, consider established libraries like Abseil’s Swiss Tables or hashbrown before implementing from scratch.
Profile first. The theoretical differences often matter less than you’d expect when dominated by hash function cost or key comparison overhead.