Perfect Hashing: Minimal Perfect Hash Functions
Key Insights
- Minimal perfect hash functions map n keys to exactly n consecutive integers with zero collisions, enabling O(1) lookups without collision handling overhead—but only work for static, known key sets.
- The CHD algorithm achieves near-optimal space (2-3 bits per key) through a clever two-level scheme using displacement values, making it practical for millions of keys.
- Choose MPHFs when you have a fixed key set and need maximum lookup performance; avoid them when keys change frequently or you need to detect non-existent keys without additional structures.
Introduction to Perfect Hashing
Every developer who’s implemented a hash table knows the pain of collisions. Two different keys hash to the same bucket, and suddenly you’re dealing with chaining, probing, or some other resolution strategy that adds complexity and degrades performance.
But what if you could eliminate collisions entirely?
A perfect hash function guarantees no collisions for a specific, known set of keys. A minimal perfect hash function (MPHF) goes further: it maps n keys to exactly n consecutive integers (0 to n-1), wasting no space in the output range.
This isn’t theoretical elegance for its own sake. MPHFs power real systems where lookup performance is critical and the key set is static:
- Compiler symbol tables: Keywords and built-in functions don’t change between compilations
- Database indexes: Static dimension tables in data warehouses
- DNS resolution: Mapping domain names to IP addresses in high-performance resolvers
- Genomic analysis: Indexing k-mers (DNA subsequences) where the set is determined by the reference genome
The tradeoff is fundamental: you sacrifice the ability to add or remove keys in exchange for perfect, collision-free lookups.
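To make the definition concrete: for a tiny key set, you can find a minimal perfect hash by brute force, trying seeds until an ordinary hash function happens to map the keys bijectively onto 0..n-1. A minimal sketch using CRC32 as the underlying hash (`find_perfect_seed` is a hypothetical helper name, not a standard API):

```python
import zlib

def find_perfect_seed(keys, max_tries=1_000_000):
    """Brute-force search: find a seed that makes CRC32 a minimal
    perfect hash (a bijection onto 0..n-1) for this exact key set."""
    n = len(keys)
    for seed in range(max_tries):
        positions = {zlib.crc32(k.encode(), seed) % n for k in keys}
        if len(positions) == n:  # n distinct values in 0..n-1: a bijection
            return seed
    raise RuntimeError("no perfect seed found")

keywords = ["if", "else", "while", "for", "return"]
seed = find_perfect_seed(keywords)

# Each keyword now owns exactly one slot in a table of size n
slots = [None] * len(keywords)
for kw in keywords:
    slots[zlib.crc32(kw.encode(), seed) % len(keywords)] = kw
print(all(s is not None for s in slots))  # True
```

This only works for tiny sets: the expected number of attempts blows up rapidly with n, which is exactly why dedicated constructions like CHD exist.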
How Minimal Perfect Hash Functions Work
The core insight behind MPHFs is that we’re solving an easier problem than general hashing. We know all keys upfront, so we can spend time during construction to find a function that perfectly distributes them.
Most MPHF algorithms use a two-level approach:
- First-level hash: Partition keys into buckets
- Second-level function: Assign final positions within or across buckets using precomputed auxiliary data
The magic happens in how we compute and store that auxiliary data. Graph-based approaches (like CHD) model the problem as finding an assignment that satisfies constraints. Fingerprint-based approaches use compact bit vectors to encode position information.
Here’s what MPHF lookup looks like in practice—notice the complete absence of collision handling:
```python
class MinimalPerfectHashTable:
    def __init__(self, mphf, values):
        self.mphf = mphf        # The minimal perfect hash function
        self.values = values    # Values stored at positions 0..n-1

    def get(self, key):
        # O(1) lookup: compute position, return value
        # No chains to walk, no probing sequences
        position = self.mphf.lookup(key)
        return self.values[position]

# Compare to standard hash table lookup:
def get_standard_hashtable(self, key):
    bucket = hash(key) % len(self.buckets)
    # Now we must handle collisions...
    for stored_key, value in self.buckets[bucket]:
        if stored_key == key:
            return value
    return None
```
The MPHF version has predictable, constant-time performance. No worst-case degradation when buckets get crowded.
The CHD Algorithm
The Compress, Hash, and Displace (CHD) algorithm, introduced by Belazzougui, Botelho, and Dietzfelbinger in 2009, remains one of the most practical MPHF constructions. It achieves excellent space efficiency (around 2.07 bits per key) with fast lookups.
Here’s how it works:
Step 1: Bucket Assignment
Hash each key to one of r buckets (typically r ≈ n/5). Sort buckets by size, largest first.
Step 2: Displacement Calculation
For each bucket (processing largest first), find a displacement value d such that when we compute (hash1(key) + d * hash2(key)) mod n, all keys in that bucket map to currently unoccupied positions.
Step 3: Storage
Store only the displacement values. During lookup, we hash the key to find its bucket, retrieve the displacement, and compute the final position.
```python
import mmh3  # MurmurHash3 for quality hashing

class CHDBuilder:
    def __init__(self, keys):
        self.keys = list(keys)
        self.n = len(self.keys)
        self.r = max(1, self.n // 5)  # Number of buckets

    def build(self):
        # Step 1: Assign keys to buckets
        buckets = [[] for _ in range(self.r)]
        for key in self.keys:
            bucket_idx = mmh3.hash(key, seed=0) % self.r
            buckets[bucket_idx].append(key)

        # Sort bucket indices by size (descending)
        bucket_order = sorted(range(self.r),
                              key=lambda i: len(buckets[i]),
                              reverse=True)

        # Step 2: Find displacement for each bucket
        occupied = [False] * self.n
        displacements = [0] * self.r
        for bucket_idx in bucket_order:
            bucket = buckets[bucket_idx]
            if not bucket:
                continue
            # Try displacement values until we find one that works
            for d in range(self.n * 10):  # Upper bound on attempts
                positions = []
                valid = True
                for key in bucket:
                    h1 = mmh3.hash(key, seed=1)
                    h2 = mmh3.hash(key, seed=2)
                    pos = (h1 + d * h2) % self.n
                    if occupied[pos] or pos in positions:
                        valid = False
                        break
                    positions.append(pos)
                if valid:
                    displacements[bucket_idx] = d
                    for pos in positions:
                        occupied[pos] = True
                    break
            else:
                raise RuntimeError(f"Failed to place bucket {bucket_idx}")
        return CHDFunction(self.n, self.r, displacements)

class CHDFunction:
    def __init__(self, n, r, displacements):
        self.n = n
        self.r = r
        self.displacements = displacements

    def lookup(self, key):
        bucket_idx = mmh3.hash(key, seed=0) % self.r
        d = self.displacements[bucket_idx]
        h1 = mmh3.hash(key, seed=1)
        h2 = mmh3.hash(key, seed=2)
        return (h1 + d * h2) % self.n

# Usage example
words = ["apple", "banana", "cherry", "date", "elderberry",
         "fig", "grape", "honeydew", "kiwi", "lemon"]
builder = CHDBuilder(words)
mphf = builder.build()

# Verify: each word gets a unique position in 0..9
positions = [mphf.lookup(word) for word in words]
print(f"Positions: {sorted(positions)}")  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```
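Where does the space go? A back-of-envelope count for the displacement array, assuming a plain 32-bit integer per bucket (as the sketch above effectively uses), shows why the "compress" step matters:

```python
# Back-of-envelope space accounting for the displacement scheme.
# Assumption: each displacement is stored as a raw 32-bit integer;
# real CHD entropy-codes the displacement array to get well below this.
n = 1_000_000        # number of keys
r = n // 5           # buckets, one displacement each
raw_bits = r * 32    # uncompressed displacement storage
print(raw_bits / n)  # 6.4 bits per key before compression
```

Compressing that 6.4 bits/key array is what brings CHD down to the ~2 bits/key regime.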
Space-Time Tradeoffs
Different MPHF algorithms make different tradeoffs. Here’s how the major implementations compare:
| Algorithm | Bits/Key | Construction | Lookup | Best For |
|---|---|---|---|---|
| CHD | 2.07 | O(n) | 2-3 hashes | Balanced performance |
| BBHash | 3-4 | O(n) | ~2 hashes | Fast construction |
| RecSplit | 1.8 | O(n log n) | ~1.5 hashes | Minimum space |
| PTHash | 2.2 | O(n) | 2 hashes | Modern CPUs |
```cpp
// Benchmark comparison (conceptual - actual impl varies by library)
#include <chrono>
#include <cstdint>
#include <string>
#include <vector>

struct BenchmarkResult {
    double construction_ms;
    double avg_lookup_ns;
    double bits_per_key;
};

template<typename MPHF>
BenchmarkResult benchmark(const std::vector<std::string>& keys,
                          size_t lookup_iterations) {
    BenchmarkResult result;

    // Measure construction
    auto start = std::chrono::high_resolution_clock::now();
    MPHF mphf(keys);
    auto end = std::chrono::high_resolution_clock::now();
    result.construction_ms =
        std::chrono::duration<double, std::milli>(end - start).count();

    // Measure lookups
    start = std::chrono::high_resolution_clock::now();
    volatile uint64_t sum = 0;  // Prevent the loop from being optimized away
    for (size_t i = 0; i < lookup_iterations; ++i) {
        sum += mphf.lookup(keys[i % keys.size()]);
    }
    end = std::chrono::high_resolution_clock::now();
    result.avg_lookup_ns =
        std::chrono::duration<double, std::nano>(end - start).count()
        / lookup_iterations;

    result.bits_per_key = mphf.bits_per_key();
    return result;
}
```
Practical Implementation with Existing Libraries
Don’t build your own MPHF for production. Use battle-tested libraries:
BBHash (C++, header-only): Excellent for large key sets, parallelized construction.
```cpp
#include <BooPHF.h>
#include <iostream>
#include <string>
#include <vector>

// Custom hasher for string keys
class StringHasher {
public:
    uint64_t operator()(const std::string& key, uint64_t seed = 0) const {
        uint64_t hash = seed;
        for (char c : key) {
            hash = hash * 31 + c;
        }
        return hash;
    }
};

int main() {
    std::vector<std::string> keys = {"user:1001", "user:1002", "user:1003"};
    std::vector<std::string> values = {"Alice", "Bob", "Charlie"};

    // Build MPHF with gamma=2.0 (controls space/time tradeoff)
    double gamma = 2.0;
    auto data_iterator = boomphf::range(keys.begin(), keys.end());
    boomphf::mphf<std::string, StringHasher> mphf(
        keys.size(), data_iterator,
        /*num_threads=*/4, gamma, /*verbose=*/false
    );

    // Create value array indexed by MPHF positions
    std::vector<std::string> indexed_values(keys.size());
    for (size_t i = 0; i < keys.size(); ++i) {
        uint64_t pos = mphf.lookup(keys[i]);
        indexed_values[pos] = values[i];
    }

    // O(1) lookup
    uint64_t pos = mphf.lookup("user:1002");
    std::cout << indexed_values[pos] << std::endl;  // "Bob"
    return 0;
}
```
cmph (C): The classic library, supports multiple algorithms.
pthash (C++): Modern implementation optimized for cache efficiency.
Limitations and When Not to Use MPHFs
MPHFs have hard constraints that make them unsuitable for many use cases:
Static keys only: Adding or removing a key requires rebuilding the entire function. If your key set changes, use cuckoo hashing or Robin Hood hashing instead.
No membership testing: An MPHF will return some integer for any input, even keys not in the original set. You need an additional Bloom filter or must store keys alongside values to detect invalid queries.
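One common way to close this gap, sketched below under the assumption that storing full keys is acceptable (`VerifiedMPHTable` is a hypothetical name, and the lambda stands in for a real MPHF): keep the keys themselves in the position-indexed array and compare on every lookup.

```python
class VerifiedMPHTable:
    """Wraps an MPHF lookup function; stores keys alongside values so
    queries for keys outside the original set can be rejected."""
    def __init__(self, mphf_lookup, keys, values):
        n = len(keys)
        self.lookup = mphf_lookup
        self.keys = [None] * n
        self.values = [None] * n
        for k, v in zip(keys, values):
            pos = mphf_lookup(k)
            self.keys[pos] = k
            self.values[pos] = v

    def get(self, key, default=None):
        pos = self.lookup(key)
        # The MPHF returns *some* position for any input; only the
        # stored-key comparison tells us whether the key is genuine.
        if 0 <= pos < len(self.keys) and self.keys[pos] == key:
            return self.values[pos]
        return default

# Stand-in MPHF for demonstration: a precomputed position table
keys = ["apple", "banana", "cherry"]
positions = {k: i for i, k in enumerate(keys)}
table = VerifiedMPHTable(lambda k: positions.get(k, 0), keys, ["A", "B", "C"])
print(table.get("banana"))  # "B"
print(table.get("durian"))  # None
```

The cost is storing the keys themselves, which sacrifices much of the space advantage; a Bloom filter in front of the MPHF is the usual compromise when false positives are tolerable.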
Construction overhead: Building the MPHF takes time and memory. For small key sets (under 1000 keys), a simple hash table is often faster overall.
Debugging difficulty: When something goes wrong, MPHF internals are opaque. Standard hash tables are easier to inspect and debug.
Use MPHFs when:
- Your key set is fixed at build time
- You’re optimizing for lookup latency, not flexibility
- You have millions of keys where space savings matter
- You can verify key existence through other means
Real-World Applications
Compiler symbol tables: GCC and Clang use perfect hashing for keyword recognition. The set of reserved words is fixed, and fast lookup is critical during lexical analysis.
Genomic k-mer indexing: Tools like Pufferfish use MPHFs to index billions of k-mers from reference genomes. The space savings (2-3 bits per k-mer vs. 8+ bytes for a hash table entry) make previously impossible analyses practical.
CDN routing: Content delivery networks use MPHFs to map content identifiers to server locations. With billions of objects, the space efficiency directly translates to cost savings.
Database query optimization: Static dimension tables in data warehouses can use MPHFs for foreign key lookups, eliminating hash collision overhead in join operations.
The pattern is consistent: when you have a large, static key set and need maximum lookup performance, minimal perfect hash functions deliver. They’re a specialized tool, but in the right context, they’re unbeatable.