Perfect Hashing: Minimal Perfect Hash Functions

Key Insights

  • Minimal perfect hash functions map n keys to exactly n consecutive integers with zero collisions, enabling O(1) lookups without collision handling overhead—but only work for static, known key sets.
  • The CHD algorithm achieves near-optimal space (2-3 bits per key) through a clever two-level scheme using displacement values, making it practical for millions of keys.
  • Choose MPHFs when you have a fixed key set and need maximum lookup performance; avoid them when keys change frequently or you need to detect non-existent keys without additional structures.

Introduction to Perfect Hashing

Every developer who’s implemented a hash table knows the pain of collisions. Two different keys hash to the same bucket, and suddenly you’re dealing with chaining, probing, or some other resolution strategy that adds complexity and degrades performance.

But what if you could eliminate collisions entirely?

A perfect hash function guarantees no collisions for a specific, known set of keys. A minimal perfect hash function (MPHF) goes further: it maps n keys to exactly n consecutive integers (0 to n-1), wasting no space in the output range.

This isn’t theoretical elegance for its own sake. MPHFs power real systems where lookup performance is critical and the key set is static:

  • Compiler symbol tables: Keywords and built-in functions don’t change between compilations
  • Database indexes: Static dimension tables in data warehouses
  • DNS resolution: Mapping domain names to IP addresses in high-performance resolvers
  • Genomic analysis: Indexing k-mers (DNA subsequences) where the set is determined by the reference genome

The tradeoff is fundamental: you sacrifice the ability to add or remove keys in exchange for perfect, collision-free lookups.
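The contract is easy to see with a toy construction. Here is a minimal sketch, assuming nothing beyond the standard library (brute force, nothing like a production algorithm): search for a seed under which a simple seeded hash is collision-free over a small fixed key set.

```python
def fnv1a(key: str, seed: int) -> int:
    """Seeded FNV-1a, a simple 32-bit string hash."""
    h = (2166136261 ^ seed) & 0xFFFFFFFF
    for byte in key.encode():
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

def find_perfect_seed(keys):
    """Brute-force a seed under which fnv1a(key) % n has no collisions.

    With n keys, a random seed works with probability n!/n^n, so for
    small n a few hundred tries usually suffice.
    """
    n = len(keys)
    for seed in range(1_000_000):
        if len({fnv1a(k, seed) % n for k in keys}) == n:
            return seed
    raise RuntimeError("no perfect seed found")

keywords = ["if", "else", "while", "for", "return"]
seed = find_perfect_seed(keywords)
positions = sorted(fnv1a(k, seed) % len(keywords) for k in keywords)
print(positions)  # [0, 1, 2, 3, 4] -- minimal and perfect
```

Five distinct values mod 5 must be exactly 0 through 4: that is what "minimal" buys you. The brute-force search is exponential in n, which is why real constructions partition keys into small buckets first.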

How Minimal Perfect Hash Functions Work

The core insight behind MPHFs is that we’re solving an easier problem than general hashing. We know all keys upfront, so we can spend time during construction to find a function that perfectly distributes them.

Most MPHF algorithms use a two-level approach:

  1. First-level hash: Partition keys into buckets
  2. Second-level function: Assign final positions within or across buckets using precomputed auxiliary data

The magic happens in how we compute and store that auxiliary data. Graph-based approaches (like CHD) model the problem as finding an assignment that satisfies constraints. Fingerprint-based approaches use compact bit vectors to encode position information.

Here’s what MPHF lookup looks like in practice—notice the complete absence of collision handling:

class MinimalPerfectHashTable:
    def __init__(self, mphf, values):
        self.mphf = mphf  # The minimal perfect hash function
        self.values = values  # Values stored at positions 0..n-1
    
    def get(self, key):
        # O(1) lookup: compute position, return value
        # No chains to walk, no probing sequences
        position = self.mphf.lookup(key)
        return self.values[position]
    
    # For contrast, lookup in a chaining hash table (illustrative --
    # assumes a `self.buckets` list of (key, value) chains):
    def get_standard_hashtable(self, key):
        bucket_idx = hash(key) % len(self.buckets)
        # Now we must handle collisions...
        for stored_key, value in self.buckets[bucket_idx]:
            if stored_key == key:
                return value
        return None

The MPHF version has predictable, constant-time performance. No worst-case degradation when buckets get crowded.

The CHD Algorithm

The Compress, Hash, and Displace (CHD) algorithm, introduced by Belazzougui, Botelho, and Dietzfelbinger in 2009, remains one of the most practical MPHF constructions. It achieves excellent space efficiency (around 2.07 bits per key) with fast lookups.

Here’s how it works:

Step 1: Bucket assignment. Hash each key to one of r buckets (typically r ≈ n/5), then sort the buckets by size, largest first.

Step 2: Displacement calculation. For each bucket, processed largest first, find a displacement value d such that (hash1(key) + d * hash2(key)) mod n maps every key in the bucket to a currently unoccupied position.

Step 3: Storage. Store only the displacement values. During lookup, hash the key to find its bucket, retrieve its displacement, and compute the final position.

import mmh3  # MurmurHash3 for quality hashing

class CHDBuilder:
    def __init__(self, keys):
        self.keys = list(keys)
        self.n = len(self.keys)
        self.r = max(1, self.n // 5)  # Number of buckets
        
    def build(self):
        # Step 1: Assign keys to buckets
        buckets = [[] for _ in range(self.r)]
        for key in self.keys:
            bucket_idx = mmh3.hash(key, seed=0) % self.r
            buckets[bucket_idx].append(key)
        
        # Sort bucket indices by size (descending)
        bucket_order = sorted(range(self.r), 
                             key=lambda i: len(buckets[i]), 
                             reverse=True)
        
        # Step 2: Find displacement for each bucket
        occupied = [False] * self.n
        displacements = [0] * self.r
        
        for bucket_idx in bucket_order:
            bucket = buckets[bucket_idx]
            if not bucket:
                continue
                
            # Try displacement values until we find one that works
            for d in range(self.n * 10):  # Upper bound on attempts
                positions = []
                valid = True
                
                for key in bucket:
                    h1 = mmh3.hash(key, seed=1)
                    h2 = mmh3.hash(key, seed=2)
                    pos = (h1 + d * h2) % self.n
                    
                    if occupied[pos] or pos in positions:
                        valid = False
                        break
                    positions.append(pos)
                
                if valid:
                    displacements[bucket_idx] = d
                    for pos in positions:
                        occupied[pos] = True
                    break
            else:
                raise RuntimeError(f"Failed to place bucket {bucket_idx}")
        
        return CHDFunction(self.n, self.r, displacements)


class CHDFunction:
    def __init__(self, n, r, displacements):
        self.n = n
        self.r = r
        self.displacements = displacements
    
    def lookup(self, key):
        bucket_idx = mmh3.hash(key, seed=0) % self.r
        d = self.displacements[bucket_idx]
        h1 = mmh3.hash(key, seed=1)
        h2 = mmh3.hash(key, seed=2)
        return (h1 + d * h2) % self.n


# Usage example
words = ["apple", "banana", "cherry", "date", "elderberry", 
         "fig", "grape", "honeydew", "kiwi", "lemon"]
builder = CHDBuilder(words)
mphf = builder.build()

# Verify: each word gets a unique position in 0..9
positions = [mphf.lookup(word) for word in words]
print(f"Positions: {sorted(positions)}")  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Space-Time Tradeoffs

Different MPHF algorithms make different tradeoffs. Here’s how the major implementations compare:

Algorithm   Bits/Key   Construction   Lookup        Best For
CHD         2.07       O(n)           2-3 hashes    Balanced performance
BBHash      3-4        O(n)           ~2 hashes     Fast construction
RecSplit    1.8        O(n log n)     ~1.5 hashes   Minimum space
PTHash      2.2        O(n)           2 hashes      Modern CPUs

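To make the bits-per-key column concrete, here is a back-of-envelope sizing for 100 million keys. The bits-per-key figures come from the table above (3.5 is taken as BBHash's midpoint); the 16-bytes-per-entry estimate for a conventional table is an assumption, and real overhead varies by implementation.

```python
N = 100_000_000  # keys

# MPHF index sizes at each space budget from the comparison table
for name, bits_per_key in [("CHD", 2.07), ("BBHash", 3.5),
                           ("RecSplit", 1.8), ("PTHash", 2.2)]:
    megabytes = N * bits_per_key / 8 / 1e6
    print(f"{name:9s} {megabytes:7.1f} MB")

# A conventional open-addressing table holding an 8-byte hash and an
# 8-byte value slot per key, ignoring load-factor slack:
print(f"{'hash map':9s} {N * 16 / 1e6:7.1f} MB")
```

At this scale the MPHF index fits in a few tens of megabytes, roughly two orders of magnitude smaller than the conventional table, which is why the genomics and CDN use cases below are feasible at all.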
// Benchmark comparison (conceptual - actual impl varies by library)
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct BenchmarkResult {
    double construction_ms;
    double avg_lookup_ns;
    double bits_per_key;
};

template<typename MPHF>
BenchmarkResult benchmark(const std::vector<std::string>& keys, 
                          size_t lookup_iterations) {
    BenchmarkResult result;
    
    // Measure construction
    auto start = std::chrono::high_resolution_clock::now();
    MPHF mphf(keys);
    auto end = std::chrono::high_resolution_clock::now();
    result.construction_ms = 
        std::chrono::duration<double, std::milli>(end - start).count();
    
    // Measure lookups
    start = std::chrono::high_resolution_clock::now();
    volatile uint64_t sum = 0;  // Prevent optimization
    for (size_t i = 0; i < lookup_iterations; ++i) {
        sum += mphf.lookup(keys[i % keys.size()]);
    }
    end = std::chrono::high_resolution_clock::now();
    result.avg_lookup_ns = 
        std::chrono::duration<double, std::nano>(end - start).count() 
        / lookup_iterations;
    
    result.bits_per_key = mphf.bits_per_key();
    return result;
}

Practical Implementation with Existing Libraries

Don’t build your own MPHF for production. Use battle-tested libraries:

BBHash (C++, header-only): Excellent for large key sets, parallelized construction.

#include <BooPHF.h>
#include <iostream>
#include <string>
#include <vector>

// Custom hasher for string keys
class StringHasher {
public:
    uint64_t operator()(const std::string& key, uint64_t seed = 0) const {
        uint64_t hash = seed;
        for (char c : key) {
            hash = hash * 31 + c;
        }
        return hash;
    }
};

int main() {
    std::vector<std::string> keys = {"user:1001", "user:1002", "user:1003"};
    std::vector<std::string> values = {"Alice", "Bob", "Charlie"};
    
    // Build MPHF with gamma=2.0 (controls space/time tradeoff)
    double gamma = 2.0;
    auto data_iterator = boomphf::range(keys.begin(), keys.end());
    
    boomphf::mphf<std::string, StringHasher> mphf(
        keys.size(), data_iterator, 
        /*num_threads=*/4, gamma, /*verbose=*/false
    );
    
    // Create value array indexed by MPHF positions
    std::vector<std::string> indexed_values(keys.size());
    for (size_t i = 0; i < keys.size(); ++i) {
        uint64_t pos = mphf.lookup(keys[i]);
        indexed_values[pos] = values[i];
    }
    
    // O(1) lookup
    uint64_t pos = mphf.lookup("user:1002");
    std::cout << indexed_values[pos] << std::endl;  // "Bob"
    
    return 0;
}

cmph (C): The classic library, supports multiple algorithms.

pthash (C++): Modern implementation optimized for cache efficiency.

Limitations and When Not to Use MPHFs

MPHFs have hard constraints that make them unsuitable for many use cases:

Static keys only: Adding or removing a key requires rebuilding the entire function. If your key set changes, use cuckoo hashing or Robin Hood hashing instead.

No membership testing: An MPHF will return some integer for any input, even keys not in the original set. You need an additional Bloom filter or must store keys alongside values to detect invalid queries.
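One common mitigation is to store a small fingerprint per slot and compare it on lookup. A sketch, with hypothetical names throughout (`DictMPHF` is a stand-in for demonstration; in practice you would plug in a real MPHF such as CHD or BBHash): an 8-bit fingerprint rejects a foreign key with probability about 255/256.

```python
import hashlib

def fingerprint(key: str) -> int:
    """8-bit fingerprint; a foreign key slips through with probability ~1/256."""
    return hashlib.blake2b(key.encode(), digest_size=1).digest()[0]

class VerifiedTable:
    """Pairs an MPHF with one fingerprint byte per slot to reject foreign keys."""
    def __init__(self, mphf, keys, values):
        self.mphf = mphf
        self.fingerprints = bytearray(len(keys))
        self.values = [None] * len(keys)
        for key, value in zip(keys, values):
            pos = mphf.lookup(key)
            self.fingerprints[pos] = fingerprint(key)
            self.values[pos] = value

    def get(self, key, default=None):
        pos = self.mphf.lookup(key)
        if self.fingerprints[pos] != fingerprint(key):
            return default  # almost certainly not in the original key set
        return self.values[pos]

# Stand-in MPHF for demonstration only; a real one (CHD, BBHash) also
# returns *some* in-range position for keys it has never seen.
class DictMPHF:
    def __init__(self, keys):
        self._pos = {k: i for i, k in enumerate(keys)}
        self._n = len(keys)
    def lookup(self, key):
        return self._pos.get(key, hash(key) % self._n)

words = ["apple", "banana", "cherry"]
table = VerifiedTable(DictMPHF(words), words, ["red", "yellow", "dark red"])
print(table.get("banana"))       # "yellow"
print(table.get("dragonfruit"))  # None, barring a 1/256 fingerprint collision
```

The fingerprint array costs one extra byte per key, far less than storing full keys; if even a 1/256 false-positive rate is unacceptable, store the keys themselves and compare exactly.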

Construction overhead: Building the MPHF takes time and memory. For small key sets (under 1000 keys), a simple hash table is often faster overall.

Debugging difficulty: When something goes wrong, MPHF internals are opaque. Standard hash tables are easier to inspect and debug.

Use MPHFs when:

  • Your key set is fixed at build time
  • You’re optimizing for lookup latency, not flexibility
  • You have millions of keys where space savings matter
  • You can verify key existence through other means

Real-World Applications

Compiler symbol tables: GCC and Clang use perfect hashing for keyword recognition. The set of reserved words is fixed, and fast lookup is critical during lexical analysis.

Genomic k-mer indexing: Tools like Pufferfish use MPHFs to index billions of k-mers from reference genomes. The space savings (2-3 bits per k-mer vs. 8+ bytes for a hash table entry) make previously impossible analyses practical.

CDN routing: Content delivery networks use MPHFs to map content identifiers to server locations. With billions of objects, the space efficiency directly translates to cost savings.

Database query optimization: Static dimension tables in data warehouses can use MPHFs for foreign key lookups, eliminating hash collision overhead in join operations.

The pattern is consistent: when you have a large, static key set and need maximum lookup performance, minimal perfect hash functions deliver. They’re a specialized tool, but in the right context, they’re unbeatable.
