Rabin-Karp Algorithm: Rolling Hash Pattern Search
Key Insights
- Rabin-Karp uses a rolling hash to update the hash value in O(1) time as the search window slides, avoiding redundant computation and achieving O(n+m) average-case complexity.
- The algorithm truly shines when searching for multiple patterns simultaneously—a use case where it outperforms KMP and Boyer-Moore by leveraging hash set lookups.
- Hash collisions are inevitable, so always verify matches with actual string comparison; the algorithm’s correctness depends on this verification step.
Introduction to Pattern Matching Challenges
String pattern matching is one of those problems that seems trivial until you’re processing gigabytes of log files or scanning DNA sequences with billions of base pairs. The naive approach—slide the pattern across the text and compare character by character—works fine for small inputs but falls apart at scale.
Consider searching for a pattern of length m in a text of length n. The naive algorithm performs O(nm) comparisons in the worst case. When m is 1000 and n is 10 million, you’re looking at 10 billion comparisons. That’s not going to fly.
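For reference, the naive baseline looks like this (a minimal sketch; the name `naive_search` is chosen here for illustration):

```python
def naive_search(text, pattern):
    """Baseline matcher: O(n*m) character comparisons in the worst case."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):
        # May compare up to m characters at each of ~n alignments
        if text[i:i + m] == pattern:
            matches.append(i)
    return matches
```

For example, `naive_search("abracadabra", "abra")` returns `[0, 7]`.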
Several algorithms tackle this problem: Knuth-Morris-Pratt (KMP) builds a failure function to skip redundant comparisons, while Boyer-Moore uses bad character and good suffix heuristics to skip large chunks of text. Rabin-Karp takes a different approach entirely—it uses hashing.
The genius of Rabin-Karp isn’t that it’s always the fastest single-pattern matcher (it’s not). Its power lies in the rolling hash concept and its natural extension to multiple pattern search. When you need to find any of 1000 different patterns in a document, Rabin-Karp becomes your best friend.
The Rolling Hash Concept
A hash function converts a string into a numeric fingerprint. If two strings have different hashes, they’re definitely different. If they have the same hash, they’re probably the same (but we must verify).
The simplest string hash treats characters as digits in a base-b number:
```python
def polynomial_hash(s, base=256, mod=101):
    """
    Compute polynomial hash of string s.
    hash(s) = s[0]*base^(n-1) + s[1]*base^(n-2) + ... + s[n-1]
    """
    h = 0
    for char in s:
        h = (h * base + ord(char)) % mod
    return h
```
The key insight of Rabin-Karp is this: when you slide the window one position right, you don’t need to recompute the entire hash. You can “roll” it.
If you have the hash of text[i:i+m] and want the hash of text[i+1:i+m+1], you:
- Subtract the contribution of the leftmost character
- Multiply by the base (shift everything left)
- Add the new rightmost character
```python
def roll_hash(old_hash, old_char, new_char, base, mod, h):
    """
    Roll the hash window by one position.
    h = base^(m-1) mod mod, precomputed for efficiency.
    """
    new_hash = (old_hash - ord(old_char) * h) % mod
    new_hash = (new_hash * base + ord(new_char)) % mod
    return new_hash
```
This transforms an O(m) hash computation into O(1). That’s the entire algorithm’s foundation.
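As a sanity check of that claim, the following self-contained sketch (toy parameters assumed, matching the article's examples) confirms that rolling the hash gives the same value as recomputing it from scratch at every position:

```python
base, mod = 256, 101  # toy parameters

def full_hash(s):
    """Recompute the polynomial hash from scratch (O(m))."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

text, m = "rolling hashes roll", 7
h = pow(base, m - 1, mod)  # base^(m-1) % mod, precomputed
window = full_hash(text[:m])
for i in range(len(text) - m):
    # O(1) roll: drop text[i], shift, append text[i + m]
    window = ((window - ord(text[i]) * h) * base + ord(text[i + m])) % mod
    assert window == full_hash(text[i + 1 : i + 1 + m])
```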
Rabin-Karp Algorithm Mechanics
Let’s build the complete algorithm step by step:
```python
def rabin_karp(text, pattern):
    """
    Find all occurrences of pattern in text using Rabin-Karp.
    Returns list of starting indices.
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []

    base = 256  # Number of characters in alphabet
    mod = 101   # A prime number for modular arithmetic

    # Precompute h = base^(m-1) % mod
    h = pow(base, m - 1, mod)

    # Compute initial hashes
    pattern_hash = 0
    window_hash = 0
    for i in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % mod
        window_hash = (window_hash * base + ord(text[i])) % mod

    matches = []
    # Slide the window across text
    for i in range(n - m + 1):
        # Check if hashes match
        if pattern_hash == window_hash:
            # Verify actual match (handle collisions)
            if text[i:i+m] == pattern:
                matches.append(i)
        # Roll the hash for next window
        if i < n - m:
            window_hash = (window_hash - ord(text[i]) * h) % mod
            window_hash = (window_hash * base + ord(text[i + m])) % mod
    return matches
```
Here’s the same implementation in Java for those working in that ecosystem:
```java
import java.util.ArrayList;
import java.util.List;

public class RabinKarp {
    private static final int BASE = 256;
    private static final int MOD = 101;

    public static List<Integer> search(String text, String pattern) {
        List<Integer> matches = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        if (m > n) return matches;

        // Precompute h = BASE^(m-1) % MOD
        int h = 1;
        for (int i = 0; i < m - 1; i++) {
            h = (h * BASE) % MOD;
        }

        // Compute initial hashes
        int patternHash = 0;
        int windowHash = 0;
        for (int i = 0; i < m; i++) {
            patternHash = (patternHash * BASE + pattern.charAt(i)) % MOD;
            windowHash = (windowHash * BASE + text.charAt(i)) % MOD;
        }

        // Slide window
        for (int i = 0; i <= n - m; i++) {
            if (patternHash == windowHash) {
                if (text.substring(i, i + m).equals(pattern)) {
                    matches.add(i);
                }
            }
            if (i < n - m) {
                windowHash = (windowHash - text.charAt(i) * h) % MOD;
                windowHash = (windowHash * BASE + text.charAt(i + m)) % MOD;
                if (windowHash < 0) windowHash += MOD;
            }
        }
        return matches;
    }
}
```
Note the `if (windowHash < 0) windowHash += MOD` in Java—modular arithmetic with negative numbers requires careful handling.
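A quick check of the difference: Python's `%` already returns a value in `[0, mod)` for a positive modulus, so the Python versions need no such fix-up.

```python
# Python: the result of % takes the sign of the divisor
assert (-5) % 101 == 96
# Java: the result of % takes the sign of the dividend (-5 % 101 is -5 there),
# hence the explicit `windowHash += MOD` correction in the Java version
```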
Handling Hash Collisions
Hash collisions are not bugs; they’re a mathematical certainty. When you compress arbitrary-length strings into a fixed-range integer, multiple strings will share the same hash value.
The verification step `text[i:i+m] == pattern` is not optional. It’s what makes the algorithm correct.
```python
def demonstrate_collision():
    """Show why verification matters."""
    # These might hash to the same value with small mod
    s1 = "abc"
    s2 = "xyz"
    mod = 7  # Intentionally small to force collision
    base = 256
    h1 = polynomial_hash(s1, base, mod)
    h2 = polynomial_hash(s2, base, mod)
    print(f"Hash of '{s1}': {h1}")
    print(f"Hash of '{s2}': {h2}")
    # With mod=7, collisions are frequent
```
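With these exact strings and `mod=7`, the collision is real, not merely possible. A self-contained check (the hash function repeated so the snippet runs on its own):

```python
def polynomial_hash(s, base=256, mod=101):
    h = 0
    for char in s:
        h = (h * base + ord(char)) % mod
    return h

# Both strings hash to the same value under mod=7...
assert polynomial_hash("abc", 256, 7) == polynomial_hash("xyz", 256, 7)
# ...yet the strings differ, which is exactly what verification catches
assert "abc" != "xyz"
```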
To minimize collisions, choose your parameters wisely:
- Use a large prime for mod: 10^9 + 7 is a common choice
- Use a prime base: 31 or 37 work well for lowercase ASCII
- Consider double hashing: Use two independent hash functions
```python
def rabin_karp_double_hash(text, pattern):
    """Verify only when BOTH hashes match; collisions drop to ~1/(mod1*mod2)."""
    params = [(31, 10**9 + 7), (37, 10**9 + 9)]  # two independent (base, mod)
    n, m = len(text), len(pattern)
    if m > n:
        return []
    hs = [pow(b, md_m - 1, md) for b, md in params for md_m in [m]]
    ph = [polynomial_hash(pattern, b, md) for b, md in params]
    wh = [polynomial_hash(text[:m], b, md) for b, md in params]
    matches = []
    for i in range(n - m + 1):
        if ph == wh and text[i:i+m] == pattern:
            matches.append(i)
        if i < n - m:
            wh = [roll_hash(wh[j], text[i], text[i + m], *params[j], hs[j])
                  for j in range(2)]
    return matches
```
Multiple Pattern Search
This is where Rabin-Karp earns its keep. Searching for k patterns of equal length m in text of length n:
- Naive: O(knm)
- KMP/Boyer-Moore: O(k(n+m))
- Rabin-Karp: O(n + km) average case
```python
def multi_pattern_search(text, patterns):
    """
    Search for multiple patterns of equal length simultaneously.
    """
    if not patterns:
        return {}
    m = len(patterns[0])
    if not all(len(p) == m for p in patterns):
        raise ValueError("All patterns must have equal length")
    n = len(text)
    if m > n:
        return {p: [] for p in patterns}
    base, mod = 256, 10**9 + 7

    # Precompute h = base^(m-1) % mod
    h = pow(base, m - 1, mod)

    # Build hash table of pattern hashes
    pattern_hashes = {}
    for pattern in patterns:
        ph = 0
        for c in pattern:
            ph = (ph * base + ord(c)) % mod
        if ph not in pattern_hashes:
            pattern_hashes[ph] = []
        pattern_hashes[ph].append(pattern)

    # Compute initial window hash
    window_hash = 0
    for i in range(m):
        window_hash = (window_hash * base + ord(text[i])) % mod

    results = {p: [] for p in patterns}
    for i in range(n - m + 1):
        if window_hash in pattern_hashes:
            # Check all patterns with this hash
            substring = text[i:i+m]
            for pattern in pattern_hashes[window_hash]:
                if substring == pattern:
                    results[pattern].append(i)
        if i < n - m:
            window_hash = (window_hash - ord(text[i]) * h) % mod
            window_hash = (window_hash * base + ord(text[i + m])) % mod
    return results
```
The hash set lookup is O(1) average case, making the total complexity independent of k for the main loop.
Practical Applications
Here’s a simplified plagiarism detector that finds common phrases between documents:
```python
def find_common_phrases(doc1, doc2, phrase_length=10):
    """
    Find common phrases of given length between two documents.
    Returns list of matching phrases.
    """
    base, mod = 31, 10**9 + 7
    h = pow(base, phrase_length - 1, mod)

    # Build hash table from first document: hash -> set of phrases
    # (a set per hash, so colliding phrases are not silently dropped)
    doc1_hashes = {}
    if len(doc1) >= phrase_length:
        window_hash = 0
        for i in range(phrase_length):
            window_hash = (window_hash * base + ord(doc1[i])) % mod
        for i in range(len(doc1) - phrase_length + 1):
            phrase = doc1[i:i+phrase_length]
            doc1_hashes.setdefault(window_hash, set()).add(phrase)
            if i < len(doc1) - phrase_length:
                window_hash = (window_hash - ord(doc1[i]) * h) % mod
                window_hash = (window_hash * base + ord(doc1[i + phrase_length])) % mod

    # Scan second document for matches
    common_phrases = set()
    if len(doc2) >= phrase_length:
        window_hash = 0
        for i in range(phrase_length):
            window_hash = (window_hash * base + ord(doc2[i])) % mod
        for i in range(len(doc2) - phrase_length + 1):
            if window_hash in doc1_hashes:
                phrase = doc2[i:i+phrase_length]
                if phrase in doc1_hashes[window_hash]:
                    common_phrases.add(phrase)
            if i < len(doc2) - phrase_length:
                window_hash = (window_hash - ord(doc2[i]) * h) % mod
                window_hash = (window_hash * base + ord(doc2[i + phrase_length])) % mod
    return list(common_phrases)
```
This same technique powers DNA sequence alignment, spam filter signature matching, and duplicate content detection.
Performance Considerations and Trade-offs
Average case: O(n + m) — each position requires O(1) hash update and comparison.
Worst case: O(nm) — occurs when every position produces a hash match but fails verification. This happens with pathological inputs or poor hash function choices.
When to choose Rabin-Karp:
- Multiple pattern search (its killer feature)
- Patterns of varying lengths (compute hashes for each length)
- When you need simplicity over guaranteed worst-case performance
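The varying-lengths point can be sketched by grouping patterns by length and running one rolling scan per distinct length. This `search_varied_lengths` helper is a hypothetical illustration, not a standard named algorithm:

```python
from collections import defaultdict

def search_varied_lengths(text, patterns, base=256, mod=10**9 + 7):
    """Group patterns by length; one rolling-hash scan per distinct length."""
    by_len = defaultdict(set)
    for p in patterns:
        by_len[len(p)].add(p)
    results = defaultdict(list)
    n = len(text)
    for m, group in by_len.items():
        if m == 0 or m > n:
            continue
        h = pow(base, m - 1, mod)
        hashes = defaultdict(list)
        for p in group:
            ph = 0
            for c in p:
                ph = (ph * base + ord(c)) % mod
            hashes[ph].append(p)
        wh = 0
        for i in range(m):
            wh = (wh * base + ord(text[i])) % mod
        for i in range(n - m + 1):
            if wh in hashes:
                sub = text[i:i+m]
                for p in hashes[wh]:
                    if sub == p:
                        results[p].append(i)
            if i < n - m:
                wh = ((wh - ord(text[i]) * h) * base + ord(text[i + m])) % mod
    return dict(results)
```

Patterns that never occur are simply absent from the returned dict; the total work is one scan of the text per distinct pattern length.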
When to avoid it:
- Single pattern search where worst-case guarantees matter (use KMP)
- Very short patterns (overhead isn’t worth it)
- Memory-constrained environments searching many patterns
Optimization tips:
- Use modular exponentiation for computing h
- Precompute pattern hashes before scanning
- Consider SIMD for hash computation on modern CPUs
- For very long patterns, use multiple hash functions instead of larger modulus
Rabin-Karp won’t win benchmarks against highly optimized Boyer-Moore implementations for single-pattern search. But when you’re building a system that needs to match documents against thousands of known signatures, its elegant simplicity and natural parallelization make it the pragmatic choice.