Rabin-Karp Algorithm: Rolling Hash Pattern Search
Key Insights
- Rabin-Karp uses a rolling hash to update the hash value in O(1) time as the search window slides, avoiding redundant computation and achieving O(n+m) average-case complexity.
- The algorithm truly shines when searching for multiple patterns simultaneously—a use case where it outperforms KMP and Boyer-Moore by leveraging hash set lookups.
- Hash collisions are inevitable, so always verify matches with actual string comparison; the algorithm’s correctness depends on this verification step.
Introduction to Pattern Matching Challenges
String pattern matching is one of those problems that seems trivial until you’re processing gigabytes of log files or scanning DNA sequences with billions of base pairs. The naive approach—slide the pattern across the text and compare character by character—works fine for small inputs but falls apart at scale.
Consider searching for a pattern of length m in a text of length n. The naive algorithm performs O(nm) comparisons in the worst case. When m is 1000 and n is 10 million, you’re looking at 10 billion comparisons. That’s not going to fly.
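For reference, the naive baseline looks like this (a minimal sketch; the name `naive_search` is chosen here for illustration):

```python
def naive_search(text, pattern):
    """Baseline matcher: O(n*m) character comparisons in the worst case."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):
        # May compare up to m characters at each of ~n alignments
        if text[i:i + m] == pattern:
            matches.append(i)
    return matches
```

For example, `naive_search("abracadabra", "abra")` returns `[0, 7]`.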
Several algorithms tackle this problem: Knuth-Morris-Pratt (KMP) builds a failure function to skip redundant comparisons, while Boyer-Moore uses bad character and good suffix heuristics to skip large chunks of text. Rabin-Karp takes a different approach entirely—it uses hashing.
The genius of Rabin-Karp isn’t that it’s always the fastest single-pattern matcher (it’s not). Its power lies in the rolling hash concept and its natural extension to multiple pattern search. When you need to find any of 1000 different patterns in a document, Rabin-Karp becomes your best friend.
The Rolling Hash Concept
A hash function converts a string into a numeric fingerprint. If two strings have different hashes, they’re definitely different. If they have the same hash, they’re probably the same (but we must verify).
The simplest string hash treats characters as digits in a base-b number:
```python
def polynomial_hash(s, base=256, mod=101):
    """
    Compute polynomial hash of string s.
    hash(s) = s[0]*base^(n-1) + s[1]*base^(n-2) + ... + s[n-1]
    """
    h = 0
    for char in s:
        h = (h * base + ord(char)) % mod
    return h
```
The key insight of Rabin-Karp is this: when you slide the window one position right, you don’t need to recompute the entire hash. You can “roll” it.
If you have the hash of text[i:i+m] and want the hash of text[i+1:i+m+1], you:
- Subtract the contribution of the leftmost character
- Multiply by the base (shift everything left)
- Add the new rightmost character
```python
def roll_hash(old_hash, old_char, new_char, base, mod, h):
    """
    Roll the hash window by one position.
    h = base^(m-1) mod mod, precomputed for efficiency.
    """
    new_hash = (old_hash - ord(old_char) * h) % mod
    new_hash = (new_hash * base + ord(new_char)) % mod
    return new_hash
```
This transforms an O(m) hash computation into O(1). That’s the entire algorithm’s foundation.
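As a sanity check of that claim, the following self-contained sketch (toy parameters assumed, matching the article's examples) confirms that rolling the hash gives the same value as recomputing it from scratch at every position:

```python
base, mod = 256, 101  # toy parameters

def full_hash(s):
    """Recompute the polynomial hash from scratch (O(m))."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

text, m = "rolling hashes roll", 7
h = pow(base, m - 1, mod)  # base^(m-1) % mod, precomputed
window = full_hash(text[:m])
for i in range(len(text) - m):
    # O(1) roll: drop text[i], shift, append text[i + m]
    window = ((window - ord(text[i]) * h) * base + ord(text[i + m])) % mod
    assert window == full_hash(text[i + 1 : i + 1 + m])
```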
Rabin-Karp Algorithm Mechanics
Let’s build the complete algorithm step by step:
```python
def rabin_karp(text, pattern):
    """
    Find all occurrences of pattern in text using Rabin-Karp.
    Returns list of starting indices.
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []

    base = 256  # Number of characters in alphabet
    mod = 101   # A prime number for modular arithmetic

    # Precompute h = base^(m-1) % mod
    h = pow(base, m - 1, mod)

    # Compute initial hashes
    pattern_hash = 0
    window_hash = 0
    for i in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % mod
        window_hash = (window_hash * base + ord(text[i])) % mod

    matches = []
    # Slide the window across text
    for i in range(n - m + 1):
        # Check if hashes match
        if pattern_hash == window_hash:
            # Verify actual match (handle collisions)
            if text[i:i+m] == pattern:
                matches.append(i)
        # Roll the hash for next window
        if i < n - m:
            window_hash = (window_hash - ord(text[i]) * h) % mod
            window_hash = (window_hash * base + ord(text[i + m])) % mod
    return matches
```
Here’s the same implementation in Java for those working in that ecosystem:
```java
import java.util.ArrayList;
import java.util.List;

public class RabinKarp {
    private static final int BASE = 256;
    private static final int MOD = 101;

    public static List<Integer> search(String text, String pattern) {
        List<Integer> matches = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        if (m > n) return matches;

        // Precompute h = BASE^(m-1) % MOD
        int h = 1;
        for (int i = 0; i < m - 1; i++) {
            h = (h * BASE) % MOD;
        }

        // Compute initial hashes
        int patternHash = 0;
        int windowHash = 0;
        for (int i = 0; i < m; i++) {
            patternHash = (patternHash * BASE + pattern.charAt(i)) % MOD;
            windowHash = (windowHash * BASE + text.charAt(i)) % MOD;
        }

        // Slide window
        for (int i = 0; i <= n - m; i++) {
            if (patternHash == windowHash) {
                if (text.substring(i, i + m).equals(pattern)) {
                    matches.add(i);
                }
            }
            if (i < n - m) {
                windowHash = (windowHash - text.charAt(i) * h) % MOD;
                windowHash = (windowHash * BASE + text.charAt(i + m)) % MOD;
                if (windowHash < 0) windowHash += MOD;
            }
        }
        return matches;
    }
}
```
Note the `if (windowHash < 0) windowHash += MOD` in Java—modular arithmetic with negative numbers requires careful handling.
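A quick check of the difference: Python's `%` already returns a value in `[0, mod)` for a positive modulus, so the Python versions need no such fix-up.

```python
# Python: the result of % takes the sign of the divisor
assert (-5) % 101 == 96
# Java: the result of % takes the sign of the dividend (-5 % 101 is -5 there),
# hence the explicit `windowHash += MOD` correction in the Java version
```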
Handling Hash Collisions
Hash collisions are not bugs; they’re a mathematical certainty. When you compress arbitrary-length strings into a fixed-range integer, multiple strings will share the same hash value.
The verification step `text[i:i+m] == pattern` is not optional. It’s what makes the algorithm correct.
```python
def demonstrate_collision():
    """Show why verification matters."""
    # These might hash to the same value with small mod
    s1 = "abc"
    s2 = "xyz"
    mod = 7  # Intentionally small to force collision
    base = 256
    h1 = polynomial_hash(s1, base, mod)
    h2 = polynomial_hash(s2, base, mod)
    print(f"Hash of '{s1}': {h1}")
    print(f"Hash of '{s2}': {h2}")
    # With mod=7, collisions are frequent
```
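With these exact strings and `mod=7`, the collision is real, not merely possible. A self-contained check (the hash function repeated so the snippet runs on its own):

```python
def polynomial_hash(s, base=256, mod=101):
    h = 0
    for char in s:
        h = (h * base + ord(char)) % mod
    return h

# Both strings hash to the same value under mod=7...
assert polynomial_hash("abc", 256, 7) == polynomial_hash("xyz", 256, 7)
# ...yet the strings differ, which is exactly what verification catches
assert "abc" != "xyz"
```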
To minimize collisions, choose your parameters wisely:
- Use a large prime for mod: 10^9 + 7 is a common choice
- Use a prime base: 31 or 37 work well for lowercase ASCII
- Consider double hashing: Use two independent hash functions
```python
def rabin_karp_double_hash(text, pattern):
    """Verify only when BOTH hashes match; collisions drop to ~1/(mod1*mod2)."""
    params = [(31, 10**9 + 7), (37, 10**9 + 9)]  # two independent (base, mod)
    n, m = len(text), len(pattern)
    if m > n:
        return []
    hs = [pow(b, md_m - 1, md) for b, md in params for md_m in [m]]
    ph = [polynomial_hash(pattern, b, md) for b, md in params]
    wh = [polynomial_hash(text[:m], b, md) for b, md in params]
    matches = []
    for i in range(n - m + 1):
        if ph == wh and text[i:i+m] == pattern:
            matches.append(i)
        if i < n - m:
            wh = [roll_hash(wh[j], text[i], text[i + m], *params[j], hs[j])
                  for j in range(2)]
    return matches
```
Multiple Pattern Search
This is where Rabin-Karp earns its keep. Searching for k patterns of equal length m in text of length n:
- Naive: O(knm)
- KMP/Boyer-Moore: O(k(n+m))
- Rabin-Karp: O(n + km) average case
```python
def multi_pattern_search(text, patterns):
    """
    Search for multiple patterns of equal length simultaneously.
    """
    if not patterns:
        return {}
    m = len(patterns[0])
    if not all(len(p) == m for p in patterns):
        raise ValueError("All patterns must have equal length")
    n = len(text)
    if m > n:
        return {p: [] for p in patterns}
    base, mod = 256, 10**9 + 7

    # Precompute h = base^(m-1) % mod
    h = pow(base, m - 1, mod)

    # Build hash table of pattern hashes
    pattern_hashes = {}
    for pattern in patterns:
        ph = 0
        for c in pattern:
            ph = (ph * base + ord(c)) % mod
        if ph not in pattern_hashes:
            pattern_hashes[ph] = []
        pattern_hashes[ph].append(pattern)

    # Compute initial window hash
    window_hash = 0
    for i in range(m):
        window_hash = (window_hash * base + ord(text[i])) % mod

    results = {p: [] for p in patterns}
    for i in range(n - m + 1):
        if window_hash in pattern_hashes:
            # Check all patterns with this hash
            substring = text[i:i+m]
            for pattern in pattern_hashes[window_hash]:
                if substring == pattern:
                    results[pattern].append(i)
        if i < n - m:
            window_hash = (window_hash - ord(text[i]) * h) % mod
            window_hash = (window_hash * base + ord(text[i + m])) % mod
    return results
```
The hash set lookup is O(1) average case, making the total complexity independent of k for the main loop.
Practical Applications
Here’s a simplified plagiarism detector that finds common phrases between documents:
```python
def find_common_phrases(doc1, doc2, phrase_length=10):
    """
    Find common phrases of given length between two documents.
    Returns list of matching phrases.
    """
    base, mod = 31, 10**9 + 7
    h = pow(base, phrase_length - 1, mod)

    # Build hash table from first document: hash -> set of phrases
    # (a set per hash, so colliding phrases are not silently dropped)
    doc1_hashes = {}
    if len(doc1) >= phrase_length:
        window_hash = 0
        for i in range(phrase_length):
            window_hash = (window_hash * base + ord(doc1[i])) % mod
        for i in range(len(doc1) - phrase_length + 1):
            phrase = doc1[i:i+phrase_length]
            doc1_hashes.setdefault(window_hash, set()).add(phrase)
            if i < len(doc1) - phrase_length:
                window_hash = (window_hash - ord(doc1[i]) * h) % mod
                window_hash = (window_hash * base + ord(doc1[i + phrase_length])) % mod

    # Scan second document for matches
    common_phrases = set()
    if len(doc2) >= phrase_length:
        window_hash = 0
        for i in range(phrase_length):
            window_hash = (window_hash * base + ord(doc2[i])) % mod
        for i in range(len(doc2) - phrase_length + 1):
            if window_hash in doc1_hashes:
                phrase = doc2[i:i+phrase_length]
                if phrase in doc1_hashes[window_hash]:
                    common_phrases.add(phrase)
            if i < len(doc2) - phrase_length:
                window_hash = (window_hash - ord(doc2[i]) * h) % mod
                window_hash = (window_hash * base + ord(doc2[i + phrase_length])) % mod
    return list(common_phrases)
```
This same technique powers DNA sequence alignment, spam filter signature matching, and duplicate content detection.
Performance Considerations and Trade-offs
Average case: O(n + m) — each position requires O(1) hash update and comparison.
Worst case: O(nm) — occurs when every position produces a hash match but fails verification. This happens with pathological inputs or poor hash function choices.
When to choose Rabin-Karp:
- Multiple pattern search (its killer feature)
- Patterns of varying lengths (compute hashes for each length)
- When you need simplicity over guaranteed worst-case performance
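The varying-lengths point can be sketched by grouping patterns by length and running one rolling scan per distinct length. This `search_varied_lengths` helper is a hypothetical illustration, not a standard named algorithm:

```python
from collections import defaultdict

def search_varied_lengths(text, patterns, base=256, mod=10**9 + 7):
    """Group patterns by length; one rolling-hash scan per distinct length."""
    by_len = defaultdict(set)
    for p in patterns:
        by_len[len(p)].add(p)
    results = defaultdict(list)
    n = len(text)
    for m, group in by_len.items():
        if m == 0 or m > n:
            continue
        h = pow(base, m - 1, mod)
        hashes = defaultdict(list)
        for p in group:
            ph = 0
            for c in p:
                ph = (ph * base + ord(c)) % mod
            hashes[ph].append(p)
        wh = 0
        for i in range(m):
            wh = (wh * base + ord(text[i])) % mod
        for i in range(n - m + 1):
            if wh in hashes:
                sub = text[i:i+m]
                for p in hashes[wh]:
                    if sub == p:
                        results[p].append(i)
            if i < n - m:
                wh = ((wh - ord(text[i]) * h) * base + ord(text[i + m])) % mod
    return dict(results)
```

Patterns that never occur are simply absent from the returned dict; the total work is one scan of the text per distinct pattern length.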
When to avoid it:
- Single pattern search where worst-case guarantees matter (use KMP)
- Very short patterns (overhead isn’t worth it)
- Memory-constrained environments searching many patterns
Optimization tips:
- Use modular exponentiation for computing h
- Precompute pattern hashes before scanning
- Consider SIMD for hash computation on modern CPUs
- For very long patterns, use multiple hash functions instead of larger modulus
Rabin-Karp won’t win benchmarks against highly optimized Boyer-Moore implementations for single-pattern search. But when you’re building a system that needs to match documents against thousands of known signatures, its elegant simplicity and natural parallelization make it the pragmatic choice.