String Hashing: Polynomial Rolling Hash
String comparison is expensive. Comparing two strings of length n requires O(n) time in the worst case. When you need to find a pattern in text, check for duplicates in a collection, or build a hash...
Key Insights
- Polynomial rolling hash converts strings to integers by treating characters as coefficients of a polynomial, enabling O(1) hash updates when sliding a window across text.
- The “rolling” property comes from the mathematical relationship between adjacent windows—you can compute the next hash by subtracting the outgoing character’s contribution and adding the incoming character’s, avoiding full recomputation.
- Double hashing with two independent hash functions reduces collision probability from 1/m to 1/(m₁ × m₂), making false positives virtually impossible for practical string matching applications.
Introduction to String Hashing
String comparison is expensive. Comparing two strings of length n requires O(n) time in the worst case. When you need to find a pattern in text, check for duplicates in a collection, or build a hash table with string keys, these linear comparisons add up fast.
String hashing solves this by converting strings to integers. Once you have integers, comparison becomes O(1). The challenge is designing a hash function that minimizes collisions while remaining fast to compute.
Polynomial rolling hash is the workhorse algorithm for this problem. It’s the foundation of Rabin-Karp string matching, powers plagiarism detection systems, and shows up in competitive programming constantly. Understanding it deeply will make you a better engineer.
The Polynomial Hash Function
The core idea is elegant: treat a string as a polynomial where each character is a coefficient.
For a string s of length n, the hash is:
hash(s) = s[0] + s[1]*p + s[2]*p² + s[3]*p³ + ... + s[n-1]*p^(n-1) mod m
Here, p is the base (a prime number) and m is the modulus (typically a large prime). Each character’s ASCII value becomes a coefficient, and its position determines the power of p.
Why this works: different strings produce different polynomials, and evaluating them at point p gives different results. The modulus keeps numbers manageable and creates a finite hash space.
Choosing constants matters:
- Base p: Should be larger than your alphabet size. For lowercase letters, p = 31 works. For all ASCII, use p = 53 or larger. Prime bases distribute hashes more uniformly.
- Modulus m: Use a large prime. Common choices are 10⁹ + 7 or 10⁹ + 9. Larger moduli mean fewer collisions but require careful overflow handling.
def polynomial_hash(s: str, p: int = 31, m: int = 10**9 + 9) -> int:
"""Compute polynomial hash of a string."""
hash_value = 0
p_pow = 1
for char in s:
# Subtract 'a' to get values 0-25 for lowercase letters
# Add 1 to avoid 0 (which would make leading chars meaningless)
char_value = ord(char) - ord('a') + 1
hash_value = (hash_value + char_value * p_pow) % m
p_pow = (p_pow * p) % m
return hash_value
This implementation runs in O(n) time and uses O(1) space. Note the + 1 offset—without it, strings with leading ‘a’ characters would hash identically to shorter strings (since 0 × p^k = 0).
The Rolling Property
Here’s where polynomial hashing becomes powerful. When you slide a window across a string, you don’t need to recompute the entire hash. You can update it in O(1).
Consider a window of length L. When you slide it one position right:
- Remove the leftmost character’s contribution
- Divide by p (shift all remaining terms down one power)
- Add the new rightmost character at position L-1
The formula for updating hash when moving from position i to i+1:
new_hash = (old_hash - s[i] * 1) / p + s[i+L] * p^(L-1)
Division by p in modular arithmetic requires the modular multiplicative inverse. Alternatively, we can restructure to avoid division:
def rolling_hash_windows(s: str, window_len: int, p: int = 31, m: int = 10**9 + 9) -> list:
"""Generate hashes for all windows of given length using rolling hash."""
if len(s) < window_len:
return []
hashes = []
# Precompute p^window_len for removing leftmost character
p_pow_window = pow(p, window_len, m)
# Compute hash of first window
current_hash = 0
p_pow = 1
for i in range(window_len):
char_value = ord(s[i]) - ord('a') + 1
current_hash = (current_hash + char_value * p_pow) % m
p_pow = (p_pow * p) % m
hashes.append(current_hash)
# Roll through remaining windows
p_pow = 1
for i in range(1, len(s) - window_len + 1):
# Remove contribution of outgoing character (at position i-1)
outgoing = ord(s[i - 1]) - ord('a') + 1
current_hash = (current_hash - outgoing + m) % m
# Divide by p (multiply by modular inverse)
# For simplicity, we use a different formulation:
# Multiply by p_inverse, which is pow(p, m-2, m) by Fermat's little theorem
p_inverse = pow(p, m - 2, m)
current_hash = (current_hash * p_inverse) % m
# Add contribution of incoming character
incoming = ord(s[i + window_len - 1]) - ord('a') + 1
current_hash = (current_hash + incoming * pow(p, window_len - 1, m)) % m
hashes.append(current_hash)
return hashes
Each window after the first takes O(1) time to compute (assuming precomputed powers). For n windows, total time is O(n).
Substring Hash Computation
Often you need hashes of arbitrary substrings, not just sliding windows. The prefix hash technique enables O(1) substring hash queries after O(n) preprocessing.
Compute prefix hashes where prefix[i] is the hash of s[0..i-1]. Then the hash of substring s[l..r] can be derived from prefix[r+1] - prefix[l], adjusted for the power offset.
class SubstringHasher:
"""Precompute prefix hashes for O(1) substring hash queries."""
def __init__(self, s: str, p: int = 31, m: int = 10**9 + 9):
self.s = s
self.p = p
self.m = m
self.n = len(s)
# Precompute prefix hashes
self.prefix = [0] * (self.n + 1)
# Precompute powers of p
self.p_pow = [1] * (self.n + 1)
for i in range(self.n):
char_value = ord(s[i]) - ord('a') + 1
self.prefix[i + 1] = (self.prefix[i] + char_value * self.p_pow[i]) % m
self.p_pow[i + 1] = (self.p_pow[i] * p) % m
def get_hash(self, left: int, right: int) -> int:
"""Get hash of substring s[left:right+1] in O(1)."""
# Raw difference of prefix hashes
raw_hash = (self.prefix[right + 1] - self.prefix[left] + self.m) % self.m
# Normalize by dividing out p^left (multiply by inverse)
p_inv = pow(self.p_pow[left], self.m - 2, self.m)
return (raw_hash * p_inv) % self.m
def compare_substrings(self, l1: int, r1: int, l2: int, r2: int) -> bool:
"""Check if two substrings are equal via hash comparison."""
if r1 - l1 != r2 - l2:
return False
return self.get_hash(l1, r1) == self.get_hash(l2, r2)
This is invaluable for problems involving multiple substring comparisons—longest common substring, palindrome detection, and string matching all benefit.
Practical Application: Rabin-Karp Pattern Matching
Rabin-Karp uses rolling hash to find pattern occurrences in text. Compute the pattern’s hash once, then roll through the text comparing hashes. When hashes match, verify with direct comparison to handle collisions.
def rabin_karp(text: str, pattern: str, p: int = 31, m: int = 10**9 + 9) -> list:
"""Find all occurrences of pattern in text using Rabin-Karp."""
n, k = len(text), len(pattern)
if k > n:
return []
# Compute pattern hash
pattern_hash = 0
p_pow = 1
for i, char in enumerate(pattern):
char_value = ord(char) - ord('a') + 1
pattern_hash = (pattern_hash + char_value * p_pow) % m
if i < k - 1:
p_pow = (p_pow * p) % m
# Precompute powers
p_pows = [1] * (n + 1)
for i in range(1, n + 1):
p_pows[i] = (p_pows[i-1] * p) % m
# Compute prefix hashes of text
prefix = [0] * (n + 1)
for i in range(n):
char_value = ord(text[i]) - ord('a') + 1
prefix[i + 1] = (prefix[i] + char_value * p_pows[i]) % m
matches = []
p_inv_base = pow(p, m - 2, m)
for i in range(n - k + 1):
# Hash of text[i:i+k]
raw_hash = (prefix[i + k] - prefix[i] + m) % m
window_hash = (raw_hash * pow(p_pows[i], m - 2, m)) % m
if window_hash == pattern_hash:
# Verify to handle collision
if text[i:i+k] == pattern:
matches.append(i)
return matches
Average case is O(n + m). Worst case degrades to O(nm) when every window produces a hash collision, but this is rare with good constants.
Collision Handling and Double Hashing
The birthday paradox bites hard. With modulus m, after about √m strings, you expect a collision. For m = 10⁹, that’s only ~31,000 strings before collisions become likely.
Double hashing uses two independent hash functions with different bases and moduli. A false positive requires collision in both—probability drops from 1/m to 1/(m₁ × m₂).
class DoubleHash:
"""Double hashing for reduced collision probability."""
def __init__(self, s: str):
self.s = s
self.n = len(s)
# Two independent hash functions
self.p1, self.m1 = 31, 10**9 + 9
self.p2, self.m2 = 37, 10**9 + 7
# Precompute for both
self.prefix1 = self._build_prefix(self.p1, self.m1)
self.prefix2 = self._build_prefix(self.p2, self.m2)
self.pow1 = self._build_powers(self.p1, self.m1)
self.pow2 = self._build_powers(self.p2, self.m2)
def _build_prefix(self, p: int, m: int) -> list:
prefix = [0] * (self.n + 1)
p_pow = 1
for i, char in enumerate(self.s):
char_value = ord(char) - ord('a') + 1
prefix[i + 1] = (prefix[i] + char_value * p_pow) % m
p_pow = (p_pow * p) % m
return prefix
def _build_powers(self, p: int, m: int) -> list:
powers = [1] * (self.n + 1)
for i in range(1, self.n + 1):
powers[i] = (powers[i-1] * p) % m
return powers
def get_hash_pair(self, left: int, right: int) -> tuple:
"""Return both hashes as a tuple for comparison."""
h1 = self._get_single_hash(left, right, self.prefix1, self.pow1, self.m1)
h2 = self._get_single_hash(left, right, self.prefix2, self.pow2, self.m2)
return (h1, h2)
def _get_single_hash(self, left: int, right: int, prefix: list,
powers: list, m: int) -> int:
raw = (prefix[right + 1] - prefix[left] + m) % m
return (raw * pow(powers[left], m - 2, m)) % m
With m₁ × m₂ ≈ 10¹⁸, false positives become astronomically unlikely. This is the production-grade approach.
Performance Considerations and Pitfalls
Overflow: In languages without arbitrary precision integers, 64-bit overflow is a real concern. In C++, use unsigned long long and be careful with intermediate products. Python handles this automatically.
Modular inverse cost: Computing pow(x, m-2, m) is O(log m). If you’re computing many substring hashes, precompute inverses or restructure to avoid them.
When not to use polynomial hashing: For hash tables with untrusted input, adversaries can craft collision attacks. Use cryptographic hashes (SHA-256) for security-critical applications. Polynomial hashing is for performance, not security.
Alternative approaches: Suffix arrays and suffix trees offer O(1) substring comparison after O(n) or O(n log n) preprocessing, with no false positives. They’re more complex but deterministically correct.
Polynomial rolling hash remains the go-to choice when you need fast, simple string hashing with controllable collision probability. Master it, and you’ll reach for it constantly.