Trie-Based Pattern Matching: Multiple Pattern Search
Key Insights
- Aho-Corasick transforms multiple pattern search from O(n × k) per-pattern scans (n = text length, k = number of patterns) to O(n + m + z), where m is the total length of all patterns and z the number of matches—making it essential for scanning text against large pattern dictionaries like spam filters or intrusion detection systems.
- The algorithm’s power comes from failure links that eliminate backtracking—once you’ve read a character from the input, you never read it again, regardless of how many patterns you’re matching.
- For pattern sets under 50-100 entries, simpler approaches often win; Aho-Corasick shines when you have hundreds or thousands of patterns and need to scan large volumes of text.
Introduction to Multi-Pattern Search
You have a list of 10,000 banned words and need to scan every user comment for violations. The naive approach—running a single-pattern search algorithm 10,000 times per comment—is computationally brutal. If each comment averages 500 characters and you use a linear search, you’re looking at 5 million character comparisons per comment. At scale, this becomes untenable.
Multi-pattern search algorithms solve this by matching all patterns simultaneously in a single pass through the text. Instead of asking “does this text contain pattern A?” then “does it contain pattern B?” and so on, you ask “which patterns from this set does this text contain?” and get all answers at once.
This matters in real systems: spam filters checking messages against known spam signatures, intrusion detection systems scanning network packets for attack patterns, DNA sequence matching against databases of known genes, and log analyzers hunting for multiple error patterns. The Aho-Corasick algorithm, published in 1975, remains the gold standard for this problem class.
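To make the baseline concrete, here is a minimal sketch of the naive per-pattern approach (the function name is illustrative): the text is rescanned once for every pattern, so cost grows with the pattern count.

```python
def naive_multi_search(text: str, patterns: list) -> list:
    """Scan the text once per pattern: O(k) passes for k patterns."""
    matches = []
    for pattern in patterns:
        start = 0
        while True:
            idx = text.find(pattern, start)
            if idx == -1:
                break
            matches.append((idx + len(pattern) - 1, pattern))
            start = idx + 1  # advance by one to allow overlapping occurrences
    return matches

print(naive_multi_search("ushers", ["he", "she", "his", "hers"]))
# [(3, 'he'), (3, 'she'), (5, 'hers')]
```

Correct, but every pattern triggers its own pass over the text—exactly the redundancy multi-pattern algorithms eliminate.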
Trie Fundamentals Refresher
A trie (from “retrieval”) is a tree where each node represents a character, and paths from root to nodes spell out strings. Unlike hash tables that treat strings atomically, tries expose the structure of your string set, making prefix-based operations natural.
Here’s a basic trie implementation:
```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False
        self.pattern = None  # Store the actual pattern at terminal nodes

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, pattern: str) -> None:
        node = self.root
        for char in pattern:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end = True
        node.pattern = pattern

    def search(self, word: str) -> bool:
        """Return True if word was inserted as a complete pattern."""
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.is_end

    def starts_with(self, prefix: str) -> bool:
        """Return True if any inserted pattern starts with prefix."""
        node = self.root
        for char in prefix:
            if char not in node.children:
                return False
            node = node.children[char]
        return True
```
The key insight: if you insert patterns “he”, “her”, “hers”, and “she”, the trie shares the common prefix structure. Searching for any pattern starting with “he” traverses the same initial path. This sharing is what makes multi-pattern matching efficient.
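The sharing can be quantified with a tiny standalone sketch (nested dicts rather than node objects, and a hypothetical `"$"` terminal marker, purely for illustration):

```python
def build_trie(patterns):
    """Nested-dict trie; the '$' key marks a complete pattern."""
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node["$"] = p
    return root

def count_nodes(node):
    """Count trie nodes, skipping the '$' terminal markers."""
    return 1 + sum(count_nodes(child)
                   for key, child in node.items() if key != "$")

patterns = ["he", "her", "hers", "she"]
trie = build_trie(patterns)
print(sum(len(p) for p in patterns))  # 12 characters inserted...
print(count_nodes(trie) - 1)          # ...but only 7 nodes beyond the root
```

Twelve inserted characters collapse into seven nodes because "he", "her", and "hers" share one chain.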
The Aho-Corasick Algorithm
Aho-Corasick extends the trie with two additional link types that transform it into a finite automaton:
Failure links point from each node to the longest proper suffix of the current path that’s also a prefix of some pattern. When a character match fails, instead of restarting from the root, you follow the failure link and continue—preserving work already done.
Output links chain together nodes that should emit matches. If you’re at a node representing “she” and “he” is also a pattern, the output link connects them so both matches are reported.
```python
from collections import deque

class AhoCorasick:
    def __init__(self):
        # Reuses TrieNode from above; fail and output links are added here
        self.root = TrieNode()
        self.root.fail = self.root  # Root fails to itself
        self.root.output = []

    def insert(self, pattern: str) -> None:
        node = self.root
        for char in pattern:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end = True
        node.pattern = pattern

    def build(self) -> None:
        """Build failure links using BFS."""
        queue = deque()
        # Initialize: depth-1 nodes fail to root
        for char, child in self.root.children.items():
            child.fail = self.root
            child.output = []
            queue.append(child)
        # BFS to build failure links for deeper nodes
        while queue:
            current = queue.popleft()
            for char, child in current.children.items():
                queue.append(child)
                # Find failure link: walk up the failure chain
                # until we find a node with this character
                fail_node = current.fail
                while fail_node != self.root and char not in fail_node.children:
                    fail_node = fail_node.fail
                child.fail = fail_node.children.get(char, self.root)
                if child.fail is child:  # Avoid self-loops
                    child.fail = self.root
                # Build output links: collect all patterns
                # reachable via the failure chain
                child.output = []
                if child.fail.is_end:
                    child.output.append(child.fail.pattern)
                child.output.extend(child.fail.output)
```
The BFS traversal ensures we process nodes level by level. When computing a failure link for node N, we know all nodes at shallower depths already have their failure links computed. We walk up the failure chain of N’s parent until we find a node with a child matching our character, or we hit the root.
Time complexity: O(m) for construction where m is total length of all patterns, O(n) for scanning where n is text length, plus O(z) for outputting z matches. The critical property is that this is independent of the number of patterns.
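The construction can be exercised in isolation. Here is a compact, self-contained sketch (list-of-dicts states and illustrative names, not the class above) that builds failure links for the classic pattern set and prints each node's failure target:

```python
from collections import deque

def build_failure_links(patterns):
    """Build a trie plus failure links; each state records its path string."""
    states = [{'next': {}, 'fail': 0, 'path': ''}]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in states[s]['next']:
                states.append({'next': {}, 'fail': 0,
                               'path': states[s]['path'] + ch})
                states[s]['next'][ch] = len(states) - 1
            s = states[s]['next'][ch]
    queue = deque(states[0]['next'].values())  # depth-1 states fail to root
    while queue:
        s = queue.popleft()
        for ch, child in states[s]['next'].items():
            queue.append(child)
            f = states[s]['fail']
            while f and ch not in states[f]['next']:
                f = states[f]['fail']  # walk up the failure chain
            states[child]['fail'] = states[f]['next'].get(ch, 0)
    return states

states = build_failure_links(["he", "she", "his", "hers"])
for st in states[1:]:
    print(f"fail({st['path']!r}) -> {states[st['fail']]['path']!r}")
```

Running it shows, for example, that "she" fails to "he" (its longest proper suffix that is also a pattern prefix) and "hers" fails to "s".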
Walking the Automaton
Once built, the automaton processes text in a single linear scan. At each character, you either follow a child edge (match) or follow failure links until you can proceed. The key invariant: you never re-read input characters.
```python
# Method of AhoCorasick (continuing the class above)
    def search(self, text: str) -> list:
        """
        Search text for all pattern matches.
        Returns a list of (end_position, pattern) tuples.
        """
        results = []
        node = self.root
        for i, char in enumerate(text):
            # Follow failure links until we can proceed
            while node != self.root and char not in node.children:
                node = node.fail
            # Move to the next state
            node = node.children.get(char, self.root)
            # Collect the pattern ending at this position, if any
            if node.is_end:
                results.append((i, node.pattern))
            # Check output links for overlapping shorter patterns
            for pattern in getattr(node, 'output', []):
                results.append((i, pattern))
        return results
```
Consider searching “ushers” with patterns [“he”, “she”, “his”, “hers”]. When you read the ‘e’ at position 3, you reach the node for “she”; its output link also reports “he”, so both matches surface at once. The next character ‘r’ has no edge from “she”, so the automaton follows the failure link to “he” and proceeds to “her”; the final ‘s’ at position 5 completes “hers”. All matches are reported without rescanning a single character.
Overlapping matches are handled naturally. If your text is “shers” and you have patterns [“she”, “he”, “her”, “hers”], all four matches are found in one pass.
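Tying build and scan together, a compact self-contained version (dict-based transitions and an illustrative function name, condensing the class above) reproduces that behavior end to end:

```python
from collections import deque

def aho_corasick_search(text, patterns):
    """Compact Aho-Corasick: build the automaton, then scan the text once.
    Returns (end_index, pattern) tuples for every match, overlaps included."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # patterns recognized when reaching each state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0 (root)
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            queue.append(nxt)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit via failure chain
    results, s = [], 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]  # never re-reads input characters
        s = goto[s].get(ch, 0)
        for p in out[s]:
            results.append((i, p))
    return results

print(aho_corasick_search("ushers", ["he", "she", "his", "hers"]))
# [(3, 'she'), (3, 'he'), (5, 'hers')]
```

One pass over "ushers", three matches, overlaps included.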
Practical Optimizations
The hashmap-based children structure is flexible but memory-hungry. For known alphabets, array-based children dramatically reduce overhead:
```python
class ArrayTrieNode:
    """Optimized node for the lowercase ASCII alphabet."""
    ALPHABET_SIZE = 26

    def __init__(self):
        self.children = [None] * self.ALPHABET_SIZE
        self.fail = None
        self.is_end = False
        self.pattern = None
        self.output = []

    def get_child(self, char: str):
        idx = ord(char) - ord('a')
        if 0 <= idx < self.ALPHABET_SIZE:
            return self.children[idx]
        return None

    def set_child(self, char: str, node):
        idx = ord(char) - ord('a')
        if 0 <= idx < self.ALPHABET_SIZE:
            self.children[idx] = node
```
Memory comparison for 10,000 patterns averaging 10 characters:
```python
# Rough, implementation-dependent estimates (CPython, 64-bit)
hashmap_node = 184          # small non-empty dict: ~184-232 bytes
array_node_26 = 26 * 8      # 26 child pointers = 208 bytes (plus object overhead)

# For ~100,000 nodes (10k patterns x 10 chars average, ignoring prefix sharing)
hashmap_total = 100_000 * hashmap_node  # ~15-20 MB
array_total = 100_000 * array_node_26   # ~20 MB
# Similar totals at alphabet size 26, but arrays have better cache
# locality and no hashing overhead

# For smaller alphabets (DNA: 4 characters), arrays win decisively:
array_node_4 = 4 * 8        # only 32 bytes of child pointers per node
```
For DNA sequences (alphabet of 4), array nodes use 32 bytes versus 200+ for hashmaps. For large pattern sets with sparse transitions, consider hybrid approaches: arrays for the first few levels (high branching factor near root), hashmaps for deeper nodes.
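One way to realize that hybrid layout is a node that picks its children container by depth. This is a sketch under assumptions: `DENSE_DEPTH` is an illustrative tuning knob, not a standard constant, and the class name is hypothetical.

```python
class HybridNode:
    """Dense array children near the root, sparse dict children deeper."""
    DENSE_DEPTH = 2   # illustrative cutoff; tune per workload
    ALPHABET = 26

    def __init__(self, depth=0):
        self.depth = depth
        if depth < self.DENSE_DEPTH:
            self.children = [None] * self.ALPHABET  # dense: O(1) indexed access
        else:
            self.children = {}                      # sparse: pay only per edge

    def get_child(self, char):
        if isinstance(self.children, list):
            idx = ord(char) - ord('a')
            return self.children[idx] if 0 <= idx < self.ALPHABET else None
        return self.children.get(char)

    def set_child(self, char, node):
        if isinstance(self.children, list):
            self.children[ord(char) - ord('a')] = node
        else:
            self.children[char] = node

# Insert "cat": the first two levels get arrays, deeper levels get dicts
root = HybridNode()
node = root
for ch in "cat":
    nxt = node.get_child(ch)
    if nxt is None:
        nxt = HybridNode(node.depth + 1)
        node.set_child(ch, nxt)
    node = nxt
print(isinstance(root.children, list), isinstance(node.children, dict))  # True True
```

The cutoff exploits the shape of real tries: branching is heavy near the root and sparse near the leaves.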
Real-World Applications
Here’s a practical URL blocklist checker:
```python
class URLBlocklistChecker:
    def __init__(self, blocked_patterns: list):
        self.ac = AhoCorasick()
        for pattern in blocked_patterns:
            self.ac.insert(pattern.lower())
        self.ac.build()

    def check(self, url: str) -> dict:
        """Check a URL against the blocklist and return any matches."""
        url_lower = url.lower()
        matches = self.ac.search(url_lower)
        if not matches:
            return {"blocked": False, "url": url, "matches": []}
        return {
            "blocked": True,
            "url": url,
            "matches": [
                {"pattern": pattern, "position": pos}
                for pos, pattern in matches
            ],
        }

# Usage
blocklist = [
    "malware.com",
    "phishing",
    "evil-domain",
    "tracking.js",
    "cryptominer",
]
checker = URLBlocklistChecker(blocklist)

urls = [
    "https://safe-site.com/page",
    "https://malware.com/download",
    "https://example.com/phishing-attempt",
    "https://cdn.example.com/tracking.js?v=2",
]
for url in urls:
    result = checker.check(url)
    status = "BLOCKED" if result["blocked"] else "OK"
    print(f"{status}: {url}")
    if result["blocked"]:
        for match in result["matches"]:
            print(f"  - Found '{match['pattern']}' at position {match['position']}")
```
When to choose Aho-Corasick over alternatives:
- vs. repeated regex: Aho-Corasick wins when you have many literal patterns. Regex engines optimize single complex patterns but struggle with thousands of alternatives.
- vs. suffix arrays: Suffix arrays excel at finding arbitrary substrings in a fixed text. Aho-Corasick excels at finding fixed patterns in arbitrary text.
- vs. Bloom filters: Bloom filters give probabilistic membership testing. Aho-Corasick gives exact positions of exact matches.
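A quick illustration of the regex point: a single alternation scans the text once, but Python's `re.finditer` yields only non-overlapping matches, preferring the leftmost alternative—so it cannot replace Aho-Corasick when overlapping matches matter.

```python
import re

patterns = ["she", "he", "her", "hers"]
text = "shers"

# One alternation regex, one pass -- but matches cannot overlap:
rx = re.compile("|".join(map(re.escape, patterns)))
print([m.group() for m in rx.finditer(text)])  # ['she'] -- he, her, hers are missed
```

Aho-Corasick reports all four matches on the same input; the regex engine consumes "she" and never revisits those characters.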
Conclusion
Aho-Corasick is the right tool when you have a stable set of patterns and need to scan varying input text efficiently. The upfront construction cost (O(m) for total pattern length) amortizes across many searches.
For small pattern sets (under 50-100 patterns), simpler approaches often suffice—the constant factors in Aho-Corasick may not pay off. But once you’re scanning logs against hundreds of error signatures or checking content against thousands of banned phrases, nothing else comes close.
Production implementations worth considering: pyahocorasick for Python (C extension, very fast), aho-corasick crate for Rust, and esmre for Python when you need regex integration. Build the automaton once, reuse it across millions of searches, and enjoy linear-time multi-pattern matching.