Aho-Corasick Algorithm: Multi-Pattern Matching
You need to scan a document for 10,000 banned words. Or detect any of 50,000 malware signatures in a binary. Or find all occurrences of thousands of DNA motifs in a genome. The naive approach—running...
Key Insights
- Aho-Corasick searches for multiple patterns simultaneously in O(n + m + z) time, where n is text length, m is total pattern length, and z is match count—dramatically outperforming naive approaches that scale linearly with pattern count.
- The algorithm combines a trie with failure links (similar to KMP) to avoid backtracking, processing each input character exactly once regardless of how many patterns you’re searching for.
- Real-world systems like intrusion detection (Snort), antivirus scanners, and DNA sequencing tools rely on Aho-Corasick because it handles thousands of patterns without performance degradation.
The Multi-Pattern Matching Problem
You need to scan a document for 10,000 banned words. Or detect any of 50,000 malware signatures in a binary. Or find all occurrences of thousands of DNA motifs in a genome. The naive approach—running a single-pattern search for each pattern—kills performance.
Consider the math. Searching for one pattern of length m in text of length n takes O(n × m) with naive string matching. For k patterns, you’re looking at O(n × m × k). With 10,000 patterns averaging 20 characters in a 1MB document, that’s roughly 200 billion character comparisons. Unacceptable.
Alfred Aho and Margaret Corasick solved this in 1975 while working at Bell Labs. Their insight: build a finite automaton from all patterns, then scan the text exactly once. The automaton handles all patterns simultaneously, giving you O(n + m + z) complexity where z is the number of matches found. Pattern count doesn’t appear in the complexity—that’s the breakthrough.
Core Concepts: Automaton Structure
Aho-Corasick builds on the trie data structure. A trie stores patterns as paths from root to leaf, sharing common prefixes. If you’re searching for “he”, “her”, “his”, and “she”, the trie shares the ‘h’ prefix between the first three patterns.
The algorithm adds two critical components to the basic trie:
Goto function: Standard trie transitions. From any state, given a character, move to the appropriate child (or fail).
Failure function: When a goto transition doesn’t exist, where do we jump? The failure link points to the longest proper suffix of the current path that’s also a prefix of some pattern. This is the KMP trick generalized to multiple patterns.
Output function: Which patterns match at each state? A state might match multiple patterns if one pattern is a suffix of another (“he” matches when we reach the “he” state in “she”).
Here’s the basic node structure:
class AhoCorasickNode:
def __init__(self):
self.children = {} # char -> node
self.failure = None # failure link
self.output = [] # patterns that match at this state
self.depth = 0 # depth in trie (useful for debugging)
Pattern insertion is standard trie insertion:
def insert(self, root, pattern, pattern_id):
node = root
for char in pattern:
if char not in node.children:
node.children[char] = AhoCorasickNode()
node.children[char].depth = node.depth + 1
node = node.children[char]
node.output.append((pattern_id, pattern))
Building the Automaton
Construction happens in phases. First, build the trie. Second, compute failure links. Third, propagate output functions.
Phase 1: Trie Construction
Insert all patterns into the trie. Nothing special here—standard trie building with O(m) complexity where m is total pattern length.
Phase 2: Failure Links via BFS
This is where the magic happens. We compute failure links level by level using BFS. The key insight: a node’s failure link depends only on its parent’s failure link, so BFS guarantees we process parents before children.
from collections import deque
def build_failure_links(self, root):
queue = deque()
# Initialize: depth-1 nodes fail to root
for char, child in root.children.items():
child.failure = root
queue.append(child)
# BFS to compute remaining failure links
while queue:
current = queue.popleft()
for char, child in current.children.items():
queue.append(child)
# Follow failure links until we find a match or hit root
failure = current.failure
while failure is not None and char not in failure.children:
failure = failure.failure
if failure is None:
child.failure = root
else:
child.failure = failure.children[char]
# Merge output: if failure state matches patterns, so do we
child.output = child.output + child.failure.output
The output merging in the last line is crucial. If we’re at state “she” and our failure link points to “he”, any match of “he” is also a match at “she”. We propagate these during construction rather than chasing links during search.
Phase 3: Output Propagation
The code above handles this inline. After computing each failure link, we append the failure state’s output to the current state’s output. This means during search, we only need to check the current state’s output list—no link chasing required.
The Search Algorithm
With the automaton built, searching is straightforward. Start at root. For each input character, follow goto transitions when possible, failure links otherwise. At each state, report any patterns in the output list.
def search(self, root, text):
matches = [] # List of (position, pattern_id, pattern)
state = root
for i, char in enumerate(text):
# Follow failure links until we can take a goto transition
while state is not None and char not in state.children:
state = state.failure
if state is None:
state = root
continue
state = state.children[char]
# Report all matches at this state
for pattern_id, pattern in state.output:
match_start = i - len(pattern) + 1
matches.append((match_start, pattern_id, pattern))
return matches
Complexity breakdown: We process each of the n characters exactly once for the main loop. Failure link traversals seem like they could add overhead, but amortized analysis shows total failure link follows is O(n). Each follow decreases depth, and depth only increases by 1 per character. Output reporting takes O(z) for z matches. Construction is O(m) for m total pattern characters. Total: O(n + m + z).
Full Implementation
Here’s a complete, production-ready implementation:
from collections import deque
from typing import List, Tuple
class AhoCorasick:
def __init__(self):
self.root = self._create_node()
self._built = False
def _create_node(self):
return {
'children': {},
'failure': None,
'output': []
}
def add_pattern(self, pattern: str, pattern_id: int = None):
if self._built:
raise RuntimeError("Cannot add patterns after build()")
if pattern_id is None:
pattern_id = id(pattern)
node = self.root
for char in pattern:
if char not in node['children']:
node['children'][char] = self._create_node()
node = node['children'][char]
node['output'].append((pattern_id, pattern))
def build(self):
queue = deque()
# Depth-1 nodes fail to root
for child in self.root['children'].values():
child['failure'] = self.root
queue.append(child)
# BFS for remaining nodes
while queue:
current = queue.popleft()
for char, child in current['children'].items():
queue.append(child)
failure = current['failure']
while failure is not None and char not in failure['children']:
failure = failure['failure']
child['failure'] = failure['children'][char] if failure else self.root
child['output'] = child['output'] + child['failure']['output']
self._built = True
def search(self, text: str) -> List[Tuple[int, int, str]]:
if not self._built:
raise RuntimeError("Must call build() before search()")
matches = []
state = self.root
for i, char in enumerate(text):
while state is not None and char not in state['children']:
state = state['failure']
state = state['children'][char] if state else self.root
if state is None:
state = self.root
continue
for pattern_id, pattern in state['output']:
matches.append((i - len(pattern) + 1, pattern_id, pattern))
return matches
# Usage example
ac = AhoCorasick()
ac.add_pattern("he", 1)
ac.add_pattern("she", 2)
ac.add_pattern("his", 3)
ac.add_pattern("hers", 4)
ac.build()
text = "ushers"
for pos, pid, pattern in ac.search(text):
print(f"Found '{pattern}' at position {pos}")
# Output:
# Found 'she' at position 1
# Found 'he' at position 2
# Found 'hers' at position 2
Practical Applications & Optimizations
Intrusion Detection: Snort IDS uses Aho-Corasick to match thousands of attack signatures against network traffic in real-time. Every packet gets scanned against the full signature database in one pass.
Content Filtering: URL blocklists, profanity filters, and spam detection all benefit from multi-pattern matching. Here’s a practical URL checker:
class URLBlocklist:
def __init__(self, blocked_domains: List[str]):
self.ac = AhoCorasick()
for i, domain in enumerate(blocked_domains):
self.ac.add_pattern(domain.lower(), i)
self.ac.build()
def is_blocked(self, url: str) -> bool:
return len(self.ac.search(url.lower())) > 0
def get_violations(self, text: str) -> List[str]:
matches = self.ac.search(text.lower())
return list(set(pattern for _, _, pattern in matches))
# Block malicious domains
blocklist = URLBlocklist([
"malware.com", "phishing.net", "evil.org"
])
print(blocklist.is_blocked("https://malware.com/payload")) # True
print(blocklist.get_violations("Visit evil.org or malware.com"))
# ['evil.org', 'malware.com']
Memory Optimizations: For large pattern sets, consider array-based transitions instead of hash maps (faster but uses more memory for sparse alphabets), or double-array tries for compression. The pyahocorasick library implements these optimizations in C.
Alternatives: Rabin-Karp handles multiple patterns but with probabilistic guarantees. Regex engines are more flexible but slower for literal string matching. Use Aho-Corasick when you have many fixed strings to find.
Conclusion
Choose Aho-Corasick when you’re searching for multiple literal strings and performance matters. The algorithm shines with hundreds or thousands of patterns—the more patterns, the bigger your advantage over naive approaches.
For production use, don’t roll your own. Use pyahocorasick in Python (C extension, battle-tested), the aho-corasick crate in Rust, or Intel’s Hyperscan for extreme performance requirements. These implementations include optimizations that take years to get right.
The core insight—building a failure-linked automaton to process input once regardless of pattern count—remains one of the most elegant solutions in string algorithms. Forty-nine years after publication, it’s still the go-to choice for multi-pattern matching.