KMP Algorithm: String Pattern Matching


Key Insights

  • The KMP algorithm achieves O(n+m) time complexity by preprocessing the pattern to build a failure function, eliminating the redundant character comparisons that plague naive string matching.
  • The failure function encodes the longest proper prefix that is also a suffix for each position in the pattern, allowing the algorithm to skip ahead intelligently when mismatches occur.
  • While built-in string methods are fine for simple cases, understanding KMP becomes essential when dealing with streaming data, multiple pattern occurrences, or performance-critical text processing.

Introduction to Pattern Matching

String pattern matching is one of those fundamental problems that appears everywhere in software engineering. Every time you hit Ctrl+F in your text editor, run a grep command, or search through log files, you’re invoking pattern matching algorithms. Beyond the obvious use cases, these algorithms power DNA sequence analysis in bioinformatics, plagiarism detection systems, network intrusion detection, and the autocomplete features you use daily.

The core problem is straightforward: given a text of length n and a pattern of length m, find all occurrences of the pattern within the text. What makes this interesting is the dramatic difference between naive and optimized approaches. When you’re searching through gigabytes of log files or processing real-time network traffic, the difference between O(n*m) and O(n+m) isn’t academic—it’s the difference between a system that works and one that doesn’t.
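To make the expected behavior concrete, here is a quick sketch using Python's built-in str.find (find_all is an illustrative name, not a standard function; it steps ahead by one so that overlapping occurrences are also reported):

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Reference behavior: all (possibly overlapping) match positions."""
    positions, start = [], 0
    # str.find is O(n*m) in the worst case; this just pins down the spec
    while (pos := text.find(pattern, start)) != -1:
        positions.append(pos)
        start = pos + 1  # advance by one so overlapping matches are found
    return positions

find_all("banana", "ana")  # [1, 3] — note the overlap
```

Any correct matcher, naive or optimized, should produce exactly these positions.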

The Naive Approach and Its Limitations

The brute-force approach to string matching is intuitive. Slide the pattern across the text one position at a time, comparing characters at each position until you find a mismatch or complete match.

def naive_search(text: str, pattern: str) -> list[int]:
    """
    Naive string matching with O(n*m) worst-case complexity.
    Returns list of starting indices where pattern is found.
    """
    n, m = len(text), len(pattern)
    matches = []
    
    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(i)
    
    return matches

This works, but consider what happens when searching for “AAAB” in “AAAAAAAAAB”. At each starting position, we compare three ‘A’ characters successfully before failing on ‘B’. Then we shift by one position and repeat the same comparisons. We’re throwing away information we already gathered.

The worst case occurs with patterns like “AAAA…AB” in text like “AAAA…AAB”. For a text of length n and pattern of length m, we perform nearly n*m comparisons. When n and m are both large, this becomes prohibitively expensive.

The fundamental flaw is backtracking. When a mismatch occurs, the naive algorithm resets the pattern pointer to the beginning and moves the text pointer back, re-examining characters it has already seen.
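To see the waste concretely, here is the naive search instrumented with a comparison counter (an illustrative variant, not part of the algorithm itself). On an adversarial input it performs close to n*m comparisons:

```python
def naive_search_counted(text: str, pattern: str) -> tuple[list[int], int]:
    """Naive matching, instrumented to count character comparisons."""
    n, m = len(text), len(pattern)
    matches, comparisons = [], 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        else:
            # Loop finished without a break: full match at i
            matches.append(i)
    return matches, comparisons

# Adversarial case: almost every window matches m-1 chars before failing
naive_search_counted("A" * 999 + "B", "A" * 9 + "B")
# → ([990], 9910): 991 windows, 10 comparisons each — essentially n*m work
```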

The Key Insight: Failure Function

The Knuth-Morris-Pratt algorithm’s brilliance lies in a simple observation: when a mismatch occurs, the pattern itself contains information about where to resume matching. We don’t need to backtrack in the text at all.

The failure function (also called the prefix table or partial match table) preprocesses the pattern to answer this question: “If I’ve matched j characters and then hit a mismatch, what’s the longest proper prefix of the pattern that’s also a suffix of what I’ve matched?”

A proper prefix is any prefix that isn’t the entire string. For the pattern “ABCABD”, if we’ve matched “ABCAB” and then fail on ‘D’, we notice that “AB” appears both at the start and end of “ABCAB”. We can slide the pattern forward so the prefix “AB” aligns with where the suffix “AB” was, resuming from there instead of starting over.
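The definition is easy to pin down with a brute-force check (longest_prefix_suffix is a hypothetical helper for illustration, O(m²) per call): simply try every proper prefix length, longest first.

```python
def longest_prefix_suffix(s: str) -> int:
    """Length of the longest proper prefix of s that is also a suffix."""
    # A proper prefix excludes the whole string, so start at len(s) - 1
    for k in range(len(s) - 1, 0, -1):
        if s[:k] == s[-k:]:
            return k
    return 0

longest_prefix_suffix("ABCAB")  # 2 — "AB" is both prefix and suffix
```

KMP's failure function computes exactly this value for every prefix of the pattern, but in O(m) total rather than O(m²) per position.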

def build_failure_function(pattern: str) -> list[int]:
    """
    Build the failure function (prefix table) for KMP algorithm.
    
    failure[i] = length of longest proper prefix of pattern[0:i+1]
                 that is also a suffix of pattern[0:i+1]
    
    Time complexity: O(m) where m is pattern length
    """
    m = len(pattern)
    failure = [0] * m
    
    # length of the previous longest prefix suffix
    length = 0
    i = 1
    
    while i < m:
        if pattern[i] == pattern[length]:
            # We can extend the current prefix-suffix
            length += 1
            failure[i] = length
            i += 1
        elif length > 0:
            # Mismatch: fall back to shorter prefix-suffix
            # Key insight: use the failure function on itself
            length = failure[length - 1]
            # Don't increment i; retry with shorter length
        else:
            # No prefix-suffix exists
            failure[i] = 0
            i += 1
    
    return failure

Let’s trace through building the failure function for “ABABAC”:

Index  Char  failure[i]  Explanation
  0     A        0       Single char, no proper prefix
  1     B        0       “AB” - no matching prefix/suffix
  2     A        1       “ABA” - “A” is both prefix and suffix
  3     B        2       “ABAB” - “AB” matches
  4     A        3       “ABABA” - “ABA” matches
  5     C        0       “ABABAC” - no match, fall back to 0

The failure function construction applies the partially built table to the pattern itself: the fallback length = failure[length - 1] reuses values already computed. This self-reference is what makes construction O(m) rather than O(m²).

KMP Algorithm Walkthrough

With the failure function built, the main algorithm never backtracks in the text. When a mismatch occurs at pattern position j, we use failure[j-1] to determine where to resume in the pattern.

def kmp_search(text: str, pattern: str) -> list[int]:
    """
    Knuth-Morris-Pratt string matching algorithm.
    
    Time complexity: O(n + m)
    Space complexity: O(m) for the failure function
    
    Returns list of starting indices where pattern is found.
    """
    if not pattern:
        return []
    
    n, m = len(text), len(pattern)
    failure = build_failure_function(pattern)
    matches = []
    
    i = 0  # index in text
    j = 0  # index in pattern
    
    while i < n:
        if text[i] == pattern[j]:
            # Characters match, advance both pointers
            i += 1
            j += 1
            
            if j == m:
                # Complete match found
                matches.append(i - m)
                # Continue searching using failure function
                j = failure[j - 1]
        
        elif j > 0:
            # Mismatch after some matches
            # Use failure function to skip ahead in pattern
            # Crucially: do NOT increment i
            j = failure[j - 1]
        
        else:
            # Mismatch at pattern start, advance text pointer
            i += 1
    
    return matches

Let’s trace through searching for “ABABAC” in “ABABABABAC”:

Text:    A B A B A B A B A C
Pattern: A B A B A C
         0 1 2 3 4 5

i=0, j=0: A=A ✓, i=1, j=1
i=1, j=1: B=B ✓, i=2, j=2
i=2, j=2: A=A ✓, i=3, j=3
i=3, j=3: B=B ✓, i=4, j=4
i=4, j=4: A=A ✓, i=5, j=5
i=5, j=5: B≠C ✗, j=failure[4]=3

Pattern shifts, but i stays at 5:
Text:    A B A B A B A B A C
Pattern:     A B A B A C
             0 1 2 3 4 5

i=5, j=3: B=B ✓, i=6, j=4
i=6, j=4: A=A ✓, i=7, j=5
i=7, j=5: B≠C ✗, j=failure[4]=3

Pattern shifts again:
Text:    A B A B A B A B A C
Pattern:         A B A B A C
                 0 1 2 3 4 5

i=7, j=3: B=B ✓, i=8, j=4
i=8, j=4: A=A ✓, i=9, j=5
i=9, j=5: C=C ✓, i=10, j=6

j=6=m: Match found at index 4!

Notice how the text pointer i never decreases. We examined each character of the text exactly once.
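Also notice that because j is reset via the failure function after each complete match, kmp_search reports overlapping occurrences. As a sanity check, the same positions can be reproduced with Python's re module using a zero-width lookahead (a cross-check sketch, not part of KMP itself):

```python
import re

def overlapping_matches(text: str, pattern: str) -> list[int]:
    # A zero-width lookahead makes finditer report overlapping starts,
    # matching the semantics of kmp_search above
    return [m.start() for m in re.finditer(f"(?={re.escape(pattern)})", text)]

overlapping_matches("ABABABABAC", "ABABAC")  # [4]
overlapping_matches("ABABA", "ABA")          # [0, 2] — overlapping hits
```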

Time and Space Complexity Analysis

The KMP algorithm achieves O(n+m) time complexity, broken into two phases:

Preprocessing (building failure function): O(m)

The while loop in build_failure_function might seem like it could be O(m²) due to the inner fallback, but the amortized analysis reveals otherwise. The variable length can increase at most m times (once per iteration when characters match). The fallback length = failure[length - 1] strictly decreases length. Since length can’t decrease more times than it increases, the total operations are bounded by 2m.

Searching: O(n)

Similar reasoning applies. The text pointer i advances n times. The pattern pointer j can advance at most n times (once per i advancement). The fallback j = failure[j - 1] strictly decreases j, and j can’t decrease more than it increases. Total operations: bounded by 2n.

Space complexity: O(m)

We store the failure function array of length m. The search itself uses only constant extra space.

Compare this to the naive O(n*m) approach: for a 1GB text file and a 1KB pattern, naive matching might perform up to 10¹² comparisons. KMP performs roughly 10⁹—a thousand-fold improvement.
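The 2n bound is easy to probe empirically. The sketch below (kmp_search_counted is an illustrative, self-contained variant) counts main-loop iterations, each of which performs one character comparison, and the adversarial input from earlier stays comfortably under the bound:

```python
def kmp_search_counted(text: str, pattern: str) -> tuple[list[int], int]:
    """KMP instrumented to count main-loop iterations (one comparison each)."""
    m = len(pattern)
    # Build the failure function (compact form of the earlier construction)
    failure = [0] * m
    length = 0
    for k in range(1, m):
        while length > 0 and pattern[k] != pattern[length]:
            length = failure[length - 1]
        if pattern[k] == pattern[length]:
            length += 1
        failure[k] = length

    matches, ops, i, j = [], 0, 0, 0
    while i < len(text):
        ops += 1  # every iteration either advances i or strictly decreases j
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:
                matches.append(i - m)
                j = failure[j - 1]
        elif j > 0:
            j = failure[j - 1]
        else:
            i += 1
    return matches, ops

matches, ops = kmp_search_counted("A" * 999 + "B", "A" * 9 + "B")
# matches == [990]; ops <= 2 * 1000, versus ~10,000 for the naive search
```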

Practical Applications and Variations

Beyond single-pattern matching, KMP principles extend to several practical scenarios:

def find_all_occurrences_with_context(
    text: str, 
    pattern: str, 
    context_chars: int = 20
) -> list[dict]:
    """
    Find all pattern occurrences with surrounding context.
    Useful for search result previews.
    """
    matches = kmp_search(text, pattern)
    results = []
    
    for pos in matches:
        start = max(0, pos - context_chars)
        end = min(len(text), pos + len(pattern) + context_chars)
        
        results.append({
            'position': pos,
            'context': text[start:end],
            'match_start_in_context': pos - start
        })
    
    return results


def streaming_kmp_matcher(pattern: str):
    """
    Coroutine-based KMP for streaming text processing.
    Prime with next(), then send() one character at a time;
    each send() returns a match's starting index, or None.
    Send None to terminate.
    """
    failure = build_failure_function(pattern)
    m = len(pattern)
    i = 0  # characters consumed so far
    j = 0  # current position in pattern
    result = None
    
    while True:
        # Single yield point: report the previous result, receive the next char.
        # (Yielding matches at a separate point would drop the character sent
        # on the resume that follows.)
        char = yield result
        result = None
        if char is None:
            break
        
        # On mismatch, fall back through the failure function
        while j > 0 and char != pattern[j]:
            j = failure[j - 1]
        
        if char == pattern[j]:
            j += 1
        
        if j == m:
            result = i - m + 1  # match starts here
            j = failure[j - 1]  # keep scanning for overlapping matches
        
        i += 1
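If a coroutine feels awkward, the same incremental matching can be packaged as a small class; the sketch below is illustrative (StreamMatcher and feed are made-up names, not from any library), and its entire state is just two pointers plus the precomputed failure table:

```python
class StreamMatcher:
    """Incremental KMP: feed() one character, get a match start or None."""

    def __init__(self, pattern: str):
        self.pattern = pattern
        # Build the failure table once, up front
        self.failure = [0] * len(pattern)
        length = 0
        for k in range(1, len(pattern)):
            while length > 0 and pattern[k] != pattern[length]:
                length = self.failure[length - 1]
            if pattern[k] == pattern[length]:
                length += 1
            self.failure[k] = length
        self.i = 0  # characters consumed so far
        self.j = 0  # current position in pattern

    def feed(self, char: str):
        while self.j > 0 and char != self.pattern[self.j]:
            self.j = self.failure[self.j - 1]
        if char == self.pattern[self.j]:
            self.j += 1
        pos = None
        if self.j == len(self.pattern):
            pos = self.i - len(self.pattern) + 1  # match starts here
            self.j = self.failure[self.j - 1]     # allow overlapping matches
        self.i += 1
        return pos

matcher = StreamMatcher("ABABAC")
hits = [p for p in map(matcher.feed, "ABABABABAC") if p is not None]
# hits == [4]
```

Because no text pointer ever moves backward, this works on data you can only see once: sockets, pipes, or log tails.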

Related algorithms worth knowing:

  • Boyer-Moore: Often faster in practice for large alphabets; searches right-to-left and can skip more characters
  • Rabin-Karp: Uses hashing; excellent for multiple pattern search
  • Aho-Corasick: Extends KMP concepts to search for multiple patterns simultaneously

Conclusion

The KMP algorithm represents a fundamental insight in algorithm design: preprocessing can eliminate redundant work. By investing O(m) time upfront to analyze the pattern, we avoid the backtracking that makes naive matching expensive.

Use KMP when you need guaranteed linear-time performance, when processing streaming data where you can’t backtrack in the text, or when you need to find all occurrences efficiently. For simple one-off searches, your language’s built-in string methods (which often use hybrid approaches) are perfectly fine.

The failure function concept extends beyond string matching—it’s the foundation for more sophisticated algorithms and appears in various forms throughout text processing. Understanding it deeply pays dividends across many problem domains.
