Boyer-Moore Algorithm: Efficient String Search

Key Insights

Boyer-Moore searches patterns right-to-left, enabling it to skip large portions of text and achieve sublinear average-case performance—often examining only n/m characters instead of all n.
The algorithm combines two heuristics: the bad character rule (shift based on mismatched characters) and the good suffix rule (shift based on already-matched suffixes), taking the maximum shift from both.
For most real-world text search scenarios with reasonably long patterns, Boyer-Moore outperforms alternatives like KMP and Rabin-Karp, which is why it powers tools like grep and text editors.

The Problem with Naive String Search

Every programmer has written a nested loop to find a substring. You slide the pattern across the text, comparing character by character. It works, but it’s O(nm) where n is text length and m is pattern length. Search for a 100-character pattern in a 10MB log file, and you’re looking at up to a billion comparisons.

In 1977, Robert Boyer and J Strother Moore published an algorithm that flipped conventional thinking. Instead of trying to match faster, they asked: how can we skip more? The result is an algorithm that often examines only a fraction of the text’s characters.

Boyer-Moore powers real-world tools you use daily. The grep utility uses it (or variants). Text editors rely on it for find-and-replace. Plagiarism detection systems, DNA sequence matching, and intrusion detection systems all benefit from its efficiency.

The Core Insight: Search Right-to-Left

The naive approach compares pattern characters left-to-right. Boyer-Moore compares right-to-left—from the end of the pattern toward the beginning. This seemingly minor change enables dramatic optimizations.

Consider searching for “EXAMPLE” in a text. When you align the pattern and compare the last character ‘E’ against the text, a mismatch tells you something valuable. If the text character is ‘Z’, which doesn’t appear anywhere in “EXAMPLE”, you can shift the entire pattern past that position—seven characters at once instead of one.

def naive_right_to_left_compare(text: str, pattern: str, start: int) -> bool:
    """Compare pattern against text starting at 'start', going right-to-left."""
    m = len(pattern)
    for i in range(m - 1, -1, -1):
        if text[start + i] != pattern[i]:
            return False
    return True

This right-to-left comparison is the foundation. But the real power comes from two preprocessing heuristics that determine how far to shift after a mismatch.

The Bad Character Rule

When a mismatch occurs, look at the text character that caused it. The bad character rule asks: does this character appear elsewhere in the pattern? If so, shift the pattern to align that occurrence with the mismatched position. If not, shift the pattern completely past the mismatch.

For pattern “EXAMPLE”, we precompute the rightmost position of each character:

def build_bad_character_table(pattern: str) -> dict:
    """
    Build table mapping each character to its rightmost position in pattern.
    Characters not in pattern are implicitly -1 (not stored).
    """
    table = {}
    for i, char in enumerate(pattern):
        table[char] = i
    return table


def bad_character_shift(pattern: str, bad_char_table: dict, 
                         mismatch_pos: int, text_char: str) -> int:
    """
    Calculate shift based on bad character rule.
    
    mismatch_pos: index in pattern where mismatch occurred
    text_char: the character in text that caused mismatch
    """
    last_occurrence = bad_char_table.get(text_char, -1)
    
    # Shift pattern so last_occurrence aligns with mismatch position
    # If character doesn't exist or is to the right, shift by at least 1
    shift = mismatch_pos - last_occurrence
    return max(1, shift)

For “EXAMPLE”: E→6, X→1, A→2, M→3, P→4, L→5. If we mismatch at position 6 (the final ‘E’) against text character ‘A’, we shift so position 2 (where ‘A’ appears in pattern) aligns with the mismatch point—a shift of 4 positions.

The Good Suffix Rule

The good suffix rule is more complex but equally powerful. When a mismatch occurs after matching some suffix of the pattern, we can use that matched suffix to determine the shift.

Two cases apply:

The matched suffix appears elsewhere in the pattern. Shift to align that earlier occurrence with our matched portion.
A prefix of the pattern matches a suffix of our matched portion. Shift to align that prefix.

def build_good_suffix_table(pattern: str) -> list:
    """
    Build the good suffix shift table.
    
    Returns a list where table[i] is the shift when mismatch occurs at position i
    (meaning pattern[i+1:] matched successfully).
    """
    m = len(pattern)
    table = [0] * (m + 1)
    
    # suffix[i] = length of longest suffix of pattern ending at i
    # that is also a suffix of the entire pattern
    suffix = compute_suffix_array(pattern)
    
    # Case 2: prefix matches suffix
    # Fill with the shift to align longest matching prefix
    j = 0
    for i in range(m - 1, -1, -1):
        if suffix[i] == i + 1:  # pattern[0:i+1] is a suffix of pattern
            while j < m - 1 - i:
                if table[j] == 0:
                    table[j] = m - 1 - i
                j += 1
    
    # Case 1: suffix appears elsewhere in pattern
    for i in range(m - 1):
        table[m - 1 - suffix[i]] = m - 1 - i
    
    return table


def compute_suffix_array(pattern: str) -> list:
    """Compute suffix lengths for good suffix preprocessing."""
    m = len(pattern)
    suffix = [0] * m
    suffix[m - 1] = m
    
    g = m - 1
    f = 0
    
    for i in range(m - 2, -1, -1):
        if i > g and suffix[i + m - 1 - f] < i - g:
            suffix[i] = suffix[i + m - 1 - f]
        else:
            g = min(g, i)
            f = i
            while g >= 0 and pattern[g] == pattern[g + m - 1 - f]:
                g -= 1
            suffix[i] = f - g
    
    return suffix

Complete Implementation

The full algorithm preprocesses both tables, then scans the text while taking the maximum shift from either heuristic:

def boyer_moore_search(text: str, pattern: str) -> list:
    """
    Find all occurrences of pattern in text using Boyer-Moore algorithm.
    Returns list of starting indices where pattern is found.
    """
    n, m = len(text), len(pattern)
    
    if m == 0 or m > n:
        return []
    
    # Preprocessing
    bad_char = build_bad_character_table(pattern)
    good_suffix = build_good_suffix_table(pattern)
    
    matches = []
    s = 0  # shift of pattern with respect to text
    
    while s <= n - m:
        j = m - 1  # start comparing from end of pattern
        
        # Compare right-to-left until mismatch or complete match
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        
        if j < 0:
            # Pattern found at position s
            matches.append(s)
            # Shift for next potential match
            s += good_suffix[0] if s + m < n else 1
        else:
            # Mismatch: take maximum of both heuristics
            bc_shift = bad_character_shift(pattern, bad_char, j, text[s + j])
            gs_shift = good_suffix[j + 1] if j + 1 < len(good_suffix) else 1
            s += max(bc_shift, gs_shift)
    
    return matches

Let’s trace through searching for “ABC” in “ABAAABCD”:

Text:    A B A A A B C D
Pattern: A B C
Position 0: Compare C vs A → mismatch at j=2
  Bad char: 'A' at position 0 in pattern, shift = 2-0 = 2
  Shift by 2

Text:    A B A A A B C D
Pattern:     A B C
Position 2: Compare C vs A → mismatch at j=2
  Bad char: 'A' at position 0, shift = 2
  Shift by 2

Text:    A B A A A B C D
Pattern:         A B C
Position 4: Compare C vs C ✓, B vs B ✓, A vs A ✓
  Match found at position 4!

Performance Analysis

Boyer-Moore’s performance characteristics:

Best case: O(n/m). When mismatches occur at the pattern’s end and the mismatched character doesn’t appear in the pattern, we skip m characters each time.
Average case: Sublinear. For natural language text and reasonably long patterns, performance is typically O(n/m) to O(n).
Worst case: O(nm). Pathological cases like searching for “aaa” in “aaaaaaa” degrade to quadratic time.

Benchmarks on English text (searching for 20-character patterns in 1MB files) typically show Boyer-Moore examining only 15-25% of characters, compared to 100% for naive search and KMP.

Boyer-Moore excels with longer patterns and larger alphabets. It struggles with very short patterns (less than 4-5 characters) where preprocessing overhead isn’t amortized, and with small alphabets (like DNA’s ACGT) where the bad character rule provides minimal shifts.

Practical Considerations and Variants

The full Boyer-Moore algorithm’s good suffix preprocessing is complex. In 1980, Nigel Horspool proposed a simplification that uses only the bad character rule with a twist: always use the text character aligned with the pattern’s last position for shift calculation.

def boyer_moore_horspool(text: str, pattern: str) -> list:
    """
    Simplified Boyer-Moore using only bad character heuristic.
    Often faster in practice due to simpler preprocessing and inner loop.
    """
    n, m = len(text), len(pattern)
    
    if m == 0 or m > n:
        return []
    
    # Build shift table: how far to shift when character is seen
    # at position m-1 (aligned with pattern's end)
    shift = {char: m for char in set(text)}  # default: full pattern length
    for i in range(m - 1):  # exclude last character
        shift[pattern[i]] = m - 1 - i
    
    matches = []
    s = 0
    
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        
        if j < 0:
            matches.append(s)
            s += 1  # simple shift after match
        else:
            s += shift.get(text[s + m - 1], m)
    
    return matches

Horspool often outperforms full Boyer-Moore in practice. The simpler inner loop and reduced preprocessing overhead compensate for occasionally smaller shifts. For most applications, start with Horspool and only implement full Boyer-Moore if benchmarks justify it.

Memory considerations: the bad character table for full Unicode requires significant space. For ASCII (256 characters), it’s trivial. For Unicode, use a hash map with a default shift value, as shown in the implementations above.

When choosing a string search algorithm, consider your constraints. Boyer-Moore variants are ideal for searching long patterns in large texts with sizeable alphabets. For multiple pattern search, consider Aho-Corasick. For approximate matching, look at bitap or suffix arrays. But for the common case of finding a single pattern in text, Boyer-Moore remains the practical choice that powers the tools we rely on daily.