Z-Algorithm: Linear-Time Pattern Matching

Key Insights

The Z-algorithm constructs an array in O(n) time where each Z[i] represents the length of the longest substring starting at position i that matches a prefix of the string—this seemingly simple structure enables powerful pattern matching.
By concatenating pattern + sentinel + text and finding positions where Z[i] equals the pattern length, you get all pattern occurrences in linear time without the complexity of KMP’s failure function.
The algorithm’s “Z-box” technique reuses previously computed information to avoid redundant character comparisons, making it both elegant and practical for production text search implementations.

Introduction to Pattern Matching

String matching is one of computing’s fundamental problems: given a pattern of length m and a text of length n, find all occurrences of the pattern within the text. The naive approach—sliding the pattern across the text and checking each position character by character—runs in O(n×m) time. For a 1GB log file and a 100-character pattern, that’s potentially 100 billion comparisons.

This matters when you’re building search functionality into text editors, implementing grep-like utilities, or processing genomic data where both patterns and texts can be millions of characters long. You need linear time.

The Z-algorithm achieves O(n+m) complexity with an approach that’s arguably more intuitive than the Knuth-Morris-Pratt (KMP) algorithm’s failure function. While KMP builds a partial match table by analyzing how the pattern can overlap with itself, the Z-algorithm directly computes prefix match lengths—a concept that’s easier to visualize and debug. Rabin-Karp offers a different trade-off, using hashing for expected linear time but with worst-case O(n×m) performance and potential false positives.

Understanding the Z-Array

The Z-array is the heart of this algorithm. For a string S of length n, the Z-array is defined as:

Z[i] = the length of the longest substring starting at S[i] that matches a prefix of S

By convention, Z[0] is undefined or set to 0 (or sometimes n), since the entire string trivially matches itself as a prefix.

Let’s trace through the string "aabxaab":

Index:    0   1   2   3   4   5   6
String:   a   a   b   x   a   a   b
Z-array:  -   1   0   0   3   1   0

Here’s the reasoning for each position:

Z[1] = 1: Starting at index 1, we have "abxaab". Comparing with prefix "aabxaab": ‘a’ matches ‘a’, but ‘b’ ≠ ‘a’. Match length = 1.
Z[2] = 0: Starting at index 2, ‘b’ ≠ ‘a’ (first character of prefix). Match length = 0.
Z[3] = 0: Starting at index 3, ‘x’ ≠ ‘a’. Match length = 0.
Z[4] = 3: Starting at index 4, we have "aab". Comparing with prefix: ‘a’=‘a’, ‘a’=‘a’, ‘b’=‘b’, then we hit the end. Match length = 3.
Z[5] = 1: Starting at index 5, ‘a’=‘a’, but ‘b’≠‘a’. Match length = 1.
Z[6] = 0: Starting at index 6, ‘b’≠‘a’. Match length = 0.

This structure tells us everything about how the string relates to its own prefixes—information that becomes powerful for pattern matching.

The Z-Algorithm: Core Logic

Computing the Z-array naively would take O(n²) time. The Z-algorithm achieves O(n) by maintaining a “Z-box”—the interval [L, R] representing the rightmost substring we’ve found that matches a prefix of S.

The key insight: if we’re computing Z[i] and i falls within a previously computed Z-box, we already know something about the characters at position i. We can bootstrap our calculation using Z[i - L], the corresponding position within the prefix.

Here’s the algorithm:

def compute_z_array(s: str) -> list[int]:
    n = len(s)
    if n == 0:
        return []
    
    z = [0] * n
    z[0] = n  # Convention: entire string matches itself
    
    # Z-box boundaries: [left, right]
    left, right = 0, 0
    
    for i in range(1, n):
        if i > right:
            # Case 1: i is outside the Z-box
            # Start fresh comparison from position i
            left, right = i, i
            while right < n and s[right - left] == s[right]:
                right += 1
            z[i] = right - left
            right -= 1
        else:
            # Case 2: i is inside the Z-box [left, right]
            # k is the corresponding index in the prefix
            k = i - left
            
            if z[k] < right - i + 1:
                # Case 2a: Z[k] doesn't reach the Z-box boundary
                # We can directly use Z[k]
                z[i] = z[k]
            else:
                # Case 2b: Z[k] reaches or exceeds the boundary
                # We need to check beyond the Z-box
                left = i
                while right < n and s[right - left] == s[right]:
                    right += 1
                z[i] = right - left
                right -= 1
    
    return z

Let me trace through "aabxaab" to show the Z-box in action:

i=1: Outside Z-box (right=0). Compare s[0..] with s[1..]: ‘a’=‘a’, ‘a’≠‘b’. Z[1]=1. Z-box becomes [1,1].
i=2: Outside Z-box. ‘b’≠‘a’. Z[2]=0. Z-box stays [1,1].
i=3: Outside Z-box. ‘x’≠‘a’. Z[3]=0.
i=4: Outside Z-box. Compare: ‘a’=‘a’, ‘a’=‘a’, ‘b’=‘b’, end. Z[4]=3. Z-box becomes [4,6].
i=5: Inside Z-box! k=5-4=1, Z[1]=1. Since 1 < (6-5+1)=2, we use Z[1] directly. Z[5]=1.
i=6: Inside Z-box. k=6-4=2, Z[2]=0. Since 0 < (6-6+1)=1, Z[6]=0.

The Z-box lets us skip comparisons we’ve effectively already done.

Pattern Matching Using Z-Array

To find pattern P in text T, we use a clever concatenation:

combined = P + "$" + T

The sentinel character “$” (any character not in P or T) prevents false matches across the boundary. Now, any position i in the combined string where Z[i] equals len(P) indicates a match—because that position’s substring matches the entire pattern (which is the prefix).

def find_pattern(text: str, pattern: str) -> list[int]:
    if not pattern or not text:
        return []
    
    # Concatenate with sentinel
    combined = pattern + "$" + text
    z = compute_z_array(combined)
    
    pattern_len = len(pattern)
    matches = []
    
    # Check positions in the text portion
    for i in range(pattern_len + 1, len(combined)):
        if z[i] == pattern_len:
            # Convert to position in original text
            text_position = i - pattern_len - 1
            matches.append(text_position)
    
    return matches


# Example usage
text = "aabxaabxaab"
pattern = "aab"
print(find_pattern(text, pattern))  # Output: [0, 4, 8]

Time and Space Complexity Analysis

Time Complexity: O(n+m)

The linear time guarantee comes from observing that the right boundary R only moves forward. Each character comparison either:

Advances R (at most n times total), or
Uses a cached Z-value without comparison

This amortized analysis shows we perform at most O(n) comparisons.

Space Complexity: O(n+m)

We store the Z-array for the combined string. For memory-constrained applications, you can compute Z-values on-the-fly and discard them after checking, reducing space to O(m) for storing just the pattern’s Z-array—though this requires a more complex implementation.

Comparison with KMP:

Both achieve O(n+m) time. KMP’s failure function is conceptually similar but computed differently—it tracks where to resume matching after a mismatch. The Z-algorithm’s prefix-matching interpretation is often more intuitive for developers encountering these algorithms for the first time.

Practical Applications

Beyond basic substring search, the Z-algorithm shines in several domains:

Bioinformatics: DNA sequences use a 4-character alphabet (A, C, G, T). Finding gene patterns in chromosomes—strings of millions of characters—requires linear-time algorithms.

Log Analysis: Searching for error patterns across gigabytes of log files benefits from the predictable O(n) performance.

Text Editors: Implementing find-and-replace with real-time highlighting as users type.

Here’s a practical grep-like utility:

import sys
from pathlib import Path


def grep_file(filepath: Path, pattern: str, 
              context_lines: int = 0) -> None:
    """Search for pattern in file, print matching lines."""
    try:
        content = filepath.read_text()
    except (IOError, UnicodeDecodeError) as e:
        print(f"Error reading {filepath}: {e}", file=sys.stderr)
        return
    
    lines = content.split('\n')
    matches = find_pattern(content, pattern)
    
    # Map character positions to line numbers
    line_starts = [0]
    for i, line in enumerate(lines[:-1]):
        line_starts.append(line_starts[-1] + len(line) + 1)
    
    matched_lines = set()
    for pos in matches:
        # Binary search for line number
        line_num = binary_search_line(line_starts, pos)
        matched_lines.add(line_num)
    
    for line_num in sorted(matched_lines):
        print(f"{filepath}:{line_num + 1}: {lines[line_num]}")


def binary_search_line(line_starts: list[int], pos: int) -> int:
    lo, hi = 0, len(line_starts) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if line_starts[mid] <= pos:
            lo = mid
        else:
            hi = mid - 1
    return lo

Conclusion

The Z-algorithm offers a clean, linear-time solution to pattern matching with an intuitive foundation: prefix matching. Choose it when you need:

Guaranteed O(n+m) worst-case performance
An algorithm that’s easy to implement correctly
Clear debugging—Z-array values have direct interpretations

For multiple pattern matching, consider Aho-Corasick. For approximate matching, explore edit-distance algorithms. But for exact single-pattern search, the Z-algorithm is a reliable, efficient choice that belongs in every engineer’s toolkit.