Suffix Array: Efficient String Data Structure

A suffix array is exactly what it sounds like: a sorted array of all suffixes of a string. Given a string of length n, you generate all n suffixes, sort them lexicographically, and store their...

Key Insights

  • Suffix arrays provide the same pattern-matching power as suffix trees while using 4-10x less memory, making them the practical choice for most string processing tasks.
  • Combined with the LCP (Longest Common Prefix) array, suffix arrays can solve complex string problems like finding longest repeated substrings in linear time.
  • Modern construction algorithms achieve O(n) time complexity, but for most practical purposes, an O(n log n) approach using optimized sorting is fast enough and far simpler to implement.

Introduction to Suffix Arrays

A suffix array is exactly what it sounds like: a sorted array of all suffixes of a string. Given a string of length n, you generate all n suffixes, sort them lexicographically, and store their starting indices. That’s it.

The concept emerged in 1990 as a space-efficient alternative to suffix trees. While suffix trees offer O(m) pattern matching (where m is pattern length), they consume significant memory—typically 10-20 bytes per character. Suffix arrays need just 4-8 bytes per character (one integer per position), making them practical for large-scale applications.

Where do suffix arrays shine? Bioinformatics tools like BWA and Bowtie use them for genome alignment. Search engines employ them for autocomplete and full-text indexing. Compression algorithms like bzip2 rely on the related Burrows-Wheeler Transform. If you’re processing strings at scale, suffix arrays belong in your toolkit.

How Suffix Arrays Work

Let’s build intuition with a concrete example. Consider the string “banana$” (the $ is a sentinel character, lexicographically smaller than all others, ensuring proper sorting).

First, enumerate all suffixes with their starting indices:

Index 0: banana$
Index 1: anana$
Index 2: nana$
Index 3: ana$
Index 4: na$
Index 5: a$
Index 6: $

Now sort these lexicographically:

Index 6: $
Index 5: a$
Index 3: ana$
Index 1: anana$
Index 0: banana$
Index 4: na$
Index 2: nana$

The suffix array is simply the sequence of starting indices: [6, 5, 3, 1, 0, 4, 2].

This sorted structure enables binary search over all suffixes. Want to find “ana”? Binary search narrows you to indices 3 and 1 in O(log n) comparisons.

Here’s a naive implementation:

def build_suffix_array_naive(text: str) -> list[int]:
    """
    Build suffix array using naive O(n² log n) approach.
    Simple but slow for large strings.
    """
    n = len(text)
    # Create list of (suffix, starting_index) pairs
    suffixes = [(text[i:], i) for i in range(n)]
    # Sort by suffix string
    suffixes.sort(key=lambda x: x[0])
    # Return just the indices
    return [idx for _, idx in suffixes]

# Example usage
text = "banana$"
sa = build_suffix_array_naive(text)
print(sa)  # [6, 5, 3, 1, 0, 4, 2]

This works but has terrible performance. Creating explicit suffix strings costs O(n²) space, and comparison-based sorting with O(n) string comparisons yields O(n² log n) time complexity.

Efficient Construction Algorithms

Production systems use linear-time algorithms. The two most notable are DC3 (also called Skew) and SA-IS (Suffix Array Induced Sorting).

SA-IS, published in 2009, is the current gold standard. It classifies each position as either S-type (suffix is smaller than the next suffix) or L-type (larger). It then recursively processes a reduced problem on special “LMS” (leftmost S-type) positions and induces the full solution. The algorithm achieves O(n) time with O(n) extra space.

For most practical applications, however, you don’t need linear construction. An O(n log n) approach using comparison sort with clever optimizations handles strings up to millions of characters efficiently:

def build_suffix_array_fast(text: str) -> list[int]:
    """
    Build suffix array in O(n log n) using doubling technique.
    Practical for strings up to ~10^7 characters.
    """
    n = len(text)
    # Initial rank based on character values
    rank = [ord(c) for c in text]
    sa = list(range(n))
    tmp = [0] * n
    k = 1
    
    while k < n:
        # Sort by (rank[i], rank[i+k]) pairs
        def compare_key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)
        
        sa.sort(key=compare_key)
        
        # Compute new ranks
        tmp[sa[0]] = 0
        for i in range(1, n):
            prev, curr = sa[i - 1], sa[i]
            prev_key = (rank[prev], rank[prev + k] if prev + k < n else -1)
            curr_key = (rank[curr], rank[curr + k] if curr + k < n else -1)
            tmp[curr] = tmp[prev] + (1 if curr_key > prev_key else 0)
        
        rank, tmp = tmp, rank
        
        # Early termination if all ranks unique
        if rank[sa[n - 1]] == n - 1:
            break
        k *= 2
    
    return sa

This doubling algorithm compares suffixes by their first k characters, then 2k, then 4k, and so on. After O(log n) iterations, we’ve compared enough characters to fully order all suffixes.

The LCP Array

The suffix array alone enables O(m log n) pattern matching. But many advanced algorithms require the LCP (Longest Common Prefix) array—for each adjacent pair in the sorted suffix array, how many characters do they share?

For “banana$” with SA = [6, 5, 3, 1, 0, 4, 2]:

SA[0]=6: $       LCP[0] = 0 (by definition)
SA[1]=5: a$      LCP[1] = 0 ($ vs a$)
SA[2]=3: ana$    LCP[2] = 1 (a$ vs ana$, share "a")
SA[3]=1: anana$  LCP[3] = 3 (ana$ vs anana$, share "ana")
SA[4]=0: banana$ LCP[4] = 0 (anana$ vs banana$)
SA[5]=4: na$     LCP[5] = 0 (banana$ vs na$)
SA[6]=2: nana$   LCP[6] = 2 (na$ vs nana$, share "na")

Kasai’s algorithm computes the LCP array in O(n) time by exploiting a key insight: if we know LCP between suffixes starting at positions i and j, the LCP between positions i+1 and j+1 is at least that value minus 1.

def build_lcp_array(text: str, sa: list[int]) -> list[int]:
    """
    Kasai's algorithm for O(n) LCP array construction.
    Requires the suffix array as input.
    """
    n = len(text)
    rank = [0] * n
    lcp = [0] * n
    
    # Compute inverse suffix array (rank)
    for i in range(n):
        rank[sa[i]] = i
    
    k = 0
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue
        
        j = sa[rank[i] - 1]  # Previous suffix in sorted order
        while i + k < n and j + k < n and text[i + k] == text[j + k]:
            k += 1
        
        lcp[rank[i]] = k
        
        # Key insight: LCP decreases by at most 1
        if k > 0:
            k -= 1
    
    return lcp

Pattern Searching with Suffix Arrays

Binary search on a suffix array finds any pattern in O(m log n) time, where m is the pattern length. Here’s a complete implementation:

def find_pattern(text: str, sa: list[int], pattern: str) -> list[int]:
    """
    Find all occurrences of pattern in text using suffix array.
    Returns list of starting positions.
    """
    n = len(text)
    m = len(pattern)
    
    # Find leftmost occurrence
    left = 0
    right = n
    while left < right:
        mid = (left + right) // 2
        suffix = text[sa[mid]:sa[mid] + m]
        if suffix < pattern:
            left = mid + 1
        else:
            right = mid
    start = left
    
    # Find rightmost occurrence
    right = n
    while left < right:
        mid = (left + right) // 2
        suffix = text[sa[mid]:sa[mid] + m]
        if suffix <= pattern:
            left = mid + 1
        else:
            right = mid
    end = left
    
    return sorted(sa[start:end])

# Example
text = "mississippi$"
sa = build_suffix_array_fast(text)
positions = find_pattern(text, sa, "issi")
print(positions)  # [1, 4]

Practical Applications

The combination of suffix array and LCP array solves numerous string problems elegantly.

Longest Repeated Substring: Simply find the maximum value in the LCP array—that’s the length of the longest substring appearing at least twice.

def longest_repeated_substring(text: str) -> str:
    """
    Find the longest substring that appears at least twice.
    """
    if len(text) < 2:
        return ""
    
    sa = build_suffix_array_fast(text)
    lcp = build_lcp_array(text, sa)
    
    max_lcp = max(lcp)
    if max_lcp == 0:
        return ""
    
    max_idx = lcp.index(max_lcp)
    start = sa[max_idx]
    return text[start:start + max_lcp]

# Example
text = "abracadabra$"
result = longest_repeated_substring(text)
print(result)  # "abra"

Longest Common Substring: Concatenate two strings with different separators, build the suffix array, then find the maximum LCP where adjacent suffixes come from different original strings.

Burrows-Wheeler Transform: The BWT used in bzip2 compression is simply the last column when you sort all rotations of a string—directly related to suffix arrays.

Performance Comparison and When to Use

Here’s the practical breakdown:

Approach Construction Query Time Space
Naive search O(1) O(nm) O(1)
Hash table O(n) O(m) average O(n)
Suffix tree O(n) O(m) O(20n)
Suffix array O(n) O(m log n) O(5n)
SA + LCP O(n) O(m + log n) O(9n)

Use suffix arrays when:

  • Memory matters (genome sequences, large document collections)
  • You need multiple pattern queries on the same text
  • You need LCP-based algorithms (longest repeated substring, etc.)
  • You want simpler implementation than suffix trees

Consider alternatives when:

  • You have few queries (naive search wins for < 10 queries)
  • You need only exact single-pattern matching (hash tables suffice)
  • You need online construction as text arrives (suffix trees handle this better)

For most string processing tasks on static text, suffix arrays hit the sweet spot of simplicity, performance, and memory efficiency. Master them, and you’ll have a powerful tool for everything from log analysis to bioinformatics.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.