Longest Repeated Substring: Suffix Array Application

Key Insights

Suffix arrays combined with LCP arrays solve the Longest Repeated Substring problem in O(n log n) time and O(n) space, making them practical for strings with millions of characters.
The maximum value in the LCP array directly gives you the length of the longest repeated substring—no complex traversal required.
Suffix arrays offer a better memory profile than suffix trees while maintaining comparable performance, making them the preferred choice for memory-constrained environments.

Introduction & Problem Definition

The Longest Repeated Substring (LRS) problem asks a deceptively simple question: given a string, find the longest substring that appears at least twice. The substrings can overlap, which makes the problem more interesting than it first appears.

This problem shows up everywhere in practice. Plagiarism detection systems use LRS to find copied passages. Bioinformatics pipelines search for repeated DNA sequences that might indicate gene duplications or regulatory elements. Data compression algorithms like LZ77 exploit repeated substrings to achieve better compression ratios.

The naive approach is brutal: generate all O(n²) substrings, then for each substring, scan the original string to count occurrences. This gives you O(n³) time complexity with an additional O(n) for string comparisons—completely impractical for any real-world input. A 10,000-character string would require roughly 10¹² operations.

We need something smarter. Enter suffix arrays.

Suffix Array Fundamentals

A suffix array is an array of integers representing the starting positions of all suffixes of a string, sorted in lexicographical order. That’s it. No fancy tree structures, no pointers—just an array of indices.

Consider the string “banana”. Its suffixes are:

Index 0: "banana"
Index 1: "anana"
Index 2: "nana"
Index 3: "ana"
Index 4: "na"
Index 5: "a"

Sorted lexicographically:

"a"      -> Index 5
"ana"    -> Index 3
"anana"  -> Index 1
"banana" -> Index 0
"nana"   -> Index 2
"na"     -> Index 4

The suffix array is therefore: [5, 3, 1, 0, 2, 4].

Here’s a straightforward implementation:

def build_suffix_array_naive(s: str) -> list[int]:
    """
    Build suffix array using naive sorting.
    Time: O(n² log n) - O(n log n) comparisons, each O(n)
    """
    n = len(s)
    # Create list of (suffix_start_index, suffix_string)
    suffixes = [(i, s[i:]) for i in range(n)]
    # Sort by the suffix string
    suffixes.sort(key=lambda x: x[1])
    # Return just the indices
    return [idx for idx, _ in suffixes]

This works but has O(n² log n) complexity because each string comparison takes O(n) time. We can do better.

Efficient Suffix Array Construction

The prefix doubling algorithm constructs a suffix array in O(n log n) time. The key insight: instead of comparing entire suffixes, we compare prefixes of increasing lengths (1, 2, 4, 8, …). Each doubling phase uses the rankings from the previous phase to compute new rankings in O(n) time with radix sort.

def build_suffix_array(s: str) -> list[int]:
    """
    Build suffix array using prefix doubling.
    Time: O(n log n), Space: O(n)
    """
    n = len(s)
    if n == 0:
        return []
    
    # Initial ranking based on character values
    rank = [ord(c) for c in s]
    sa = list(range(n))
    k = 1  # Current prefix length being compared
    
    while k < n:
        # Create sort keys: (rank[i], rank[i+k]) for position i
        def sort_key(i):
            first = rank[i]
            second = rank[i + k] if i + k < n else -1
            return (first, second)
        
        # Sort suffix array by the compound key
        sa.sort(key=sort_key)
        
        # Compute new rankings
        new_rank = [0] * n
        new_rank[sa[0]] = 0
        
        for i in range(1, n):
            prev, curr = sa[i - 1], sa[i]
            # Compare (rank[prev], rank[prev+k]) with (rank[curr], rank[curr+k])
            prev_key = sort_key(prev)
            curr_key = sort_key(curr)
            new_rank[curr] = new_rank[prev] + (1 if curr_key != prev_key else 0)
        
        rank = new_rank
        
        # Early termination if all ranks are unique
        if rank[sa[-1]] == n - 1:
            break
            
        k *= 2
    
    return sa

For production systems handling massive strings, linear-time algorithms like SA-IS (Suffix Array by Induced Sorting) or DC3 exist. They’re more complex to implement but achieve O(n) construction time. For most applications, prefix doubling is the sweet spot between complexity and performance.

LCP Array: The Key to Finding Repeated Substrings

The Longest Common Prefix (LCP) array stores the length of the longest common prefix between each pair of adjacent suffixes in the sorted suffix array. This is where the magic happens for finding repeated substrings.

For “banana” with suffix array [5, 3, 1, 0, 2, 4]:

SA[0]=5: "a"      | LCP[0]=0 (no predecessor)
SA[1]=3: "ana"    | LCP[1]=1 (shares "a" with "a")
SA[2]=1: "anana"  | LCP[2]=3 (shares "ana" with "ana")
SA[3]=0: "banana" | LCP[3]=0 (shares nothing with "anana")
SA[4]=2: "nana"   | LCP[4]=0 (shares nothing with "banana")
SA[5]=4: "na"     | LCP[5]=2 (shares "na" with "nana")

Kasai’s algorithm computes the LCP array in O(n) time using a clever observation: if we process suffixes in text order (not sorted order), the LCP values decrease by at most 1 between consecutive computations.

def build_lcp_array(s: str, sa: list[int]) -> list[int]:
    """
    Build LCP array using Kasai's algorithm.
    Time: O(n), Space: O(n)
    """
    n = len(s)
    if n == 0:
        return []
    
    # Build inverse suffix array: rank[i] = position of suffix i in SA
    rank = [0] * n
    for i in range(n):
        rank[sa[i]] = i
    
    lcp = [0] * n
    k = 0  # Current LCP length
    
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue
        
        # j is the suffix that comes before suffix i in sorted order
        j = sa[rank[i] - 1]
        
        # Extend the match
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        
        lcp[rank[i]] = k
        
        # Key insight: LCP can decrease by at most 1
        if k > 0:
            k -= 1
    
    return lcp

Solving LRS Using Suffix Array + LCP

With both arrays in hand, finding the longest repeated substring becomes trivial: the maximum value in the LCP array is the length of the LRS. The actual substring can be extracted from the corresponding position in the suffix array.

def longest_repeated_substring(s: str) -> str:
    """
    Find the longest repeated substring using suffix array + LCP array.
    Time: O(n log n), Space: O(n)
    """
    if len(s) < 2:
        return ""
    
    # Build suffix array and LCP array
    sa = build_suffix_array(s)
    lcp = build_lcp_array(s, sa)
    
    # Find maximum LCP value
    max_lcp = 0
    max_idx = 0
    
    for i in range(1, len(lcp)):
        if lcp[i] > max_lcp:
            max_lcp = lcp[i]
            max_idx = i
    
    if max_lcp == 0:
        return ""
    
    # Extract the substring
    start = sa[max_idx]
    return s[start:start + max_lcp]


# Example usage
text = "banana"
result = longest_repeated_substring(text)
print(f"LRS of '{text}': '{result}'")  # Output: LRS of 'banana': 'ana'

Complexity Analysis & Comparison

Let’s break down the complexity:

Phase	Time	Space
Suffix Array Construction	O(n log n)	O(n)
LCP Array Construction	O(n)	O(n)
Maximum LCP Search	O(n)	O(1)
Total	O(n log n)	O(n)

Compared to alternatives:

Approach	Time	Space	Notes
Naive	O(n³)	O(n)	Impractical for n > 1000
Suffix Tree	O(n)	O(n)	Large constant factor (~20n bytes)
Suffix Array + LCP	O(n log n)	O(n)	~5n bytes, cache-friendly

Choose suffix arrays when memory matters. A suffix tree for a 1GB string needs roughly 20GB of RAM. A suffix array needs about 5GB. The O(log n) factor in construction time is negligible compared to the memory savings.

Variations & Extensions

Real applications often need more than just the single longest repeated substring. Here’s a solution that finds all repeated substrings of at least length k:

def find_repeated_substrings(s: str, min_length: int = 1) -> list[tuple[str, int]]:
    """
    Find all repeated substrings of length >= min_length.
    Returns list of (substring, occurrence_count) tuples.
    Time: O(n log n), Space: O(n)
    """
    if len(s) < 2:
        return []
    
    sa = build_suffix_array(s)
    lcp = build_lcp_array(s, sa)
    n = len(s)
    
    # Track unique repeated substrings
    repeated = {}
    
    for i in range(1, n):
        if lcp[i] >= min_length:
            # Extract the repeated prefix
            start = sa[i]
            substring = s[start:start + lcp[i]]
            
            # For counting, we need to track all occurrences
            if substring not in repeated:
                repeated[substring] = set()
            repeated[substring].add(sa[i])
            repeated[substring].add(sa[i - 1])
    
    # Convert to list with counts, filter substrings of other substrings
    result = [(sub, len(positions)) for sub, positions in repeated.items()]
    result.sort(key=lambda x: (-len(x[0]), x[0]))  # Sort by length desc, then alphabetically
    
    return result


# Example: Find all repeated substrings of length >= 2
text = "abracadabra"
repeated = find_repeated_substrings(text, min_length=2)
for substring, count in repeated:
    print(f"'{substring}' appears {count} times")

For finding the longest repeated substring across multiple strings, concatenate them with unique separator characters not in the alphabet, build a single suffix array, and filter LCP values to only count matches spanning different original strings.

Suffix arrays are a fundamental tool that belongs in every engineer’s toolkit. They’re simpler to implement than suffix trees, use less memory, and solve a wide class of string problems efficiently. Master them once, and you’ll find applications everywhere.