Longest Repeated Substring: Suffix Array Application
The Longest Repeated Substring (LRS) problem asks a deceptively simple question: given a string, find the longest substring that appears at least twice. The substrings can overlap, which makes the...
Key Insights
- Suffix arrays combined with LCP arrays solve the Longest Repeated Substring problem in O(n log n) time and O(n) space, making them practical for strings with millions of characters.
- The maximum value in the LCP array directly gives you the length of the longest repeated substring—no complex traversal required.
- Suffix arrays offer a better memory profile than suffix trees while maintaining comparable performance, making them the preferred choice for memory-constrained environments.
Introduction & Problem Definition
The Longest Repeated Substring (LRS) problem asks a deceptively simple question: given a string, find the longest substring that appears at least twice. The substrings can overlap, which makes the problem more interesting than it first appears.
This problem shows up everywhere in practice. Plagiarism detection systems use LRS to find copied passages. Bioinformatics pipelines search for repeated DNA sequences that might indicate gene duplications or regulatory elements. Data compression algorithms like LZ77 exploit repeated substrings to achieve better compression ratios.
The naive approach is brutal: generate all O(n²) substrings, then for each substring, scan the original string to count occurrences. This gives you O(n³) time complexity with an additional O(n) for string comparisons—completely impractical for any real-world input. A 10,000-character string would require roughly 10¹² operations.
We need something smarter. Enter suffix arrays.
Suffix Array Fundamentals
A suffix array is an array of integers representing the starting positions of all suffixes of a string, sorted in lexicographical order. That’s it. No fancy tree structures, no pointers—just an array of indices.
Consider the string “banana”. Its suffixes are:
Index 0: "banana"
Index 1: "anana"
Index 2: "nana"
Index 3: "ana"
Index 4: "na"
Index 5: "a"
Sorted lexicographically:
"a" -> Index 5
"ana" -> Index 3
"anana" -> Index 1
"banana" -> Index 0
"nana" -> Index 2
"na" -> Index 4
The suffix array is therefore: [5, 3, 1, 0, 2, 4].
Here’s a straightforward implementation:
def build_suffix_array_naive(s: str) -> list[int]:
"""
Build suffix array using naive sorting.
Time: O(n² log n) - O(n log n) comparisons, each O(n)
"""
n = len(s)
# Create list of (suffix_start_index, suffix_string)
suffixes = [(i, s[i:]) for i in range(n)]
# Sort by the suffix string
suffixes.sort(key=lambda x: x[1])
# Return just the indices
return [idx for idx, _ in suffixes]
This works but has O(n² log n) complexity because each string comparison takes O(n) time. We can do better.
Efficient Suffix Array Construction
The prefix doubling algorithm constructs a suffix array in O(n log n) time. The key insight: instead of comparing entire suffixes, we compare prefixes of increasing lengths (1, 2, 4, 8, …). Each doubling phase uses the rankings from the previous phase to compute new rankings in O(n) time with radix sort.
def build_suffix_array(s: str) -> list[int]:
"""
Build suffix array using prefix doubling.
Time: O(n log n), Space: O(n)
"""
n = len(s)
if n == 0:
return []
# Initial ranking based on character values
rank = [ord(c) for c in s]
sa = list(range(n))
k = 1 # Current prefix length being compared
while k < n:
# Create sort keys: (rank[i], rank[i+k]) for position i
def sort_key(i):
first = rank[i]
second = rank[i + k] if i + k < n else -1
return (first, second)
# Sort suffix array by the compound key
sa.sort(key=sort_key)
# Compute new rankings
new_rank = [0] * n
new_rank[sa[0]] = 0
for i in range(1, n):
prev, curr = sa[i - 1], sa[i]
# Compare (rank[prev], rank[prev+k]) with (rank[curr], rank[curr+k])
prev_key = sort_key(prev)
curr_key = sort_key(curr)
new_rank[curr] = new_rank[prev] + (1 if curr_key != prev_key else 0)
rank = new_rank
# Early termination if all ranks are unique
if rank[sa[-1]] == n - 1:
break
k *= 2
return sa
For production systems handling massive strings, linear-time algorithms like SA-IS (Suffix Array by Induced Sorting) or DC3 exist. They’re more complex to implement but achieve O(n) construction time. For most applications, prefix doubling is the sweet spot between complexity and performance.
LCP Array: The Key to Finding Repeated Substrings
The Longest Common Prefix (LCP) array stores the length of the longest common prefix between each pair of adjacent suffixes in the sorted suffix array. This is where the magic happens for finding repeated substrings.
For “banana” with suffix array [5, 3, 1, 0, 2, 4]:
SA[0]=5: "a" | LCP[0]=0 (no predecessor)
SA[1]=3: "ana" | LCP[1]=1 (shares "a" with "a")
SA[2]=1: "anana" | LCP[2]=3 (shares "ana" with "ana")
SA[3]=0: "banana" | LCP[3]=0 (shares nothing with "anana")
SA[4]=2: "nana" | LCP[4]=0 (shares nothing with "banana")
SA[5]=4: "na" | LCP[5]=2 (shares "na" with "nana")
Kasai’s algorithm computes the LCP array in O(n) time using a clever observation: if we process suffixes in text order (not sorted order), the LCP values decrease by at most 1 between consecutive computations.
def build_lcp_array(s: str, sa: list[int]) -> list[int]:
"""
Build LCP array using Kasai's algorithm.
Time: O(n), Space: O(n)
"""
n = len(s)
if n == 0:
return []
# Build inverse suffix array: rank[i] = position of suffix i in SA
rank = [0] * n
for i in range(n):
rank[sa[i]] = i
lcp = [0] * n
k = 0 # Current LCP length
for i in range(n):
if rank[i] == 0:
k = 0
continue
# j is the suffix that comes before suffix i in sorted order
j = sa[rank[i] - 1]
# Extend the match
while i + k < n and j + k < n and s[i + k] == s[j + k]:
k += 1
lcp[rank[i]] = k
# Key insight: LCP can decrease by at most 1
if k > 0:
k -= 1
return lcp
Solving LRS Using Suffix Array + LCP
With both arrays in hand, finding the longest repeated substring becomes trivial: the maximum value in the LCP array is the length of the LRS. The actual substring can be extracted from the corresponding position in the suffix array.
def longest_repeated_substring(s: str) -> str:
"""
Find the longest repeated substring using suffix array + LCP array.
Time: O(n log n), Space: O(n)
"""
if len(s) < 2:
return ""
# Build suffix array and LCP array
sa = build_suffix_array(s)
lcp = build_lcp_array(s, sa)
# Find maximum LCP value
max_lcp = 0
max_idx = 0
for i in range(1, len(lcp)):
if lcp[i] > max_lcp:
max_lcp = lcp[i]
max_idx = i
if max_lcp == 0:
return ""
# Extract the substring
start = sa[max_idx]
return s[start:start + max_lcp]
# Example usage
text = "banana"
result = longest_repeated_substring(text)
print(f"LRS of '{text}': '{result}'") # Output: LRS of 'banana': 'ana'
Complexity Analysis & Comparison
Let’s break down the complexity:
| Phase | Time | Space |
|---|---|---|
| Suffix Array Construction | O(n log n) | O(n) |
| LCP Array Construction | O(n) | O(n) |
| Maximum LCP Search | O(n) | O(1) |
| Total | O(n log n) | O(n) |
Compared to alternatives:
| Approach | Time | Space | Notes |
|---|---|---|---|
| Naive | O(n³) | O(n) | Impractical for n > 1000 |
| Suffix Tree | O(n) | O(n) | Large constant factor (~20n bytes) |
| Suffix Array + LCP | O(n log n) | O(n) | ~5n bytes, cache-friendly |
Choose suffix arrays when memory matters. A suffix tree for a 1GB string needs roughly 20GB of RAM. A suffix array needs about 5GB. The O(log n) factor in construction time is negligible compared to the memory savings.
Variations & Extensions
Real applications often need more than just the single longest repeated substring. Here’s a solution that finds all repeated substrings of at least length k:
def find_repeated_substrings(s: str, min_length: int = 1) -> list[tuple[str, int]]:
"""
Find all repeated substrings of length >= min_length.
Returns list of (substring, occurrence_count) tuples.
Time: O(n log n), Space: O(n)
"""
if len(s) < 2:
return []
sa = build_suffix_array(s)
lcp = build_lcp_array(s, sa)
n = len(s)
# Track unique repeated substrings
repeated = {}
for i in range(1, n):
if lcp[i] >= min_length:
# Extract the repeated prefix
start = sa[i]
substring = s[start:start + lcp[i]]
# For counting, we need to track all occurrences
if substring not in repeated:
repeated[substring] = set()
repeated[substring].add(sa[i])
repeated[substring].add(sa[i - 1])
# Convert to list with counts, filter substrings of other substrings
result = [(sub, len(positions)) for sub, positions in repeated.items()]
result.sort(key=lambda x: (-len(x[0]), x[0])) # Sort by length desc, then alphabetically
return result
# Example: Find all repeated substrings of length >= 2
text = "abracadabra"
repeated = find_repeated_substrings(text, min_length=2)
for substring, count in repeated:
print(f"'{substring}' appears {count} times")
For finding the longest repeated substring across multiple strings, concatenate them with unique separator characters not in the alphabet, build a single suffix array, and filter LCP values to only count matches spanning different original strings.
Suffix arrays are a fundamental tool that belongs in every engineer’s toolkit. They’re simpler to implement than suffix trees, use less memory, and solve a wide class of string problems efficiently. Master them once, and you’ll find applications everywhere.