LCP Array: Longest Common Prefix

The suffix array revolutionized string processing by providing a space-efficient alternative to suffix trees. But the suffix array alone is just a sorted list of suffix positions—it tells you the...

Key Insights

  • The LCP array stores the longest common prefix lengths between consecutive suffixes in a sorted suffix array, enabling O(1) lookups that power efficient string algorithms for pattern matching, duplicate detection, and substring counting.
  • Kasai’s algorithm constructs the LCP array in O(n) time by exploiting a clever invariant: when moving to the next suffix in text order, the LCP value decreases by at most one, allowing incremental computation.
  • Combining LCP arrays with Range Minimum Query structures unlocks O(1) LCP queries between any two suffixes, making complex string operations like longest repeated substring trivially efficient.

Introduction to LCP Arrays

The suffix array revolutionized string processing by providing a space-efficient alternative to suffix trees. But the suffix array alone is just a sorted list of suffix positions—it tells you the lexicographic order of suffixes but nothing about their relationships. The LCP (Longest Common Prefix) array fills this gap.

The LCP array stores, for each pair of adjacent suffixes in the sorted order, how many characters they share at their beginning. This seemingly simple addition transforms the suffix array from a basic sorting structure into a powerful tool for solving complex string problems efficiently.

If you’re working on text indexing, bioinformatics, data compression, or plagiarism detection, understanding LCP arrays isn’t optional—it’s essential.

Understanding the Fundamentals

The LCP array is defined relative to a suffix array. Given a string S of length n and its suffix array SA, the LCP array LCP has n entries where:

  • LCP[0] = 0 (by convention, or undefined)
  • LCP[i] = length of longest common prefix between suffix SA[i] and suffix SA[i-1] for i > 0

Let’s walk through the classic example with S = "banana":

Index:  0 1 2 3 4 5
String: b a n a n a

Suffixes:
0: banana
1: anana
2: nana
3: ana
4: na
5: a

Sorted suffixes (Suffix Array SA):
SA[0] = 5  ->  "a"
SA[1] = 3  ->  "ana"
SA[2] = 1  ->  "anana"
SA[3] = 0  ->  "banana"
SA[4] = 4  ->  "na"
SA[5] = 2  ->  "nana"

LCP Array:
LCP[0] = 0       (no previous suffix)
LCP[1] = 1       ("a" vs "ana" share "a")
LCP[2] = 3       ("ana" vs "anana" share "ana")
LCP[3] = 0       ("anana" vs "banana" share nothing)
LCP[4] = 0       ("banana" vs "na" share nothing)
LCP[5] = 2       ("na" vs "nana" share "na")

Here’s the basic data structure representation:

from dataclasses import dataclass
from typing import List

@dataclass
class SuffixArrayWithLCP:
    text: str
    suffix_array: List[int]  # SA[i] = starting position of i-th suffix in sorted order
    lcp_array: List[int]     # LCP[i] = LCP between SA[i] and SA[i-1]
    
    def get_suffix(self, rank: int) -> str:
        """Return the suffix at given rank in sorted order."""
        return self.text[self.suffix_array[rank]:]
    
    def get_lcp(self, rank: int) -> int:
        """Return LCP between suffix at rank and rank-1."""
        if rank == 0:
            return 0
        return self.lcp_array[rank]

Naive Construction Algorithm

The straightforward approach compares each pair of adjacent suffixes character by character. It’s inefficient but serves as a baseline and helps verify more complex implementations.

def build_suffix_array_naive(text: str) -> List[int]:
    """Build suffix array using simple sorting. O(n² log n) worst case."""
    n = len(text)
    suffixes = [(text[i:], i) for i in range(n)]
    suffixes.sort()
    return [pos for _, pos in suffixes]

def build_lcp_naive(text: str, suffix_array: List[int]) -> List[int]:
    """
    Build LCP array by comparing adjacent suffixes directly.
    Time: O(n²) worst case - each comparison can take O(n).
    """
    n = len(text)
    lcp = [0] * n
    
    for i in range(1, n):
        # Get the two adjacent suffixes in sorted order
        suffix1_start = suffix_array[i - 1]
        suffix2_start = suffix_array[i]
        
        # Compare character by character
        j = 0
        while (suffix1_start + j < n and 
               suffix2_start + j < n and
               text[suffix1_start + j] == text[suffix2_start + j]):
            j += 1
        
        lcp[i] = j
    
    return lcp

# Example usage
text = "banana"
sa = build_suffix_array_naive(text)
lcp = build_lcp_naive(text, sa)
print(f"SA:  {sa}")   # [5, 3, 1, 0, 4, 2]
print(f"LCP: {lcp}")  # [0, 1, 3, 0, 0, 2]

This works, but the O(n²) complexity becomes painful for strings longer than a few thousand characters. We need something better.

Kasai’s Algorithm (Linear Time Construction)

Kasai’s algorithm (also known as Kasai-Lee-Arimura-Arikawa-Park algorithm) constructs the LCP array in O(n) time. The key insight is brilliant: if we process suffixes in their original text order (not sorted order), we can exploit a useful property.

The Invariant: If suffix at position i has LCP value h with its predecessor in sorted order, then suffix at position i+1 has LCP value at least h-1 with its predecessor.

Why? When you move from suffix S[i:] to S[i+1:], you’re essentially removing the first character. If S[i:] shared h characters with its sorted neighbor, then S[i+1:] shares at least h-1 characters with some suffix (the relationship is preserved, minus one character).

def build_lcp_kasai(text: str, suffix_array: List[int]) -> List[int]:
    """
    Kasai's algorithm for O(n) LCP array construction.
    
    Key insight: Process suffixes in text order, not sorted order.
    The LCP value decreases by at most 1 when moving to the next
    position in the original text.
    """
    n = len(text)
    lcp = [0] * n
    
    # Build inverse suffix array: rank[i] = position of suffix i in sorted order
    rank = [0] * n
    for i in range(n):
        rank[suffix_array[i]] = i
    
    h = 0  # Current LCP length (the key variable that only decreases by 1)
    
    for i in range(n):  # Process suffixes in TEXT order
        if rank[i] > 0:
            # j is the starting position of the previous suffix in sorted order
            j = suffix_array[rank[i] - 1]
            
            # Extend the match as far as possible
            while (i + h < n and j + h < n and text[i + h] == text[j + h]):
                h += 1
            
            lcp[rank[i]] = h
            
            # Key step: decrease h by at most 1 for next iteration
            if h > 0:
                h -= 1
    
    return lcp

The algorithm is O(n) because h increases at most n times total (once per character match) and decreases at most n times (once per suffix processed). The inner while loop’s total iterations across all outer loop iterations is bounded by 2n.

Practical Applications

The LCP array shines when solving string problems that would otherwise require complex tree structures or expensive repeated computations.

Finding the Longest Repeated Substring

The longest repeated substring is simply the maximum value in the LCP array—adjacent suffixes with the longest common prefix represent repeated substrings.

def longest_repeated_substring(text: str) -> str:
    """
    Find the longest substring that appears at least twice.
    Time: O(n) after suffix array construction.
    """
    if len(text) < 2:
        return ""
    
    sa = build_suffix_array_naive(text)  # Use O(n) algorithm in production
    lcp = build_lcp_kasai(text, sa)
    
    # Find maximum LCP value
    max_lcp = 0
    max_idx = 0
    for i in range(1, len(lcp)):
        if lcp[i] > max_lcp:
            max_lcp = lcp[i]
            max_idx = i
    
    if max_lcp == 0:
        return ""
    
    # Extract the repeated substring
    start = sa[max_idx]
    return text[start:start + max_lcp]

# Example
text = "abracadabra"
result = longest_repeated_substring(text)
print(f"Longest repeated substring: '{result}'")  # "abra"

Counting Distinct Substrings

Every substring is a prefix of some suffix. The total number of substrings is n(n+1)/2, but we’re double-counting overlaps indicated by the LCP array.

def count_distinct_substrings(text: str) -> int:
    """
    Count unique substrings using LCP array.
    Formula: total_substrings - sum(LCP values)
    """
    n = len(text)
    if n == 0:
        return 0
    
    sa = build_suffix_array_naive(text)
    lcp = build_lcp_kasai(text, sa)
    
    # Total possible substrings
    total = n * (n + 1) // 2
    
    # Subtract overlaps (LCP values represent duplicate prefixes)
    duplicates = sum(lcp)
    
    return total - duplicates

text = "banana"
print(f"Distinct substrings: {count_distinct_substrings(text)}")  # 15

LCP Queries with Range Minimum Query

What if you need the LCP between two arbitrary suffixes, not just adjacent ones? The LCP between suffixes at ranks i and j (where i < j) equals the minimum value in LCP[i+1..j]. This is a classic Range Minimum Query problem.

import math

class SparseTable:
    """
    Sparse table for O(1) range minimum queries after O(n log n) preprocessing.
    """
    def __init__(self, arr: List[int]):
        n = len(arr)
        if n == 0:
            self.table = []
            self.log = []
            return
            
        k = int(math.log2(n)) + 1
        
        # table[i][j] = minimum in range starting at j with length 2^i
        self.table = [[0] * n for _ in range(k)]
        self.table[0] = arr[:]
        
        # Build sparse table
        for i in range(1, k):
            length = 1 << i
            for j in range(n - length + 1):
                self.table[i][j] = min(
                    self.table[i-1][j],
                    self.table[i-1][j + (1 << (i-1))]
                )
        
        # Precompute logs
        self.log = [0] * (n + 1)
        for i in range(2, n + 1):
            self.log[i] = self.log[i // 2] + 1
    
    def query(self, left: int, right: int) -> int:
        """Return minimum in arr[left..right] inclusive. O(1)."""
        length = right - left + 1
        k = self.log[length]
        return min(
            self.table[k][left],
            self.table[k][right - (1 << k) + 1]
        )

class LCPQueryStructure:
    """Combined suffix array + LCP array with O(1) LCP queries."""
    
    def __init__(self, text: str):
        self.text = text
        self.sa = build_suffix_array_naive(text)
        self.lcp = build_lcp_kasai(text, self.sa)
        self.rmq = SparseTable(self.lcp)
        
        # Inverse suffix array for position -> rank lookup
        self.rank = [0] * len(text)
        for i, pos in enumerate(self.sa):
            self.rank[pos] = i
    
    def lcp_of_suffixes(self, pos1: int, pos2: int) -> int:
        """Return LCP between suffix starting at pos1 and pos2. O(1)."""
        if pos1 == pos2:
            return len(self.text) - pos1
        
        r1, r2 = self.rank[pos1], self.rank[pos2]
        if r1 > r2:
            r1, r2 = r2, r1
        
        return self.rmq.query(r1 + 1, r2)

# Example usage
lcp_struct = LCPQueryStructure("banana")
print(lcp_struct.lcp_of_suffixes(1, 3))  # LCP of "anana" and "ana" = 3

Performance Considerations & Conclusion

Operation Time Complexity Space Complexity
Naive LCP construction O(n²) O(n)
Kasai’s algorithm O(n) O(n)
Sparse table build O(n log n) O(n log n)
RMQ query O(1)

When to use LCP arrays:

  • You need multiple string queries on static text
  • Memory is constrained (compared to suffix trees)
  • You’re already using suffix arrays

When to consider alternatives:

  • Dynamic text with frequent updates (use suffix trees or hash-based approaches)
  • Simple pattern matching only (KMP or Boyer-Moore suffice)
  • Very short strings (naive approaches are fine)

The LCP array is one of those data structures that seems niche until you need it—then it becomes indispensable. Master Kasai’s algorithm, understand the RMQ connection, and you’ll have a powerful tool for tackling string problems that would otherwise require much more complex solutions.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.