Suffix Trie: All Suffixes Storage

A suffix trie is a trie (prefix tree) that contains all suffixes of a given string. While a standard trie stores a collection of separate words, a suffix trie stores every possible ending of a single...

Key Insights

  • Suffix tries store every suffix of a string in a tree structure, enabling O(m) substring searches where m is the pattern length—dramatically faster than naive O(nm) approaches
  • The naive construction requires O(n²) space and time, making suffix tries impractical for very long strings but excellent for moderate-length texts requiring repeated queries
  • Understanding suffix tries provides the conceptual foundation for more advanced structures like suffix trees and suffix arrays used in production systems

Introduction to Suffix Tries

A suffix trie is a trie (prefix tree) that contains all suffixes of a given string. While a standard trie stores a collection of separate words, a suffix trie stores every possible ending of a single string. For the string “banana”, you’d store: “banana”, “anana”, “nana”, “ana”, “na”, and “a”.

This seemingly simple modification unlocks powerful capabilities. Instead of asking “does this word exist in my dictionary?”, you can now ask “does this pattern appear anywhere in my text?"—and answer it in time proportional only to the pattern length.

The structure works by sharing common prefixes among suffixes. In “banana”, both “ana” and “anana” share the prefix “ana”, so they branch from the same path in the trie. This prefix sharing is what makes the data structure efficient for queries, even though it requires storing substantial redundant information during construction.

Why Suffix Tries Matter

Suffix tries solve a fundamental problem in string processing: finding patterns within text. Consider these real-world applications:

Pattern matching in text editors. When you press Ctrl+F in your IDE, the editor needs to find all occurrences of your search term. With a suffix trie built on the document, each search completes in O(m) time regardless of document size.

DNA sequence analysis. Bioinformatics frequently requires finding short sequences (motifs) within long genomic strings. A suffix trie on a chromosome allows researchers to locate any subsequence instantly.

Autocomplete systems. When a user types a partial query, you need to find all completions. If you’ve indexed text using a suffix trie, you can find all occurrences of any substring and suggest relevant completions.

Plagiarism detection. Comparing documents for shared substrings becomes tractable when you can query substring existence in constant time per character.

The complexity advantage is substantial. Naive substring search using Python’s in operator or Java’s contains() runs in O(nm) time, where n is the text length and m is the pattern length. With a suffix trie, after O(n²) preprocessing, every search takes O(m) time. If you’re performing many searches on the same text, this tradeoff pays off quickly.

Building a Suffix Trie

Construction follows a straightforward process: generate all suffixes of the input string, then insert each one into a trie. For a string of length n, you’ll insert n suffixes with lengths ranging from n down to 1.

Here’s a Python implementation:

class SuffixTrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_suffix = False
        self.suffix_indices = []  # Track where this suffix starts in original string


class SuffixTrie:
    def __init__(self, text: str):
        self.root = SuffixTrieNode()
        self.text = text
        self._build_trie()
    
    def _build_trie(self):
        """Insert all suffixes of the text into the trie."""
        for i in range(len(self.text)):
            self._insert_suffix(i)
    
    def _insert_suffix(self, start_index: int):
        """Insert a single suffix starting at start_index."""
        node = self.root
        for char in self.text[start_index:]:
            if char not in node.children:
                node.children[char] = SuffixTrieNode()
            node = node.children[char]
            node.suffix_indices.append(start_index)
        node.is_end_of_suffix = True

The construction complexity breaks down as follows: you insert n suffixes, and the i-th suffix has length (n - i). The total characters inserted sum to n + (n-1) + (n-2) + … + 1 = n(n+1)/2, giving O(n²) time complexity. Space complexity is also O(n²) in the worst case, as each character insertion potentially creates a new node.

Core Operations

With the trie constructed, several operations become efficient.

Substring search determines whether a pattern exists anywhere in the original text. You simply walk down the trie following the pattern’s characters. If you can traverse the entire pattern without hitting a dead end, the pattern exists as a substring:

def search(self, pattern: str) -> bool:
    """Check if pattern exists as a substring in the text."""
    node = self.root
    for char in pattern:
        if char not in node.children:
            return False
        node = node.children[char]
    return True

def find_occurrences(self, pattern: str) -> list[int]:
    """Return all starting positions where pattern occurs."""
    node = self.root
    for char in pattern:
        if char not in node.children:
            return []
        node = node.children[char]
    return node.suffix_indices

def count_occurrences(self, pattern: str) -> int:
    """Count how many times pattern appears in the text."""
    return len(self.find_occurrences(pattern))

Finding the longest repeated substring leverages the trie structure differently. Any internal node with multiple children (or multiple suffix indices) represents a repeated substring. You traverse the trie, tracking the deepest node that has been visited by multiple suffixes:

def longest_repeated_substring(self) -> str:
    """Find the longest substring that appears more than once."""
    result = ""
    
    def dfs(node: SuffixTrieNode, current_path: str):
        nonlocal result
        # A node with multiple suffix indices means this prefix is repeated
        if len(node.suffix_indices) > 1 and len(current_path) > len(result):
            result = current_path
        
        for char, child in node.children.items():
            dfs(child, current_path + char)
    
    dfs(self.root, "")
    return result

Practical Implementation Patterns

The implementation above uses explicit node objects with dictionaries for children. This approach prioritizes clarity and flexibility. For production use, consider these optimizations:

Array-based children work well when the alphabet is small and known. Replace the dictionary with a fixed-size array indexed by character codes. This trades memory for faster lookups:

class OptimizedNode:
    def __init__(self, alphabet_size: int = 256):
        self.children = [None] * alphabet_size
        self.suffix_indices = []
    
    def get_child(self, char: str):
        return self.children[ord(char)]
    
    def set_child(self, char: str, node):
        self.children[ord(char)] = node

Implicit vs. explicit suffix tries represent a design choice. An explicit trie stores every node, including those with single children. An implicit trie compresses chains of single-child nodes into single edges labeled with substrings. Implicit tries save space but complicate implementation.

Here’s a complete working example demonstrating practical usage:

class SuffixTrie:
    def __init__(self, text: str):
        self.root = SuffixTrieNode()
        self.text = text
        self._build_trie()
    
    def _build_trie(self):
        for i in range(len(self.text)):
            self._insert_suffix(i)
    
    def _insert_suffix(self, start_index: int):
        node = self.root
        for char in self.text[start_index:]:
            if char not in node.children:
                node.children[char] = SuffixTrieNode()
            node = node.children[char]
            node.suffix_indices.append(start_index)
        node.is_end_of_suffix = True
    
    def search(self, pattern: str) -> bool:
        node = self.root
        for char in pattern:
            if char not in node.children:
                return False
            node = node.children[char]
        return True
    
    def find_occurrences(self, pattern: str) -> list[int]:
        node = self.root
        for char in pattern:
            if char not in node.children:
                return []
            node = node.children[char]
        return sorted(node.suffix_indices)


# Usage example
text = "mississippi"
trie = SuffixTrie(text)

print(trie.search("issi"))      # True
print(trie.search("xyz"))       # False
print(trie.find_occurrences("issi"))  # [1, 4]
print(trie.find_occurrences("ss"))    # [2, 5]

Limitations and Alternatives

The O(n²) space complexity is the suffix trie’s critical limitation. A 1 MB text file would require roughly 1 TB of memory for its suffix trie. This makes suffix tries impractical for large-scale text processing.

Suffix trees compress the trie by collapsing chains of single-child nodes. This reduces space to O(n) while preserving O(m) query time. Ukkonen’s algorithm constructs suffix trees in O(n) time. The implementation complexity increases substantially, but suffix trees power many production systems.

Suffix arrays store the sorted order of all suffixes using just n integers. Combined with auxiliary structures like LCP (Longest Common Prefix) arrays, they match suffix tree functionality with better cache performance and simpler implementation. Most modern bioinformatics tools use suffix arrays.

When to use suffix tries anyway:

  • Educational contexts where understanding the concept matters more than efficiency
  • Short strings (under 10,000 characters) where O(n²) space is acceptable
  • Prototyping before optimizing with more complex structures
  • Problems where the trie’s explicit structure simplifies the algorithm

Summary

Suffix tries provide an intuitive entry point into suffix-based string algorithms. By storing all suffixes of a string in a trie, you enable O(m) substring searches, occurrence counting, and longest repeated substring detection.

Operation Time Complexity Notes
Construction O(n²) Insert all n suffixes
Substring search O(m) m = pattern length
Count occurrences O(m) Returns count directly
Find all occurrences O(m + k) k = number of occurrences
Longest repeated substring O(n²) Traverse entire structure
Space O(n²) Worst case

For production systems processing large texts, graduate to suffix trees or suffix arrays. But for understanding the core concepts and handling moderate-sized strings, suffix tries remain a practical and instructive choice.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.