Suffix Automaton: Minimal DFA for All Substrings
Key Insights
- Suffix automata achieve O(n) space and construction time while representing all substrings of a string—making them more memory-efficient than suffix trees for many applications.
- The endpos equivalence class concept is the theoretical foundation that guarantees minimality: states represent sets of substrings that share identical ending positions.
- Online construction allows character-by-character building, making suffix automata ideal for streaming text processing and incremental indexing scenarios.
Introduction to Suffix Automata
A suffix automaton is the minimal deterministic finite automaton (DFA) that accepts exactly all substrings of a given string. If you’ve worked with suffix trees or suffix arrays, you know they’re powerful but come with trade-offs. Suffix trees consume significant memory (often 20+ bytes per character), while suffix arrays sacrifice some query flexibility for compactness.
Suffix automata hit a sweet spot. They require at most 2n - 1 states (for n ≥ 2) and 3n - 4 transitions (for n ≥ 3) on a string of length n, and construction takes O(n) time. Unlike suffix trees, they don't store explicit substring boundaries; instead, they encode substring relationships through state equivalence classes.
When should you reach for a suffix automaton? Consider them when you need to answer substring existence queries, count distinct substrings, find longest common substrings, or perform lexicographic operations on substrings. They’re particularly valuable in competitive programming and bioinformatics where both speed and memory matter.
Core Concepts and Structure
The suffix automaton’s elegance comes from a single insight: group substrings by where they end in the original string.
For a string like "abcbc", the substring "bc" occurs ending at positions 3 and 5 (1-indexed), and the substring "c" also ends at positions 3 and 5. These substrings share the same endpos set: the set of ending positions. Substrings with identical endpos sets form an equivalence class, and each equivalence class becomes exactly one state in the automaton.
This is why the automaton is minimal: you can’t have fewer states because each state represents a distinct endpos equivalence class, and you can’t merge states with different endpos sets without accepting invalid strings.
Each state tracks three things:
- len: The length of the longest substring in this equivalence class
- link: A suffix link pointing to the state representing the longest proper suffix with a different endpos set
- transitions: A map from characters to next states
The suffix links form a tree structure rooted at the initial state. This tree is crucial for construction and many queries.
#include <map>
#include <vector>

struct SuffixAutomaton {
    struct State {
        int len;
        int link;
        std::map<char, int> next;
        State() : len(0), link(-1) {}
    };

    std::vector<State> states;
    int last;

    SuffixAutomaton() {
        states.emplace_back(); // the initial state (root)
        last = 0;
    }
};
The initial state (index 0) represents the empty string with endpos equal to all positions. The last variable tracks the state corresponding to the entire string built so far.
Online Construction Algorithm
The construction algorithm processes one character at a time, extending the automaton incrementally. This online property is valuable—you can build the automaton as characters stream in without knowing the full string upfront.
When adding a character c, three scenarios arise:
- Simple extension: no suffix of the current string has a transition on c, so the new state links directly to the root
- Direct link: a transition to some state q exists and states[p].len + 1 == states[q].len, so the new state links to q with no extra work
- Clone required: a transition to q exists but the lengths don't match; q represents substrings of varying lengths and must be split
The algorithm walks up suffix links from the last state, adding transitions until it finds a state that already has a transition on c (or reaches the root).
void extend(char c) {
    // Create a new state for the current prefix
    int cur = (int)states.size();
    states.emplace_back();
    states[cur].len = states[last].len + 1;

    // Walk suffix links, adding transitions to the new state
    int p = last;
    while (p != -1 && states[p].next.find(c) == states[p].next.end()) {
        states[p].next[c] = cur;
        p = states[p].link;
    }

    if (p == -1) {
        // Case 1: no existing transition on 'c' from any suffix
        // Link directly to the root
        states[cur].link = 0;
    } else {
        int q = states[p].next[c];
        if (states[p].len + 1 == states[q].len) {
            // Case 2: state q represents exactly the suffix we need
            // No cloning required
            states[cur].link = q;
        } else {
            // Case 3: must clone state q
            // q represents substrings of multiple lengths; split it
            int clone = (int)states.size();
            states.emplace_back();
            states[clone].len = states[p].len + 1;
            states[clone].next = states[q].next;
            states[clone].link = states[q].link;
            // Redirect transitions from q to the clone where appropriate
            while (p != -1 && states[p].next[c] == q) {
                states[p].next[c] = clone;
                p = states[p].link;
            }
            // Update suffix links
            states[q].link = states[cur].link = clone;
        }
    }
    last = cur;
}
The cloning step is the subtle part. When state q has len greater than states[p].len + 1, it means q represents substrings that are “too long” to be proper suffixes of what we’re adding. We clone q, keeping the shorter substrings in the clone and leaving longer ones in the original.
Fundamental Operations
With the automaton built, we can answer queries efficiently.
Substring existence is straightforward: simulate the DFA.
bool contains(const std::string& pattern) {
    int cur = 0;
    for (char c : pattern) {
        auto it = states[cur].next.find(c);
        if (it == states[cur].next.end()) {
            return false;
        }
        cur = it->second;
    }
    return true;
}
Counting distinct substrings exploits the structure directly. Each state i represents substrings whose lengths run from states[states[i].link].len + 1 up to states[i].len; summing these ranges over all non-root states gives the answer.
long long countDistinctSubstrings() {
    long long count = 0;
    for (size_t i = 1; i < states.size(); i++) {
        int linkLen = (states[i].link == -1) ? 0 : states[states[i].link].len;
        count += states[i].len - linkLen;
    }
    return count;
}
Longest common substring between two strings uses a clever traversal. Build the automaton for the first string, then walk through the second string while tracking the longest match.
std::string longestCommonSubstring(const std::string& s1, const std::string& s2) {
    // Build the automaton for s1
    SuffixAutomaton sa;
    for (char c : s1) sa.extend(c);

    int cur = 0, curLen = 0;
    int bestLen = 0, bestPos = 0;
    for (int i = 0; i < (int)s2.size(); i++) {
        char c = s2[i];
        // Follow suffix links until we find a transition or hit the root
        while (cur != 0 && sa.states[cur].next.find(c) == sa.states[cur].next.end()) {
            cur = sa.states[cur].link;
            curLen = sa.states[cur].len;
        }
        if (sa.states[cur].next.find(c) != sa.states[cur].next.end()) {
            cur = sa.states[cur].next[c];
            curLen++;
        }
        if (curLen > bestLen) {
            bestLen = curLen;
            bestPos = i - curLen + 1;
        }
    }
    return s2.substr(bestPos, bestLen);
}
Advanced Applications
Beyond basic queries, suffix automata enable sophisticated string algorithms.
Finding the k-th lexicographically smallest substring requires preprocessing: compute how many distinct substrings are reachable from each state, then navigate greedily.
// Requires <algorithm> (std::sort) and <numeric> (std::iota)
std::vector<long long> countPaths; // countPaths[i] = paths leaving state i, counting the empty path
void computePaths() {
    int n = (int)states.size();
    countPaths.assign(n, 0);
    // Every transition leads to a state with strictly greater len, so
    // processing states in decreasing len order finishes targets first
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return states[a].len > states[b].len;
    });
    for (int v : order) {
        countPaths[v] = 1; // the empty path: the substring that ends at this state
        for (auto& [c, u] : states[v].next) {
            countPaths[v] += countPaths[u];
        }
    }
}
// Assumes computePaths() has run and 1 <= k <= number of distinct substrings
std::string kthSubstring(long long k) {
    std::string result;
    int cur = 0;
    while (k > 0) {
        // std::map iterates transitions in character order, so the walk is lexicographic
        for (auto& [c, next] : states[cur].next) {
            if (countPaths[next] >= k) {
                result += c;
                k--; // account for the substring ending at 'next'
                cur = next;
                break;
            }
            k -= countPaths[next];
        }
    }
    return result;
}
For multiple strings, you can build a generalized suffix automaton (also called a DAWG, a Directed Acyclic Word Graph) by resetting last to 0 between strings and tracking which strings reach each state. Be aware that the naive reset can create duplicate states when last already has a transition on the next character; robust implementations add a check for that case (reusing the target state when the lengths match, or cloning it first).
Performance Analysis and Comparison
Construction runs in O(n) amortized time. Each character triggers at most one walk up suffix links, and the total number of suffix link traversals across all insertions is bounded by O(n).
Space usage is O(n) states and O(n) transitions. With std::map for transitions, construction and lookups pay an extra O(log σ) factor, where σ is the alphabet size, giving O(n log σ) overall. For small alphabets, use fixed-size arrays instead of maps for O(1) transitions.
Compared to suffix arrays with LCP arrays, suffix automata offer faster substring existence checks (O(m) vs O(m log n)) but use more memory. Suffix trees provide similar query times but typically consume 3-5x more memory than suffix automata.
Choose suffix automata when you need online construction, when memory is constrained but you need fast queries, or when counting/enumerating substrings is your primary operation.
Practical Considerations
Memory optimization: Replace std::map with a fixed-size array for small alphabets. Pre-allocate the states vector if you know the string length.
Debugging: Verify invariants after construction. Every non-root state should have exactly one outgoing suffix link, and the links should form a tree rooted at the initial state. The sum of (len - link.len) across all states should equal the distinct substring count.
Common pitfalls: Forgetting to handle the empty string case, off-by-one errors in the clone condition, and not resetting last when building generalized automata.
In production, suffix automata power autocomplete systems, DNA sequence analysis tools, and plagiarism detection engines. Their combination of theoretical elegance and practical efficiency makes them an essential tool for any engineer working with string algorithms.