Longest Common Subsequence: DP Solution

Key Insights

  • The Longest Common Subsequence problem exhibits both optimal substructure and overlapping subproblems, making it a textbook case for dynamic programming optimization
  • A naive recursive solution runs in O(2^(m+n)) time in the worst case, but the DP approach reduces this to O(m×n) by storing solutions to subproblems in a 2D table
  • Space complexity can be further optimized from O(m×n) to O(min(m,n)) since each row only depends on the previous row

Introduction & Problem Definition

The Longest Common Subsequence (LCS) problem asks a deceptively simple question: given two strings, what’s the longest sequence of characters that appears in both, in the same order, but not necessarily contiguously?

This distinction between subsequence and substring matters. A substring must be contiguous—“abc” is a substring of “xabcy”. A subsequence can skip characters—“ace” is a subsequence of “abcde” because you can pick ‘a’, skip ‘b’, pick ‘c’, skip ‘d’, pick ‘e’.
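
To make the distinction concrete, here’s a minimal checker (a hypothetical helper, not part of the LCS algorithm itself) that tests whether one string is a subsequence of another:

```python
def is_subsequence(small: str, big: str) -> bool:
    """Return True if `small` appears in `big` in order, gaps allowed."""
    it = iter(big)
    # `c in it` consumes the iterator, so matches must occur left to right
    return all(c in it for c in small)

print(is_subsequence("ace", "abcde"))  # True: gaps are allowed
print(is_subsequence("cea", "abcde"))  # False: order matters
```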

You’ve used LCS without knowing it. When git diff shows you what changed between commits, it’s finding the longest common subsequence of lines to minimize the displayed differences. DNA sequencing tools use LCS variants to align genetic sequences. Spell checkers use related algorithms to suggest corrections. Understanding LCS gives you a foundation for an entire family of string comparison problems.

Naive Recursive Approach

The recursive solution follows naturally from the problem definition. Compare the last characters of both strings. If they match, that character is part of the LCS, and we recurse on the remaining prefixes. If they don’t match, we try removing a character from either string and take the better result.

def lcs_recursive(s1: str, s2: str, i: int, j: int) -> int:
    """
    Naive recursive LCS implementation.
    i, j are the lengths of prefixes we're considering.
    """
    # Base case: one string is empty
    if i == 0 or j == 0:
        return 0
    
    # Characters match: include in LCS
    if s1[i - 1] == s2[j - 1]:
        return 1 + lcs_recursive(s1, s2, i - 1, j - 1)
    
    # Characters don't match: try both options
    return max(
        lcs_recursive(s1, s2, i - 1, j),
        lcs_recursive(s1, s2, i, j - 1)
    )

# Usage
s1, s2 = "ABCDGH", "AEDFHR"
print(lcs_recursive(s1, s2, len(s1), len(s2)))  # Output: 3 (ADH)

This works, but it’s catastrophically slow. The time complexity is O(2^(m+n)) in the worst case, because each non-matching comparison branches into two recursive calls. For two strings of length 20, that’s on the order of 2^40 operations (about a trillion). For length 100, you’ll be waiting until the heat death of the universe.

Understanding the Optimal Substructure

The recursive solution fails because it recomputes the same subproblems repeatedly. Let’s expose this waste:

call_count = {}

def lcs_with_logging(s1: str, s2: str, i: int, j: int) -> int:
    key = (i, j)
    call_count[key] = call_count.get(key, 0) + 1
    
    if i == 0 or j == 0:
        return 0
    
    if s1[i - 1] == s2[j - 1]:
        return 1 + lcs_with_logging(s1, s2, i - 1, j - 1)
    
    return max(
        lcs_with_logging(s1, s2, i - 1, j),
        lcs_with_logging(s1, s2, i, j - 1)
    )

s1, s2 = "ABCDEF", "AZBYCX"
lcs_with_logging(s1, s2, len(s1), len(s2))

# Show repeated computations
repeated = {k: v for k, v in call_count.items() if v > 1}
print(f"Subproblems computed multiple times: {len(repeated)}")
print(f"Most repeated: {max(repeated.values())} times")

Run this and you’ll see subproblems computed dozens of times. The key insight: there are only O(m×n) unique subproblems (one for each pair of prefix lengths), but we’re solving them exponentially many times.

The recurrence relation formalizes this:

LCS(i, j) = 0                           if i = 0 or j = 0
          = 1 + LCS(i-1, j-1)           if s1[i-1] = s2[j-1]
          = max(LCS(i-1, j), LCS(i, j-1)) otherwise

This is the optimal substructure: the solution to the whole problem is built from solutions to smaller subproblems.
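
The simplest fix is to cache the recursion directly. A top-down sketch using functools.lru_cache: the same recurrence as above, but each (i, j) pair is computed only once.

```python
from functools import lru_cache

def lcs_memo(s1: str, s2: str) -> int:
    """Top-down (memoized) LCS: O(m × n) time and space."""
    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        if i == 0 or j == 0:
            return 0
        if s1[i - 1] == s2[j - 1]:
            return 1 + solve(i - 1, j - 1)
        return max(solve(i - 1, j), solve(i, j - 1))
    return solve(len(s1), len(s2))

print(lcs_memo("ABCDGH", "AEDFHR"))  # 3
```

One caveat: the recursion depth reaches m + n, so very long strings can hit Python’s recursion limit, which is one reason to prefer the bottom-up version below.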

Bottom-Up DP Solution (Tabulation)

Instead of recursing and memoizing, we build the solution bottom-up. Create a 2D table where dp[i][j] represents the LCS length for the first i characters of s1 and first j characters of s2.

def lcs_dp(s1: str, s2: str) -> int:
    """
    Bottom-up DP solution for LCS.
    Time: O(m × n), Space: O(m × n)
    """
    m, n = len(s1), len(s2)
    
    # Create table with extra row/column for base cases
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    # Fill the table row by row
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                # Characters match: extend the LCS
                dp[i][j] = 1 + dp[i - 1][j - 1]
            else:
                # Take the best from excluding either character
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    return dp[m][n]

Let’s visualize the table for s1 = "AGCAT" and s2 = "GAC":

        ""  G   A   C
    ""   0   0   0   0
    A    0   0   1   1
    G    0   1   1   1
    C    0   1   1   2
    A    0   1   2   2
    T    0   1   2   2

Reading the table: dp[5][3] = 2, meaning the LCS of “AGCAT” and “GAC” has length 2. The actual subsequence is “AC” or “GA”—we’ll extract it next.

Each cell depends only on three neighbors: above, left, and diagonal. This dependency structure is what enables both the bottom-up fill order and the space optimization we’ll see later.
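
If you want to inspect the table yourself, here’s a small sketch that repeats the fill loop from lcs_dp but returns the whole grid instead of just the corner value:

```python
def lcs_table(s1: str, s2: str) -> list[list[int]]:
    """Build and return the full DP table (same fill loop as lcs_dp)."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = 1 + dp[i - 1][j - 1]
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp

for row in lcs_table("AGCAT", "GAC"):
    print(row)  # the last row ends in 2, the LCS length
```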

Reconstructing the Actual Subsequence

Knowing the length is often insufficient. Diff tools need the actual common lines. DNA analysis needs the matching sequence. Here’s how to backtrack through the table:

def lcs_with_backtrack(s1: str, s2: str) -> tuple[int, str]:
    """
    Returns both the LCS length and the actual subsequence.
    """
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    # Build the DP table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = 1 + dp[i - 1][j - 1]
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    # Backtrack to find the actual subsequence
    i, j = m, n
    lcs_chars = []
    
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            # This character is part of the LCS
            lcs_chars.append(s1[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] > dp[i][j - 1]:
            # Came from above
            i -= 1
        else:
            # Came from left
            j -= 1
    
    # Reverse since we built it backwards
    lcs_string = ''.join(reversed(lcs_chars))
    return dp[m][n], lcs_string

# Example
length, subsequence = lcs_with_backtrack("AGGTAB", "GXTXAYB")
print(f"Length: {length}, LCS: {subsequence}")  # Length: 4, LCS: GTAB

The backtracking logic mirrors the table construction: when characters match, we move diagonally and record the character. When they don’t, we follow the larger value (the path we took during construction).

Space Optimization

The full 2D table uses O(m×n) space. For comparing two 10,000-character strings, that’s 100 million entries: roughly 400MB at four bytes each, and considerably more in CPython, where every list element is a full object. We can do better.

Notice that each row only depends on the previous row. We don’t need the entire history, just the last row:

def lcs_space_optimized(s1: str, s2: str) -> int:
    """
    Space-optimized LCS using two rows.
    Time: O(m × n), Space: O(min(m, n))
    """
    # Ensure s2 is the shorter string for minimal space
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    
    m, n = len(s1), len(s2)
    
    # Only need two rows
    prev = [0] * (n + 1)
    curr = [0] * (n + 1)
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = 1 + prev[j - 1]
            else:
                curr[j] = max(prev[j], curr[j - 1])
        
        # Swap rows for next iteration
        prev, curr = curr, prev
    
    # Result is in prev because of the final swap
    return prev[n]

This reduces space to O(min(m, n)). The tradeoff: you lose the ability to backtrack and recover the actual subsequence without additional bookkeeping.
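
You can even drop to a single row by carrying the diagonal value in a temporary variable. A sketch of that further tweak (not shown above):

```python
def lcs_one_row(s1: str, s2: str) -> int:
    """Single-row LCS: O(m × n) time, O(min(m, n)) space, one buffer."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string as the row
    n = len(s2)
    row = [0] * (n + 1)
    for i in range(1, len(s1) + 1):
        diag = 0  # holds dp[i-1][j-1] as we sweep left to right
        for j in range(1, n + 1):
            above = row[j]  # dp[i-1][j], read before we overwrite it
            if s1[i - 1] == s2[j - 1]:
                row[j] = 1 + diag
            else:
                row[j] = max(above, row[j - 1])
            diag = above
    return row[n]

print(lcs_one_row("AGGTAB", "GXTXAYB"))  # 4
```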

For applications that only need the length (like similarity scoring), this optimization is pure upside. When you need the subsequence, stick with the full table or implement a more complex reconstruction algorithm.

Wrap-Up & Extensions

LCS is a gateway to a family of related problems:

Longest Common Substring requires contiguous matches. The DP approach is similar, but you reset to zero when characters don’t match instead of taking the max.
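
A minimal sketch of that tweak (hypothetical helper name), where a mismatch leaves the cell at zero instead of carrying a maximum forward:

```python
def longest_common_substring(s1: str, s2: str) -> int:
    """Length of the longest contiguous match between s1 and s2."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = 1 + dp[i - 1][j - 1]
                best = max(best, dp[i][j])
            # on mismatch, dp[i][j] stays 0: the run is broken
    return best

print(longest_common_substring("ABABC", "BABCA"))  # 4 ("BABC")
```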

Edit Distance (Levenshtein) counts the minimum insertions, deletions, and substitutions to transform one string into another. The table structure is nearly identical to LCS.
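
A sketch of that near-identical table with unit costs; the base cases now count characters rather than starting at zero, since transforming to or from the empty string costs one operation per character:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum insertions, deletions, and substitutions to turn a into b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # free: characters match
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitute
                                   dp[i - 1][j],      # delete
                                   dp[i][j - 1])      # insert
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```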

Shortest Common Supersequence finds the shortest string that contains both inputs as subsequences. Length = m + n - LCS(s1, s2).

For practice, try these problems:

  • LeetCode 1143: Longest Common Subsequence
  • LeetCode 583: Delete Operation for Two Strings
  • LeetCode 712: Minimum ASCII Delete Sum for Two Strings

The pattern you’ve learned—identifying overlapping subproblems, defining a recurrence relation, building a table bottom-up—applies to hundreds of DP problems. Master LCS and you’ve mastered the template.
