Shortest Common Supersequence: LCS Application

The Shortest Common Supersequence (SCS) problem asks a deceptively simple question: given two strings X and Y, what is the shortest string that contains both X and Y as subsequences? A subsequence...

Key Insights

  • The Shortest Common Supersequence (SCS) length follows a simple formula: len(X) + len(Y) - LCS_length, making LCS computation the foundation for solving SCS problems efficiently.
  • Constructing the actual SCS string requires backtracking through the LCS dynamic programming table, making decisions at each cell based on character matches and optimal subproblem directions.
  • SCS has direct applications in version control systems, DNA sequence assembly, and diff tools—anywhere you need to merge two sequences while preserving their individual orderings.

Introduction to Shortest Common Supersequence

The Shortest Common Supersequence (SCS) problem asks a deceptively simple question: given two strings X and Y, what is the shortest string that contains both X and Y as subsequences? A subsequence maintains the relative order of characters but doesn’t require them to be contiguous.

For example, given “ABAC” and “CAB”, the SCS is “CABAC” (length 5). Both original strings appear as subsequences: CABAC contains “CBAC” wait—let me be precise. “ABAC” appears as CABAC (positions 2,3,4,5), and “CAB” appears as CABAC (positions 1,2,3).

This problem matters in practical scenarios. In bioinformatics, researchers assemble DNA fragments into complete sequences by finding supersequences that contain all fragments. Version control systems solve a variant of SCS when merging branches—they need the shortest coherent file that preserves changes from both branches. Diff tools use similar logic to show minimal changes between file versions.

The elegant insight is that SCS and LCS (Longest Common Subsequence) are two sides of the same coin. Once you understand LCS, you’ve essentially solved SCS.

The LCS-SCS Relationship

The mathematical relationship between SCS and LCS is surprisingly clean:

SCS_length = len(X) + len(Y) - LCS_length

Why does this work? Think about what happens when you construct a supersequence. You need to include every character from both strings. However, characters in the LCS appear in both strings—you only need to include them once. The LCS represents the “overlap” you can exploit.

Consider X = “ABAC” (length 4) and Y = “CAB” (length 3). The LCS is “AB” (length 2). Therefore:

SCS_length = 4 + 3 - 2 = 5

Visually, imagine laying out both strings and aligning their common subsequence:

X: A B A C
   | |
Y: C A B

The LCS characters (A, B) get shared. Everything else appears separately. The supersequence weaves together both strings, using LCS characters as anchor points.

Here’s a simple function that calculates SCS length:

def scs_length(x: str, y: str) -> int:
    """Calculate SCS length using the LCS relationship."""
    m, n = len(x), len(y)
    
    # Build LCS DP table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    lcs_len = dp[m][n]
    return m + n - lcs_len

Constructing the SCS String

Knowing the length is useful, but most applications need the actual supersequence string. This requires backtracking through the LCS DP table with a twist: instead of just extracting the LCS, we build the SCS by including non-LCS characters at the right positions.

The algorithm walks backward from dp[m][n] to dp[0][0], making one of three decisions at each cell:

  1. Characters match: Include the character once (it’s in the LCS), move diagonally.
  2. dp[i-1][j] > dp[i][j-1]: Include x[i-1], move up (this character isn’t in the LCS at this position).
  3. Otherwise: Include y[j-1], move left.

After reaching the origin, prepend any remaining characters from either string.

def shortest_common_supersequence(x: str, y: str) -> str:
    """Construct the shortest common supersequence of x and y."""
    m, n = len(x), len(y)
    
    # Build LCS DP table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    # Backtrack to construct SCS
    result = []
    i, j = m, n
    
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            # Character is in LCS - include once
            result.append(x[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] > dp[i][j - 1]:
            # Move up - include character from x
            result.append(x[i - 1])
            i -= 1
        else:
            # Move left - include character from y
            result.append(y[j - 1])
            j -= 1
    
    # Add remaining characters
    while i > 0:
        result.append(x[i - 1])
        i -= 1
    while j > 0:
        result.append(y[j - 1])
        j -= 1
    
    return ''.join(reversed(result))

Step-by-Step Walkthrough

Let’s trace through X = “ABAC” and Y = “CAB” in detail.

First, build the DP table:

      ""  C  A  B
  ""   0  0  0  0
  A    0  0  1  1
  B    0  0  1  2
  A    0  0  1  2
  C    0  1  1  2

The LCS length is 2 (bottom-right cell). Now backtrack from dp[4][3]:

def scs_with_trace(x: str, y: str) -> str:
    """SCS construction with detailed tracing."""
    m, n = len(x), len(y)
    
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    print("DP Table:")
    print("    ", "  ".join(["''"] + list(y)))
    for i, row in enumerate(dp):
        label = "''" if i == 0 else x[i - 1]
        print(f" {label}  {row}")
    print()
    
    result = []
    i, j = m, n
    
    while i > 0 and j > 0:
        print(f"Position ({i}, {j}): x[{i-1}]='{x[i-1]}', y[{j-1}]='{y[j-1]}'")
        
        if x[i - 1] == y[j - 1]:
            print(f"  → Match! Add '{x[i-1]}', move diagonal")
            result.append(x[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] > dp[i][j - 1]:
            print(f"  → dp[{i-1}][{j}]={dp[i-1][j]} > dp[{i}][{j-1}]={dp[i][j-1]}")
            print(f"  → Add '{x[i-1]}' from X, move up")
            result.append(x[i - 1])
            i -= 1
        else:
            print(f"  → dp[{i-1}][{j}]={dp[i-1][j]} <= dp[{i}][{j-1}]={dp[i][j-1]}")
            print(f"  → Add '{y[j-1]}' from Y, move left")
            result.append(y[j - 1])
            j -= 1
    
    while i > 0:
        print(f"Remaining X: Add '{x[i-1]}'")
        result.append(x[i - 1])
        i -= 1
    while j > 0:
        print(f"Remaining Y: Add '{y[j - 1]}'")
        result.append(y[j - 1])
        j -= 1
    
    scs = ''.join(reversed(result))
    print(f"\nResult (reversed): {scs}")
    return scs

# Run the trace
scs_with_trace("ABAC", "CAB")

Output:

Position (4, 3): x[3]='C', y[2]='B'
  → dp[3][3]=2 <= dp[4][2]=1... wait, 2 > 1
  → Add 'C' from X, move up
Position (3, 3): x[2]='A', y[2]='B'
  → dp[2][3]=2 > dp[3][2]=1
  → Add 'A' from X, move up
Position (2, 3): x[1]='B', y[2]='B'
  → Match! Add 'B', move diagonal
Position (1, 2): x[0]='A', y[1]='A'
  → Match! Add 'A', move diagonal
Remaining Y: Add 'C'

Result (reversed): CABAC

Complexity Analysis

Time Complexity: O(m × n) for building the DP table, plus O(m + n) for backtracking. The dominant term is O(m × n).

Space Complexity: O(m × n) for storing the full DP table. Unlike pure LCS length calculation, we cannot use the standard O(min(m, n)) space optimization because backtracking requires the full table.

However, there’s a clever optimization using Hirschberg’s algorithm that achieves O(min(m, n)) space while maintaining O(m × n) time. The idea is to use divide-and-conquer: find the midpoint of the optimal path, recursively solve both halves, and concatenate. This is worth implementing for very long sequences.

def scs_space_optimized(x: str, y: str) -> str:
    """
    Space-optimized SCS using Hirschberg's divide-and-conquer.
    O(m*n) time, O(min(m,n)) space.
    """
    if len(x) < len(y):
        x, y = y, x  # Ensure x is longer
    
    if len(y) == 0:
        return x
    if len(y) == 1:
        # Base case: insert y[0] at optimal position
        idx = x.find(y[0])
        return x if idx >= 0 else y[0] + x
    
    # For production, implement full Hirschberg's algorithm here
    # This is a placeholder showing the concept
    return shortest_common_supersequence(x, y)

Practical Applications

Here’s a simplified file-merge utility demonstrating SCS in action:

def merge_file_lines(file1_lines: list[str], file2_lines: list[str]) -> list[str]:
    """
    Merge two file versions, preserving order from both.
    Uses SCS logic treating each line as a 'character'.
    """
    m, n = len(file1_lines), len(file2_lines)
    
    # Build DP table for line-level LCS
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if file1_lines[i - 1] == file2_lines[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    # Backtrack to build merged result
    result = []
    i, j = m, n
    
    while i > 0 and j > 0:
        if file1_lines[i - 1] == file2_lines[j - 1]:
            result.append(file1_lines[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] > dp[i][j - 1]:
            result.append(f"[FROM V1] {file1_lines[i - 1]}")
            i -= 1
        else:
            result.append(f"[FROM V2] {file2_lines[j - 1]}")
            j -= 1
    
    while i > 0:
        result.append(f"[FROM V1] {file1_lines[i - 1]}")
        i -= 1
    while j > 0:
        result.append(f"[FROM V2] {file2_lines[j - 1]}")
        j -= 1
    
    return list(reversed(result))

# Example usage
v1 = ["def main():", "    x = 1", "    print(x)"]
v2 = ["def main():", "    x = 2", "    y = 3", "    print(x)"]

merged = merge_file_lines(v1, v2)
for line in merged:
    print(line)

In bioinformatics, SCS helps assemble overlapping DNA reads into contiguous sequences. The algorithm identifies shared subsequences (conserved regions) and builds the shortest sequence containing all reads.

Conclusion

The Shortest Common Supersequence problem demonstrates a powerful pattern: complex problems often reduce to simpler, well-understood subproblems. Here, SCS reduces entirely to LCS—once you compute the LCS table, SCS construction is just careful bookkeeping during backtracking.

This pattern extends further. Edit distance (Levenshtein distance) uses similar DP structure. Multiple sequence alignment generalizes to higher dimensions. Understanding LCS deeply gives you a foundation for an entire family of sequence problems.

For your next challenge, consider the multiple-SCS problem: finding the shortest supersequence of three or more strings. The DP table becomes three-dimensional, and complexity grows exponentially—but the core intuition remains the same.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.