Shortest Common Supersequence: LCS Application
The Shortest Common Supersequence (SCS) problem asks a deceptively simple question: given two strings X and Y, what is the shortest string that contains both X and Y as subsequences? A subsequence...
Key Insights
- The Shortest Common Supersequence (SCS) length follows a simple formula:
len(X) + len(Y) - LCS_length, making LCS computation the foundation for solving SCS problems efficiently. - Constructing the actual SCS string requires backtracking through the LCS dynamic programming table, making decisions at each cell based on character matches and optimal subproblem directions.
- SCS has direct applications in version control systems, DNA sequence assembly, and diff tools—anywhere you need to merge two sequences while preserving their individual orderings.
Introduction to Shortest Common Supersequence
The Shortest Common Supersequence (SCS) problem asks a deceptively simple question: given two strings X and Y, what is the shortest string that contains both X and Y as subsequences? A subsequence maintains the relative order of characters but doesn’t require them to be contiguous.
For example, given “ABAC” and “CAB”, the SCS is “CABAC” (length 5). Both original strings appear as subsequences: CABAC contains “CBAC” wait—let me be precise. “ABAC” appears as CABAC (positions 2,3,4,5), and “CAB” appears as CABAC (positions 1,2,3).
This problem matters in practical scenarios. In bioinformatics, researchers assemble DNA fragments into complete sequences by finding supersequences that contain all fragments. Version control systems solve a variant of SCS when merging branches—they need the shortest coherent file that preserves changes from both branches. Diff tools use similar logic to show minimal changes between file versions.
The elegant insight is that SCS and LCS (Longest Common Subsequence) are two sides of the same coin. Once you understand LCS, you’ve essentially solved SCS.
The LCS-SCS Relationship
The mathematical relationship between SCS and LCS is surprisingly clean:
SCS_length = len(X) + len(Y) - LCS_length
Why does this work? Think about what happens when you construct a supersequence. You need to include every character from both strings. However, characters in the LCS appear in both strings—you only need to include them once. The LCS represents the “overlap” you can exploit.
Consider X = “ABAC” (length 4) and Y = “CAB” (length 3). The LCS is “AB” (length 2). Therefore:
SCS_length = 4 + 3 - 2 = 5
Visually, imagine laying out both strings and aligning their common subsequence:
X: A B A C
| |
Y: C A B
The LCS characters (A, B) get shared. Everything else appears separately. The supersequence weaves together both strings, using LCS characters as anchor points.
Here’s a simple function that calculates SCS length:
def scs_length(x: str, y: str) -> int:
"""Calculate SCS length using the LCS relationship."""
m, n = len(x), len(y)
# Build LCS DP table
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if x[i - 1] == y[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
lcs_len = dp[m][n]
return m + n - lcs_len
Constructing the SCS String
Knowing the length is useful, but most applications need the actual supersequence string. This requires backtracking through the LCS DP table with a twist: instead of just extracting the LCS, we build the SCS by including non-LCS characters at the right positions.
The algorithm walks backward from dp[m][n] to dp[0][0], making one of three decisions at each cell:
- Characters match: Include the character once (it’s in the LCS), move diagonally.
- dp[i-1][j] > dp[i][j-1]: Include
x[i-1], move up (this character isn’t in the LCS at this position). - Otherwise: Include
y[j-1], move left.
After reaching the origin, prepend any remaining characters from either string.
def shortest_common_supersequence(x: str, y: str) -> str:
"""Construct the shortest common supersequence of x and y."""
m, n = len(x), len(y)
# Build LCS DP table
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if x[i - 1] == y[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
# Backtrack to construct SCS
result = []
i, j = m, n
while i > 0 and j > 0:
if x[i - 1] == y[j - 1]:
# Character is in LCS - include once
result.append(x[i - 1])
i -= 1
j -= 1
elif dp[i - 1][j] > dp[i][j - 1]:
# Move up - include character from x
result.append(x[i - 1])
i -= 1
else:
# Move left - include character from y
result.append(y[j - 1])
j -= 1
# Add remaining characters
while i > 0:
result.append(x[i - 1])
i -= 1
while j > 0:
result.append(y[j - 1])
j -= 1
return ''.join(reversed(result))
Step-by-Step Walkthrough
Let’s trace through X = “ABAC” and Y = “CAB” in detail.
First, build the DP table:
"" C A B
"" 0 0 0 0
A 0 0 1 1
B 0 0 1 2
A 0 0 1 2
C 0 1 1 2
The LCS length is 2 (bottom-right cell). Now backtrack from dp[4][3]:
def scs_with_trace(x: str, y: str) -> str:
"""SCS construction with detailed tracing."""
m, n = len(x), len(y)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if x[i - 1] == y[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
print("DP Table:")
print(" ", " ".join(["''"] + list(y)))
for i, row in enumerate(dp):
label = "''" if i == 0 else x[i - 1]
print(f" {label} {row}")
print()
result = []
i, j = m, n
while i > 0 and j > 0:
print(f"Position ({i}, {j}): x[{i-1}]='{x[i-1]}', y[{j-1}]='{y[j-1]}'")
if x[i - 1] == y[j - 1]:
print(f" → Match! Add '{x[i-1]}', move diagonal")
result.append(x[i - 1])
i -= 1
j -= 1
elif dp[i - 1][j] > dp[i][j - 1]:
print(f" → dp[{i-1}][{j}]={dp[i-1][j]} > dp[{i}][{j-1}]={dp[i][j-1]}")
print(f" → Add '{x[i-1]}' from X, move up")
result.append(x[i - 1])
i -= 1
else:
print(f" → dp[{i-1}][{j}]={dp[i-1][j]} <= dp[{i}][{j-1}]={dp[i][j-1]}")
print(f" → Add '{y[j-1]}' from Y, move left")
result.append(y[j - 1])
j -= 1
while i > 0:
print(f"Remaining X: Add '{x[i-1]}'")
result.append(x[i - 1])
i -= 1
while j > 0:
print(f"Remaining Y: Add '{y[j - 1]}'")
result.append(y[j - 1])
j -= 1
scs = ''.join(reversed(result))
print(f"\nResult (reversed): {scs}")
return scs
# Run the trace
scs_with_trace("ABAC", "CAB")
Output:
Position (4, 3): x[3]='C', y[2]='B'
→ dp[3][3]=2 <= dp[4][2]=1... wait, 2 > 1
→ Add 'C' from X, move up
Position (3, 3): x[2]='A', y[2]='B'
→ dp[2][3]=2 > dp[3][2]=1
→ Add 'A' from X, move up
Position (2, 3): x[1]='B', y[2]='B'
→ Match! Add 'B', move diagonal
Position (1, 2): x[0]='A', y[1]='A'
→ Match! Add 'A', move diagonal
Remaining Y: Add 'C'
Result (reversed): CABAC
Complexity Analysis
Time Complexity: O(m × n) for building the DP table, plus O(m + n) for backtracking. The dominant term is O(m × n).
Space Complexity: O(m × n) for storing the full DP table. Unlike pure LCS length calculation, we cannot use the standard O(min(m, n)) space optimization because backtracking requires the full table.
However, there’s a clever optimization using Hirschberg’s algorithm that achieves O(min(m, n)) space while maintaining O(m × n) time. The idea is to use divide-and-conquer: find the midpoint of the optimal path, recursively solve both halves, and concatenate. This is worth implementing for very long sequences.
def scs_space_optimized(x: str, y: str) -> str:
"""
Space-optimized SCS using Hirschberg's divide-and-conquer.
O(m*n) time, O(min(m,n)) space.
"""
if len(x) < len(y):
x, y = y, x # Ensure x is longer
if len(y) == 0:
return x
if len(y) == 1:
# Base case: insert y[0] at optimal position
idx = x.find(y[0])
return x if idx >= 0 else y[0] + x
# For production, implement full Hirschberg's algorithm here
# This is a placeholder showing the concept
return shortest_common_supersequence(x, y)
Practical Applications
Here’s a simplified file-merge utility demonstrating SCS in action:
def merge_file_lines(file1_lines: list[str], file2_lines: list[str]) -> list[str]:
"""
Merge two file versions, preserving order from both.
Uses SCS logic treating each line as a 'character'.
"""
m, n = len(file1_lines), len(file2_lines)
# Build DP table for line-level LCS
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if file1_lines[i - 1] == file2_lines[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
# Backtrack to build merged result
result = []
i, j = m, n
while i > 0 and j > 0:
if file1_lines[i - 1] == file2_lines[j - 1]:
result.append(file1_lines[i - 1])
i -= 1
j -= 1
elif dp[i - 1][j] > dp[i][j - 1]:
result.append(f"[FROM V1] {file1_lines[i - 1]}")
i -= 1
else:
result.append(f"[FROM V2] {file2_lines[j - 1]}")
j -= 1
while i > 0:
result.append(f"[FROM V1] {file1_lines[i - 1]}")
i -= 1
while j > 0:
result.append(f"[FROM V2] {file2_lines[j - 1]}")
j -= 1
return list(reversed(result))
# Example usage
v1 = ["def main():", " x = 1", " print(x)"]
v2 = ["def main():", " x = 2", " y = 3", " print(x)"]
merged = merge_file_lines(v1, v2)
for line in merged:
print(line)
In bioinformatics, SCS helps assemble overlapping DNA reads into contiguous sequences. The algorithm identifies shared subsequences (conserved regions) and builds the shortest sequence containing all reads.
Conclusion
The Shortest Common Supersequence problem demonstrates a powerful pattern: complex problems often reduce to simpler, well-understood subproblems. Here, SCS reduces entirely to LCS—once you compute the LCS table, SCS construction is just careful bookkeeping during backtracking.
This pattern extends further. Edit distance (Levenshtein distance) uses similar DP structure. Multiple sequence alignment generalizes to higher dimensions. Understanding LCS deeply gives you a foundation for an entire family of sequence problems.
For your next challenge, consider the multiple-SCS problem: finding the shortest supersequence of three or more strings. The DP table becomes three-dimensional, and complexity grows exponentially—but the core intuition remains the same.