Longest Common Substring: DP Table Approach
Key Insights
- The longest common substring problem uses a 2D DP table where each cell represents the length of a common substring ending at those positions—matching characters extend the diagonal, mismatches reset to zero.
- Unlike longest common subsequence, substrings must be contiguous, which actually makes the DP recurrence simpler: you only look at one previous cell instead of taking a max over several.
- Space optimization reduces memory from O(n×m) to O(min(n,m)) by keeping only the previous row, though this makes extracting the actual substring more complex.
Introduction & Problem Definition
The longest common substring problem asks a straightforward question: given two strings, what’s the longest contiguous sequence of characters that appears in both? This differs fundamentally from the longest common subsequence problem, where characters don’t need to be adjacent—a distinction that trips up many engineers in interviews.
Consider these two strings:
s1 = "ABABC"
s2 = "BABCA"
# Longest common SUBSTRING: "BABC" (length 4)
# Longest common SUBSEQUENCE: "BABC" (length 4, found by a different algorithm)
In this case, they happen to match, but the algorithms are entirely different. Substring matching requires contiguity; subsequence matching allows gaps.
This problem appears everywhere in practice. Diff tools use it to find unchanged blocks between file versions. Plagiarism detection systems identify copied passages. Bioinformatics applications find matching segments in DNA sequences. Understanding the DP approach gives you a foundation for tackling all of these.
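Before building anything by hand, it's worth knowing that Python's standard library can already answer this question for small inputs: difflib.SequenceMatcher.find_longest_match returns the longest matching contiguous block between two sequences. A quick sanity check on the example above (autojunk=False disables a heuristic that can suppress matches of frequent characters in long strings):

```python
from difflib import SequenceMatcher

s1, s2 = "ABABC", "BABCA"
# find_longest_match returns Match(a, b, size): start in s1, start in s2, length
m = SequenceMatcher(None, s1, s2, autojunk=False).find_longest_match(
    0, len(s1), 0, len(s2)
)
print(s1[m.a : m.a + m.size])  # -> BABC
```

This is handy for cross-checking the implementations that follow, though it doesn't show you the mechanics the way the DP table does.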
Naive Approach & Its Limitations
The brute force solution is intuitive: generate every substring of the first string, then check if each appears in the second string. Track the longest match.
def lcs_naive(s1: str, s2: str) -> str:
    longest = ""
    n = len(s1)
    # Generate all substrings of s1
    for i in range(n):
        for j in range(i + 1, n + 1):
            substring = s1[i:j]
            # Check if this substring exists in s2
            if substring in s2 and len(substring) > len(longest):
                longest = substring
    return longest

# Works, but painfully slow
result = lcs_naive("ABABC", "BABCA")
print(result)  # "BABC"
This approach has brutal time complexity. Generating all substrings takes O(n²) iterations, and each in check against s2 costs roughly O(m) (Python's substring search is optimized internally, but still linear in practice). The overall complexity lands around O(n²m)—effectively O(n³) when the strings are of comparable length.
For strings of a few hundred characters, this might be acceptable. For DNA sequences with millions of base pairs, it’s unusable. We need dynamic programming.
DP Table Intuition & Recurrence Relation
The DP approach builds a 2D table where dp[i][j] represents the length of the longest common substring ending at position i in string 1 and position j in string 2. This “ending at” framing is crucial—it’s what makes the recurrence work.
The core insight is simple: if the characters at positions i and j match, we can extend whatever common substring ended at positions i-1 and j-1. If they don’t match, no common substring can end at both positions, so we reset to zero.
Recurrence:
dp[i][j] = dp[i-1][j-1] + 1 if s1[i-1] == s2[j-1]
dp[i][j] = 0 otherwise
Here’s how the table looks for s1 = "ABAB" and s2 = "BABA":
      ""  B  A  B  A
  ""   0  0  0  0  0
  A    0  0  1  0  1
  B    0  1  0  2  0
  A    0  0  2  0  3
  B    0  1  0  3  0
Read this table carefully. The cell at row “B” (second B in ABAB) and column “B” (second B in BABA) contains 3. This means there’s a common substring of length 3 ending at those positions: “BAB”.
Notice how the non-zero values always form diagonals. This makes sense—extending a common substring means moving forward in both strings simultaneously. The maximum value in the entire table (3) gives us the answer.
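You can see those diagonals directly by printing the table. A small sketch that fills the dp table for the example above (build_table is a throwaway helper, not part of the final implementation):

```python
def build_table(s1: str, s2: str) -> list[list[int]]:
    """Fill the DP table: dp[i][j] = length of the common substring
    ending at s1[i-1] and s2[j-1]."""
    n, m = len(s1), len(s2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            # mismatch: dp[i][j] stays 0
    return dp

dp = build_table("ABAB", "BABA")
for row in dp:
    print(row)
# The maximum cell value is the answer's length:
print(max(max(row) for row in dp))  # -> 3
```

Note that 3 actually appears twice in this table—"BAB" and "ABA" are both common substrings of maximal length.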
Implementation Walkthrough
Let’s build a complete implementation that returns both the length and the actual substring:
def longest_common_substring(s1: str, s2: str) -> tuple[int, str]:
    """
    Find the longest common substring between two strings.
    Returns (length, substring).
    """
    if not s1 or not s2:
        return 0, ""
    n, m = len(s1), len(s2)
    # Create DP table with an extra row/column for base case
    # dp[i][j] = length of LCS ending at s1[i-1] and s2[j-1]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    max_length = 0
    end_index = 0  # Ending index in s1
    # Fill the table
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                # Track the maximum
                if dp[i][j] > max_length:
                    max_length = dp[i][j]
                    end_index = i  # Position in s1 (1-indexed in table)
            # else: dp[i][j] remains 0
    # Extract the substring
    # end_index is where the substring ends in s1 (exclusive, 1-indexed)
    # So in 0-indexed terms: substring is s1[end_index - max_length : end_index]
    substring = s1[end_index - max_length : end_index]
    return max_length, substring

# Test it
s1 = "ABABC"
s2 = "BABCA"
length, substr = longest_common_substring(s1, s2)
print(f"Length: {length}, Substring: '{substr}'")
# Output: Length: 4, Substring: 'BABC'
The implementation tracks two pieces of information as it fills the table: the maximum length found so far and where that maximum ends in the first string. With these two values, extracting the actual substring is straightforward arithmetic.
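A quick way to build confidence in the DP logic is to cross-check its answer against the brute force on random inputs. A throwaway test sketch (both helpers are inlined here so the check is self-contained):

```python
import random

def dp_length(s1: str, s2: str) -> int:
    """DP-table length, as derived above."""
    n, m = len(s1), len(s2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def naive_length(s1: str, s2: str) -> int:
    """Longest substring of s1 that also occurs in s2, by exhaustive check."""
    return max((j - i
                for i in range(len(s1))
                for j in range(i + 1, len(s1) + 1)
                if s1[i:j] in s2), default=0)

random.seed(0)
for _ in range(200):
    a = "".join(random.choice("AB") for _ in range(random.randint(0, 12)))
    b = "".join(random.choice("AB") for _ in range(random.randint(0, 12)))
    assert dp_length(a, b) == naive_length(a, b), (a, b)
print("all cases agree")
```

Short alphabets like "AB" are deliberate: they force many partial matches and exercise the reset-to-zero branch heavily.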
Here’s the same algorithm in Java for those who prefer it:
public class LongestCommonSubstring {
    public static String find(String s1, String s2) {
        if (s1 == null || s2 == null || s1.isEmpty() || s2.isEmpty()) {
            return "";
        }
        int n = s1.length(), m = s2.length();
        int[][] dp = new int[n + 1][m + 1];
        int maxLength = 0;
        int endIndex = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (s1.charAt(i - 1) == s2.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                    if (dp[i][j] > maxLength) {
                        maxLength = dp[i][j];
                        endIndex = i;
                    }
                }
            }
        }
        return s1.substring(endIndex - maxLength, endIndex);
    }
}
Complexity Analysis
The DP solution offers a significant improvement over the naive approach:
Time Complexity: O(n × m)
We iterate through every cell in the table exactly once. Each cell requires constant-time work: one comparison and possibly one addition. For two strings of length 1000, that’s 1 million operations—fast enough for most applications.
Space Complexity: O(n × m)
The 2D table dominates memory usage. For those same 1000-character strings, we need storage for 1 million integers. At 4 bytes per integer, that’s roughly 4 MB. Manageable, but it adds up quickly for longer strings.
Compared to the naive O(n²m) time complexity, the DP approach is dramatically faster. The trade-off is memory usage—the naive approach needs only O(n) extra space for the candidate substrings it generates along the way.
Space Optimization
Here’s a key observation: when computing row i of the DP table, we only ever look at row i-1. We never need rows i-2, i-3, etc. This means we can reduce space complexity to O(m) by keeping only two rows—or even O(min(n, m)) by iterating over the shorter string for columns.
There’s a catch when collapsing to a single array: computing dp[j] needs the previous row’s value at j-1, which a left-to-right pass has already overwritten. You can either run the inner loop backwards or save the old diagonal value in a temporary before overwriting it—the single-array version below takes the temporary-variable route.
def lcs_space_optimized(s1: str, s2: str) -> int:
    """
    Space-optimized version returning only the length.
    Uses O(min(n, m)) space.
    """
    # Ensure s2 is the shorter string for minimal space
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    n, m = len(s1), len(s2)
    # Single row, representing the "previous" row's values
    prev = [0] * (m + 1)
    max_length = 0
    for i in range(1, n + 1):
        # Current row, computed from prev
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1] + 1
                max_length = max(max_length, curr[j])
        prev = curr  # Move to next row
    return max_length
# Alternative: truly single array, carrying the old diagonal in a temporary
def lcs_single_array(s1: str, s2: str) -> int:
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    n, m = len(s1), len(s2)
    dp = [0] * (m + 1)
    max_length = 0
    for i in range(1, n + 1):
        prev_diagonal = 0  # dp[i-1][j-1] from the previous row
        for j in range(1, m + 1):
            temp = dp[j]  # Save the previous row's dp[j] before overwriting it
            if s1[i - 1] == s2[j - 1]:
                dp[j] = prev_diagonal + 1
                max_length = max(max_length, dp[j])
            else:
                dp[j] = 0
            prev_diagonal = temp  # Becomes dp[i-1][j-1] for the next column
    return max_length
The trade-off: the discarded rows no longer contain the information needed to trace back through all matches. If you only need one longest substring, though, the extra bookkeeping is tiny—track the maximum length and its ending index in the first string as you fill the rows (two integers), then slice afterward. The full table is only needed if you want to enumerate every maximal match.
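A sketch of that two-integer bookkeeping, extending the two-row version above to return the substring itself:

```python
def lcs_optimized_with_substring(s1: str, s2: str) -> tuple[int, str]:
    """O(min(n, m)) table space; two extra integers recover the substring."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # iterate columns over the shorter string
    n, m = len(s1), len(s2)
    prev = [0] * (m + 1)
    max_length, end_i = 0, 0  # end_i: 1-indexed end position in (possibly swapped) s1
    for i in range(1, n + 1):
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > max_length:
                    max_length, end_i = curr[j], i
        prev = curr
    return max_length, s1[end_i - max_length : end_i]

print(lcs_optimized_with_substring("ABABC", "BABCA"))  # -> (4, 'BABC')
```

Note the slice is taken from whichever string ended up as s1 after the swap—the end index is only meaningful relative to that string.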
Wrap-Up & Extensions
The DP table approach for longest common substring is a textbook example of trading space for time. The O(n × m) time complexity makes it practical for strings up to tens of thousands of characters. The space optimization brings memory usage down to linear when you only need the length.
Key takeaways:
- The recurrence is simpler than longest common subsequence because you only check the diagonal—no max over neighboring cells.
- Track both the maximum length and its ending position to extract the actual substring.
- Space optimization is possible but complicates substring extraction.
For related problems, longest common subsequence uses a similar table structure but with a different recurrence: dp[i][j] = max(dp[i-1][j], dp[i][j-1]) when characters don’t match. Edit distance extends this further by adding insertion, deletion, and substitution costs.
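For contrast, here is a minimal sketch of the subsequence table—note the max over neighbors on mismatch, and that the answer sits in the final cell rather than being the table's maximum:

```python
def lcs_subsequence_length(s1: str, s2: str) -> int:
    """Length of the longest common SUBSEQUENCE (gaps allowed)."""
    n, m = len(s1), len(s2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1  # extend, like the substring case
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # carry forward, no reset
    return dp[n][m]

print(lcs_subsequence_length("ABABC", "BABCA"))  # -> 4
```

Because values carry forward instead of resetting to zero, dp[n][m] accumulates the global best—which is exactly why the substring version has to scan for its maximum instead.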
For truly massive strings—genome sequences with billions of base pairs—even O(n × m) becomes prohibitive. In those cases, a generalized suffix tree solves the problem in O(n + m) time, and suffix arrays get close with an extra logarithmic factor. But for the vast majority of practical applications, the DP table approach hits the sweet spot of simplicity and efficiency.