18/06/2021

[Java] Longest common substring

Given two strings, find the longest common substring, if any.

We can simply scan both strings at the same time starting at each position and find our answer in O(NM^2) time and constant space.

However, we notice that we potentially redo work that was already done at previous iterations, we can therefore use extra O(NM) space to cache this work and reduce our runtime to O(NM) only.

We cache the max length of matching substrings for each pair of indices i,j in the two strings.

If we are at the beginning of a new substring, the first match will be of max length 1, otherwise subsequent matches will be previous max length + 1. For example

        01234
        abcba
        dbcbe


Starting at a, there is no match for it in the other string

Starting at b, we find a first match in position 1 in the other string, then a second match in position 3

Starting at c, we find a first match in position 2 in the other string, however at the previous iteration we had already found a match (i-1, j-1), therefore we expand on that match

Starting at b, we find a first match in position 1 in the other string, with no prior match for that position, the max length is 1. The second match in position 3 instead is expanding on the previous match for the string "bc", therefore we expand on that length. Our matrix looks like:

  0 1 2 3 4

0 0 0 0 0 0

1 0 1 0 1 0

2 0 0 2 0 0

3 0 1 0 3 0

4 0 0 0 0 0


While doing this, we keep track of the longest substring found each time and of the index where the longest substring ends.

After we scanned the two strings, our answer is the substring starting at end - length + 1 (to include first character) and ending at end + 1 (to include last character).

You can check my implementation of longestCommonSubstring on my Gist along with some tests in LongestCommonSubstringJTests.

Spoiler alert: using suffix trees and Ukkonen's algorithm, this can be solved in O(N+M) time instead.

No comments:

Post a Comment

With great power comes great responsibility