Pages

28/06/2021

[Java] Longest common subsequence

Give two strings, find the longest common subsequence. A subsequence is a sequence of characters, not necessarily adjacent, that appear in a specific order in both strings.

Example:

sda, dlolsla

longest common subsequence is "sa" of length 2, other subsequences are "d", "s" and "a".

I find that if we are not required to produce the string representing the longest subsequence but only the length, a top down recursive memoized approach is the easiest to come up with.

It requires us to:

  • find a recursive approach to solve the problem in O(2^input) time: decide whether to include or not a character from each string in a subsequence, in this case it would be O(2^(M+N))
  • check which parameters we are using for the recursion: in this case the indexes of the char to evaluate for both strings
  • create a cache for those parameters: either a matrix or a hashtable with a tuple as key

For this problem, the choice we make each time is:

  • if both strings have a common character, we are increasing the current best length and continue recursively on the next character from both strings
  • otherwise, we need to explore the two cases where we move to the next character on one string and stay put on the current in the other, picking the choice that has returned the best result

By adding a cache where we track the best result for a given pair of indices, we can reduce our problem to O(MN) time and space complexity as at most we have to check once all pair of characters from both strings.

The top down looks at cache values from NEXT computation we would perform (which is the result of a recursion).

If instead we are required to produce the string representing the subsequence as well, then a bottom up approach is easier and can be built using a similar criteria to the one applied to solve the top down approach.

In this case, our cache will be of size N+1xM+1 and we start by saying that strings of length 0 have 0 as best length for a common subsequence. Then we expand deciding whether to add a character or not from either string based on the best result we would get by doing so.

We can see that, as before, we want to include a character if both strings have that character as part of a subsequence, and if we do so, we are actually expanding on the previous best subsequent that led us there.

Optional: anytime this happens, we record in a separate matrix or same size this diagonal move (best previous subsequence for a pair of indices is in the previous diagonal value in our cache). Otherwise we choose to continue on either string based on the best result we had at the PREVIOUS character for each (which is already in the cache since it was calculated at a previous step).

Since we are using a matrix representation, for the string we set on the rows, the best previous choice will be "up" (the previous row), while for the string we set on the columns, the best previous choice will be "left" (the previous column).

At the end, we will have our best length in cache[N][M] and will have the directions to reconstruct the subsequence starting from the last character in the other matrix.

We can then start from the bottom right corner of that matrix and follow our instructions, whenever we encounter a diagonal move, we add the character in that position from either string to our result, stopping as soon as we no longer find any instruction we left.

This process takes O(M+N) time as each instruction will make us move back a row, a column or both, each time. 

Improved reconstructing approach: by looking at both the cache and backtrack matrix, we can notice a pattern which will allow us to reconstruct the subsequence without using the backtrack matrix at all:

  • if the characters from both string at positions a_idx and b_idx are equal, we would have moved diagonally, therefore add that character to the result and move BACK on both strings
  • if we chose to pick the character from string A over B (cache[a_idx - 1][b_idx] >= cache[a_idx][b_idx - 1]) then we move BACK on string A, otherwise on string B

We have same complexity as before, but one less matrix to deal with.

The resulting string will be in reverse order (from last to first character), so we need to reverse if before returning it. This is then done in O(S) time where S is the length of the subsequence.

Top down reconstructing approach: we can also reconstruct the subsequence from the top down cache, in this case we would start from the FIRST matching character in both strings, then:

  • if the characters from both string at positions a_idx and b_idx are equal, we would have moved diagonally, therefore add that character to the result and move FORWARD on both strings
  • if we chose to pick the character from string A over B (cache[a_idx + 1][b_idx] >= cache[a_idx][b_idx + 1]) then we move FORWARD on string A, otherwise on string B
  • if we reached the end of either string, we cannot perform the check above as we would go out of bounds, therefore advance only on the string that is NOT yet at the end. If both are at the end, advance on either so that we break out of the loop

In all approaches, our complexity is O(MN) for time and space for finding the solution and O(M+N) time for reconstructing it.

You can check my implementation of longestCommonSubsequence and longestCommonSubsequenceString on my Gist along with some tests in LongestCommonSubsequenceJTests.

No comments:

Post a Comment

With great power comes great responsibility