23/06/2021

[Java] String edit distance (Levenshtein distance)

Given two strings, find their edit distance. Edits are: character insertion, deletion, replacement. The edit distance is the minimum amount of edits necessary to reach the desired goal.

This sounds like a graph problem, where we start from a string and want to reach another string through a series of modifications. Either BFS or DFS can be used.

In this approach we use BFS and a queue containing the pairs(string, number of edits so far).

We see three possible cases:

(1) start string is longer than target string: in this case we need to remove one or more characters from the start. We generate all strings that result from deleting one character from the start and them to the queue.

(2) start string is shorter than target string: in this case we need to add one or more characters to the start. Adding every possible character is not convenient and we also don't need to do that. We can add a wildcard character which we will replace with the appropriate one later on

(3) start and target string have same length: in this case we walk over the strings at the same time and count how many characters are different, those are the replacements we need to do. If we encounter a wildcard character, we replace it for free as we could have chosen this exact letter when we inserted it

By tracking all strings we generate, we avoid adding duplicates to the queue, this could happen when strings have repeating characters or one is at least two characters shorter than the other. When a string is removed from the queue, we can also remove it from the set as it will be impossible to recreate again since we're constantly adding or removing characters, which means we can't generate a same length string after each step.

Consider that creating all substrings from a string is O(N^2) time, this also applies to adding a character to a string. Additionally, the hashing of a String value might be O(N) time as it's an array of chars behind the scenes.

Assuming the longest string is M and shortest is N, our runtime is O((M-N) x M^2 + (2^(M-N)) x N) as we perform M^2 insert/delete character operations for M-N times until we generate all possible strings of same length. The number of new strings we generate is 2^(M-N) as we have a choice to either insert/delete one character each time or not, and for each of those we do a string scan of the whole shorter string in N time.

I guess we can say when M-N << M and N then our runtime is O(M^2 + N)

We handle special cases such as empty and null strings before our logic so we can return those in constant time.

If strings have same length already, algorithm runs in linear time as we don't generate anything and perform only one pass over the string.

Space complexity is O(2^(M-N))  as at most in the queue and set we store the amount of elements we generate at the lowest level of our traversal.

You can check my implementation of editDistance on my Gist along with some tests in EditDistanceJTests.

No comments:

Post a Comment

With great power comes great responsibility