Pages

23/06/2021

[Java] Wagner-Fischer algorithm for string edit distance (Levenshtein distance)

We've already seen a way to calculate the edit distance between two strings, which was a bit laborious, turns out there is a nice compact algorithm invented by Wagner and Fischer to achieve the same using dynamic programming with O(NM) time and space complexity.

We create a N+1xM+1 matrix representing our strings with the empty string prepended. We then walk the matrix and determine the cost of generating the rest of the string given the cost of reaching the current edit for the substring we are considering at each point.

The idea is this: 

  • from the empty string to reach the other string we need to insert all characters 
  • from a full string to reach the empty string we need to delete all characters

We can initialize this with a two single loops on M and N (reason why the matrix is size of both strings + 1), then for each position in our matrix, we verify the cost of transforming the current prefixes to have a match by inserting/deleting/replacing the current character.

If the current character in the prefixes is same, we have no work to do, otherwise we must do 1 replacement operation, then we pick the minumum cost between:

  • inserting an additional character in this position
  • deleteing a character from this position
  • replacing a character in this position

in the first two cases, the cost is the cost of generating this prefix from the previous step + 1 insertion/deletion and in the last case it is the cost of generating this prefix from the previous step + cost of replacement (if characters were same, it's 0, otherwise 1).

We will have our final answer in position MxN in our matrix.

You can check my implementation of editDistanceWagnerFischer on my Gist along with some tests in EditDistanceWagnerFischerJTests.

No comments:

Post a Comment

With great power comes great responsibility