Pages

18/05/2021

[Java] Aho-Corasick Trie for string search and matching

We worked on Tries in the past and now we take a look at a string search algorithm using Aho-Corasick which gives a nice O(string_length) time AFTER a preprocessing of all the desired search strings.

I found the following videos helpful in understanding the algorithm:

Clearly this approach is most effective when we want to do repeated searches for same patterns in multiple strings.

The idea is that after we created the whole trie, we can add information to each node that would allow us to recover from partial matches in case we reach a failure on a substring match.

For this, we have to calculate using a BFS for each node where to go in case we find we are unable to continue matching from that position.

This information is called fail node or fail link. The value we set there is the node corresponding to the longest non failed match for the partial string we were looking at, starting from the root.


For example, assume these are the matches we want to find in a given string: ab, aab, ac, abc


We process them and create the following trie (* indicates a full word):


       ()

        |

        a

/       |       \

b(*)    a      c(*)

|       | 

c(*)   b(*)


Normally, when we walk down a trie we keep following links corresponding to the next character we want to match in a given string. If no link exists that we can follow, we say the string does NOT contain any match and return. This is fine if we only care about a single match, but if we consider multiple matches, we find that every time we would restart from the root of the trie trying to match a new substring.

By adding failure links, we can recover from a failure and continue from a partial match. If a failure brought us to the root, it means that particular substring does NOT appear in our matches, and the only thing we can therefore match is an empty string (the root).

In this algorithm, the root does NOT have a fail link and its children all fail back to the root.

This makes sense since failing to match on empty string () means we are still in the root, while failing to match on the very first character of a substring means that character did NOT appear in our search string, and we have to restart from scratch - the empty string.

If instead we could match a certain substring, but failed to continue from a particular node, we do not necessarily have to restart from the empty string. IF there exists another string we want to match that happens to share a PREFIX equal what we matched so far, we can continue from there instead.

Suppose in our example we are trying to look for matches in string "aabc". We can see that we expect the following results: 

  • "ab" matches from position 1
  • "aab" matches from position 0
  • "ac" has no match
  • "abc" matches from position 1

Now assume we walked the input string until we reached "aab". From there we found a whole word but cannot continue walking since there is no child with character "c", the next we are searching for. 

However, thus far we matched both "ab" and "aab" strings and we notice that we have another potential match "abc" which shares a portion with "ab".

Our fail link would NOT bring us to the root, but to the LONGEST matching string with same PREFIX starting from the root, which is node "ab".

From there, we CAN follow a link for character "c" and end up in node "abc" which is another match.

Notice that we simply move forward one node and did NOT restart from scratch, meaning that for each character in the given string we can find in constant time ALL matches from any given position.

Two important rules:

  • when we reach a node which has matching words, we matched ALL of them, eg: reaching node "abc" means we matched both "ab" and "abc"
  • if we cannot move forward from a node, we keep following failure links until we either reach a node that allows us to move forward OR we are trying to fall back from the root (empty string). In the second case, the current substring had NO matches, therefore we restart from empty string state (we stay on the root and move on to the next character in the string we are evaluating)
  • when we find a node that has matching words, the position of the start of the match is calculated with the formula:

i - len(word) + 1

where:

  • i is the index of the current character we are trying to match in the given string
  • len(word) is the length of the word we have in the current node, the one we just matched

 

You can find my implementation of findMatchingStringsInString and buildFailLinks which rely on the Trie class on my Gist along with some tests in TrieJTests.

No comments:

Post a Comment

With great power comes great responsibility