22/07/2021

[Java] Simple regex matcher

Given a simple regex that allows characters, dots and stars, find if a given text matches a given pattern.

dot '.' = any character

star '*' = 0 or more repetitions of previous character

We have already seen a slightly more complex regular expression matcher which does a similar thing but I realized that matcher did NOT use caching.

The complexity becomes exponential due to the behavior of star, since "0 or more" means we have a choice: match on this char and continue with same pattern or keep this char and continue with next piece of pattern.

For example:

text = abbbb

pattern = ab*b

Due to this possible choice, our complexity is O(2^N) for both time and space.

We can cache however the pairs (index of text char, index of pattern char) where failed to find a match, so that for repeated patterns example:

text = abbbb

pattern = ab*b*b*.*.*b

When we reach a portion that we already investigated in the past, we skip it and continue

This uses O(N^2) space and O(N^2) time as we check each pair once. 

We have to be very careful in our base cases:

text = "", pattern = "" is a match

text = "", pattern = ".*" or any char followed by star or any sequence of chars followed by stars (.*b*n*), is a match

text = "", pattern = "." or any sequence, is not a match

text = anything, pattern = "", is not a match

We therefore find the following cases:

  • pattern is not a star: try matching with text and continue if successful
  • pattern is a star: branch options (1) match on this char and continue on next char with same pattern or (2) stay on this char and continue on next char in pattern

If we encounter a failure when doing any of these inspections, we cache it

If at any point we can make a successful match, we stop and return true.

You can check my implementation of regexMatch on my Gist along with some tests in RegexMatchJTests.


No comments:

Post a Comment

With great power comes great responsibility