14/02/2018

[Java] Regular expression matcher

Regular expressions are a pain, until they are not. Implementing a matcher is a double pain, and possibly the hardest exercise I've seen so far in terms of proper coding.

So many rules have to be carefully prioritised for it to succeed, even without fully supporting the common keywords. We have the following syntax:
  • alphanumeric - matches exactly that character
  • . - matches any character
  • * - must follow . or an alphanumeric character, means we can match multiple instances of the preceding character
  • ^ - must be the first character, means the matching has to begin from the start
  • $ - must be the last character, means the matching has to occur at the end
In the absence of ^ and $, the match can occur anywhere in the string.



The performance of a regular expression is not easy to determine and can quickly spiral out of control for complex and carelessly constructed ones but thinking of it as being usually in order of at least O(2^N) would be a safe bet. This comes from the fact that when matching, we basically iterate over two strings, the input and the regex, trying to find at least one of the possible matches ANYWHERE in the input string, due to the existence of "choice" characters, for example '*' we have branching paths, hence the exponential complexity.

Unless ALL rules are matched, the regex would say no. But since we have special keywords such as *, we might have multiple candidates for matching the same regex for example:

a*

means 0 or more a, therefore all these are valid matches:

a
aa
aaaa

and so on. Now, if we extend that such as:

a*9

we have the following potential matches:

9
a9
aa9
aa9a
aaaa9

And so on. This means that we might be looping for each single character of the input over the regex to verify whether a match can be found from the current position or not FOR EACH position BEFORE continuing with the next portion of the pattern.

Prioritising the rules helps with writing the code, but this is a case where test driven development definitely helps A LOT.

Idea is to have a couple simple base cases:
  • null or empty string can only be matched by null or empty regex
  • null or empty regex matches any string
  • we could either parse the regex and check for validity first, or try to match it and throw an exception only if and when we encounter a mistake
Then an order of the checks:
  • if the first character is a ^, we only need to match the whole regex immediately, otherwise we have to loop over each character in the input and try to match from there
  • if we are matching on $, the rest of the matching went fine and we just need to make sure the input string is finished, otherwise THIS matching can't be done eg: a$ matches aa$ AFTER we start from the SECOND a
  • if we have a * we need for EACH position where we match it, to continue the matching of the rest of the regex AND also to consider the case where 0 matches could be made (branching decision path)

You can check my implementation of matchRegex on my Gist alongside some test cases in RegexJTests.

Here is a better approach using dynamic programming on a simpler version of the regex language.

No comments:

Post a Comment

With great power comes great responsibility