1. Overview

In this quick tutorial, we’ll show how the pattern-matching engine works. We’ll also present different ways to optimize regular expressions in Java.

For an introduction to the use of regular expressions, please refer to this article here.

2. The Pattern-Matching Engine

The java.util.regex package uses a type of pattern-matching engine called a Nondeterministic Finite Automaton (NFA). It’s considered nondeterministic because while trying to match a regular expression on a given string, each character in the input might be checked several times against different parts of the regular expression.

In the background, the engine mentioned above uses backtracking. This general algorithm tries to exhaust all possibilities until it declares failure. Consider the following example to better understand the NFA:

"tra(vel|ce|de)m"

With the input String “travel“, the engine first will look for “tra” and find it immediately.

After that, it’ll try to match “vel” starting from the fourth character. This will match, so it will go forward and try to match “m“.

That won’t match, and for that reason, it’ll go back to the fourth character and search for “ce“. Again, this won’t match, so it’ll go back again to the fourth position and try with “de“. That string won’t match either, and so it’ll go back to the second character in the input string and try to search for another “tra“.

With the last failure, the algorithm will return failure.

With the simple last example, the engine had to backtrack several times while trying to match the input String to the regular expression. Because of that, it’s important to minimize the amount of backtracking that it does.

3. Ways to Optimize Regular Expressions

3.1. Avoid Recompilation

Regular expressions in Java are compiled into an internal data structure. This compilation is the time-consuming process.

Each time we invoke the String.matches(String regex) method, the specified regular expression is recompiled:

if (input.matches(regexPattern)) {
    // do something
}

As we can see, every time the condition is evaluated, the regex expression is compiled.

To optimize, it’s possible to compile the pattern first and then create a Matcher to find the coincidences in the value:

Pattern pattern = Pattern.compile(regexPattern);
for(String value : values) {
    Matcher matcher = pattern.matcher(value);
    if (matcher.matches()) {
        // do something
    }
}

An alternative to the above optimization is using the same Matcher instance with its reset() method:

Pattern pattern = Pattern.compile(regexPattern);
Matcher matcher = pattern.matcher("");
for(String value : values) {
    matcher.reset(value);
    if (matcher.matches()) {
      // do something
    }
}

Due to the fact of Matcher isn’t thread safe, we have to be cautious with the use of this variation. It could be likely dangerous in multi-threaded scenarios.

To summarize, in every situation where we’re sure there’s only one user of the Matcher at any point in time, it’s OK to reuse it with reset. For the rest, reusing the precompiled it’s enough.

3.2. Working with Alternation

As we just checked in the last section, the inadequate use of alternations could be harmful to the performance. It’s important to place options that are more likely to happen in the front so they can be matched faster.

Also, we have to extract common patterns between them. It isn’t the same to put:

(travel | trade | trace)

Than:

tra(vel | de | ce)

The latter is faster because the NFA will try to match “tra” and won’t try any of the alternatives if it doesn’t find it.

3.3. Capturing Groups

Each time we’re capturing groups, we’re incurring in a small-time penalty.

If we don’t need to capture the text inside a group, we should consider the use of non-capturing groups. Instead of use “*(M)“, please use “(?:M)*“.

4. Conclusion

In this quick article, we briefly revisited how NFA works. We then proceeded to explore how to optimize the performance of our regular expressions by pre-compiling our patterns and reuse a Matcher.

Finally, we pointed out a couple of considerations to keep in mind while we work with alternations and groups.

As usual, the full source code can be found over on GitHub.


» 下一篇: JSON指针概述