Optimising Java Regular Expressions

Yesterday I had a bit of interesting work performancing tuning some code for our project’s current iteration. The code that required optimisation basically filtered some results in memory based on a set of user-specified criteria. The criteria contained a variety of different datatypes (boolean, numeric and string) and my suspicions were confirmed after a bit of simplistic profiling that the biggest consumer of computation time was the string pattern matching that was being performed.

So here are a few tips for optimising Regular Expression usage in the Java langauge:

Minimise the number of times a regular expression is compiled (preferrably statically at source compilation time):
Take this code as an example and note that the pattern should be moved outside of the loop:
    public List getMatchingValues(        String [] extremelyLongListOfValuesForComparison, String stringToMatchOn) {         List listOfMatchingStrings = new ArrayList(extremelyLongListOfValuesForComparison.length);             for (int i = 0; i < extremelyLongListOfValuesForComparison.length; i++ ) {                 // !!! This pattern should be initialised outside of the loop                 Pattern pattern = Pattern.compile(stringToMatchOn + ".*");                 if (pattern.matcher(extremelyLongListOfValuesForComparison[i]).matches()) {                     listOfMatchingStrings.add(extremelyLongListOfValuesForComparison[i]);                 }             }         return listOfMatchingStrings;     }</>
Be aware of libraries and the JDK where convenience methods compile their own Pattern instances.
Many useful java.lang.String operations in the JDK use it for example (and is well documented). These include matches, replaceAll, replaceFirst, and split.
Prefer Matcher.find() over Matcher.matches() if you can.
Think of the situation where you are looking for two words that appear in a sentence in order. On initial thought, a regular expression like, ".*two.*words.*", might jump to mind, of which you would instantiate a matcher and call the matches() operation. We actually found that simplifying the regular expression to "two.*words", creating a matcher and then calling the find() operation increased performance significantly.
Construct your Regular Expressions carefully when you need to ignore whitespace.
For the above scenario, a user could enter several words to find any matches (in order) against a given string. Our initial code in constructing the string for the pattern was something like:

// replace any space with a wildcard match on any string in between
searchPatternString = searchTermString.replace(" ", ".*");

After a bit of testing though, a string like "find  me" (note the two spaces) resulted in a pattern that would have looked like "find.*.*me". One might expect that when the pattern was compiled, it could have been internally optimised to an equivalent pattern "find.*me" but apparently did not. A better revised pattern was thus developed using "\s*" (any and all white space characters) and of course, rule number 1 was applied afterwards.