Keeping regular expressions readable

It’s easy for a regular expression to become one long string full of round and square brackets, backslashes, and other symbols. That is concise, but often at the cost of readability. We’ve recently been using a pattern that works well for us: we break the expression down into its important constituents, still defined as a single string, with a brief explanation for each part. Here’s a simple example:

Instead of

public static final String SPECIAL_PATTERN = "(\\w)+-([a-zA-Z0-9])+-([a-zA-Z])+/\\w+";

we modify the declaration slightly so it becomes

public static final String SPECIAL_PATTERN
  = "(\\w)+"          // at least one letter, number or underscore
  + "-([a-zA-Z0-9])+" // dash, then at least one letter or number
  + "-([a-zA-Z])+"    // dash, then at least one letter
  + "/\\w+";          // slash, then at least one letter, number or underscore

In our situation, I think applying Extract Constant or Extract Variable would have reduced readability, so this is a nice trade-off between conciseness and readability.
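As a rough illustration, here is a minimal sketch of how such a commented pattern might be compiled and used with java.util.regex.Pattern. It splits the one-line pattern shown above into commented pieces. The class name SpecialPatternDemo, the helper isSpecial, and the sample inputs are hypothetical and exist only for this example.

```java
import java.util.regex.Pattern;

public class SpecialPatternDemo {
    // Same value as the one-line pattern; the compiler joins the literal pieces.
    public static final String SPECIAL_PATTERN
        = "(\\w)+"           // at least one letter, number or underscore
        + "-([a-zA-Z0-9])+"  // dash, then at least one letter or number
        + "-([a-zA-Z])+"     // dash, then at least one letter
        + "/\\w+";           // slash, then at least one letter, number or underscore

    // Compile once and reuse; Pattern instances are immutable and thread-safe.
    private static final Pattern COMPILED = Pattern.compile(SPECIAL_PATTERN);

    // Hypothetical helper: true when the whole input matches the pattern.
    public static boolean isSpecial(String input) {
        return COMPILED.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Sample inputs are made up for illustration.
        System.out.println(isSpecial("abc_1-x9-Yz/tail")); // true
        System.out.println(isSpecial("abc-12-34/tail"));   // false: third part has no letters
    }
}
```

Since the pieces are all string literals, the concatenation is a compile-time constant, so the comments cost nothing at runtime.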

5 comments

  1. Vivek Haridas

    I found that keeping a sample expected matching text commented along with the pattern helped to recognize them faster.
    And, if there are special cases to match due to which a pattern text looks complicated, it helps to have an example of all of them as comments.

    Also, in the future, as soon as the pattern changes, it helps in quickly recognizing the before and after scenario while making the subtle changes.

    Taking the idea to an extreme, you could probably have an array of example texts built into the class, which a unit test could look up and assert on. Though it’s not wise to keep test data inside the class, in this case it seems relevant for readability and also acts as a safety net.

  2. Patrick

    Vivek – We actually have an expanding set of matching and non-matching examples in unit tests. They’re not quite as close to the production code as you describe, though I think it’s clear enough.

    Nat – I had a look at it and I think it looks quite readable as well. A shame that we’re not using Hamcrest on my current project (yet!)

  3. Wee

    Pat,

    How about a “change comments into names” refactoring? Now we have readability and reusability with no duplicated comments. 🙂

    public static final String AT_LEAST_1_LETTER_OR_NUMBER_OR_UNDERSCORE = "(\\w)+";
    public static final String DASH_WITH_AT_LEAST_1_NUMBER_OR_LETTER = "-([a-zA-Z])+";
    public static final String SPECIAL_PATTERN =
        AT_LEAST_1_LETTER_OR_NUMBER_OR_UNDERSCORE +
        DASH_WITH_AT_LEAST_1_NUMBER_OR_LETTER +
        DASH_WITH_AT_LEAST_1_NUMBER_OR_LETTER +
        AT_LEAST_1_LETTER_OR_NUMBER_OR_UNDERSCORE;

  4. Patrick

    Wee,

    I thought about turning the comments into names. Unfortunately, I think the caps and underscores, together with the plethora of other constants, would ruin the readability in our particular usage, so we decided against it. If we had more expressions, I’d consider it again, though I’d probably push them out with Extract Class instead and maybe end up with something like what Nat did.

    Thanks for the thoughts though!
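
Two ideas from the comments, keeping example inputs next to the pattern and extracting the pattern into its own class, could be sketched together roughly as follows. This is only an illustration under assumptions: the class name SpecialPattern, the example strings, and the selfCheck method are all hypothetical, and in a real project the check would more likely live in a unit test.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public final class SpecialPattern {
    private static final Pattern PATTERN = Pattern.compile(
          "(\\w)+"           // at least one letter, number or underscore
        + "-([a-zA-Z0-9])+"  // dash, then at least one letter or number
        + "-([a-zA-Z])+"     // dash, then at least one letter
        + "/\\w+"            // slash, then at least one letter, number or underscore
    );

    // Hypothetical examples kept beside the pattern as living documentation.
    static final String[] MATCHING = { "a-b1-c/d", "user_1-x9-Yz/tail" };
    static final String[] NON_MATCHING = { "a-b1-c", "a-b1-9/d", "-b1-c/d" };

    private SpecialPattern() {
        // static helpers only; not meant to be instantiated
    }

    // True when the whole input matches the pattern.
    public static boolean matches(String input) {
        return PATTERN.matcher(input).matches();
    }

    // True when every documented example behaves as advertised.
    static boolean selfCheck() {
        return Arrays.stream(MATCHING).allMatch(SpecialPattern::matches)
            && Arrays.stream(NON_MATCHING).noneMatch(SpecialPattern::matches);
    }

    public static void main(String[] args) {
        System.out.println("examples consistent: " + selfCheck());
    }
}
```

If the pattern changes later, a failing selfCheck (or the equivalent unit test) points straight at the examples that no longer hold.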
