Regular expression (regex) performance in programming: The fundamental guide

There is a good chance that you touch regular expressions from time to time if you are in any way involved with IT operations or development. They are not easy to learn, and successfully matching a string with a regex after a long debugging session can be a very rewarding feeling.

Once you feel familiar and experienced enough, you should take the next step and learn how to write effective, fast regular expressions that do not suddenly degrade in performance.

There are many ways to optimize regular expressions and many depend on the input you are running the expression against. This post aims to arm you with the most important techniques.

I expect that you are already somewhat experienced with regular expressions. If you are not, this great site should get you started with a nice interactive tutorial.

Be specific

Tell the regex engine exactly what you are trying to match. Avoid .* wherever you can, because it leads to horrible backtracking: the regex engine jumps to the end of the input and then steps back character by character until the rest of the pattern matches. Giving as much detail as possible about the kind of characters to expect (number, whitespace, letter?), in which sequence (next are 4 digits, then a whitespace, then two letters), and where to expect them is key to performance.

Many people will have seen a regular expression trying to match a date before. I benchmarked three regular expressions (Java 8, using the regex engine from java.util.regex) that search for an ISO 8601 date:

Input string:

2015-12-15T07:36:25+00:00 sundaysister kernel[0]:
**** [IOBluetoothHostControllerUSBTransport][ClearFeatureInterruptEndpointHalt] --
successfully posting another read for the mInt0InterruptPipe -- 
mInterruptPipeInOutstandingIOCount = 1 -- this = 0xb800

veryDescriptiveMatch: ^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2}
descriptiveMatch: .{4}-.{2}-.{2}T.{2}:.{2}:.{2}\+.{2}:.{2}
openMatch: .*-.*-.*T.*:.*:.*\+

You might think that the openMatch pattern is unrealistically bad, but I have seen expressions like it in the wild, and it perfectly shows how much slower unspecific regular expressions are.

Let’s look at the benchmark results:

# Run complete. Total time: 00:03:53

Benchmark                                Mode  Cnt   Score   Error  Units
DateRegexBenchmark.descriptiveMatch      avgt   50   0.237 ± 0.003  us/op
DateRegexBenchmark.openMatch             avgt   50  32.325 ± 0.361  us/op
DateRegexBenchmark.veryDescriptiveMatch  avgt   50   0.190 ± 0.002  us/op


By telling the regular expression engine very specifically what we are searching for, we not only get more reliable results but also see a huge performance improvement.

If you can be specific, be specific.
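The reliability point is worth spelling out: an unspecific pattern will happily "find" a date shape in garbage. A minimal sketch (the class and input strings are mine, not from the benchmark), using a shortened form of the patterns above:

```java
import java.util.regex.Pattern;

public class SpecificRegexDemo {
    // Specific: digits only, with fixed widths
    static final Pattern SPECIFIC =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}");
    // Unspecific: any characters between the separators
    static final Pattern OPEN = Pattern.compile(".*-.*-.*T.*:.*:.*");

    public static void main(String[] args) {
        String date = "2015-12-15T07:36:25";
        String garbage = "foo-bar-bazTx:y:z";

        // Both patterns find the real date...
        System.out.println(SPECIFIC.matcher(date).find());    // true
        System.out.println(OPEN.matcher(date).find());        // true

        // ...but only the unspecific one also "finds" a date in garbage.
        System.out.println(SPECIFIC.matcher(garbage).find()); // false
        System.out.println(OPEN.matcher(garbage).find());     // true
    }
}
```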

Lazy quantifiers

Replacing greedy quantifiers (such as .*) with lazy quantifiers (such as .*?) can improve performance, in many cases without changing the result. A greedy quantifier tells the regex engine to go to the end of the input and then backtrack until it finds what you are looking for, which can be slow. Changing it to a lazy quantifier makes the regex engine read from the beginning of the string and consume only as much as it needs.

You might already smell a caveat here: if the part you are trying to match is closer to the end of the input, a lazy quantifier might actually hurt performance.

The following benchmark showcases this behavior:

Input strings:

"early":
  at=info status=302 method=GET
  path="/repo/debian/dists/trusty/1.2/binary-amd64/Packages.gz"
  host=packages.graylog2.org request_id=f9de9767-2aa1-4e64-8d82-36f8ace006e1
  fwd="54.215.46.35" dyno=web.1 connect=0ms service=1ms
  bytes=287

"late":
  at=info method=GET path="/repo/debian/dists/trusty/1.2/binary-amd64/Packages.gz"
  host=packages.graylog2.org request_id=f9de9767-2aa1-4e64-8d82-36f8ace006e1
  fwd="54.215.46.35" dyno=web.1 connect=0ms service=1ms status=302 bytes=287
  
Regular expressions:

"greedy":
  .*status=(\d{3}).*
  
"lazy":
  .*?status=(\d{3}).*

The two input strings have the status=302 that we want to match at different positions: One has it closer to the beginning and one has it closer to the end of the string.

# Run complete. Total time: 00:05:07

Benchmark                                 Mode  Cnt  Score   Error  Units
LazyQuantifierBenchmark.greedyMatchEarly  avgt   50  2.432 ± 0.053  us/op
LazyQuantifierBenchmark.greedyMatchLate   avgt   50  0.788 ± 0.045  us/op
LazyQuantifierBenchmark.lazyMatchEarly    avgt   50  0.623 ± 0.003  us/op
LazyQuantifierBenchmark.lazyMatchLate     avgt   50  2.281 ± 0.154  us/op


You can see that performance changes drastically depending on where the matched part is located in the input: the greedy regex is slow on an early match and fast on a late match, while the lazy regex is fast on an early match and slow on a late match.
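To confirm that the rewrite does not change the result, here is a minimal sketch (the class name and the abbreviated input line are mine) showing that both variants capture the same status code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyQuantifierDemo {
    static final Pattern GREEDY = Pattern.compile(".*status=(\\d{3}).*");
    static final Pattern LAZY   = Pattern.compile(".*?status=(\\d{3}).*");

    // Returns the captured status code, or null if the line does not match.
    static String statusOf(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "at=info status=302 method=GET service=1ms bytes=287";
        // Same capture either way; only the traversal order (and
        // therefore the speed on a given input) differs.
        System.out.println(statusOf(GREEDY, line)); // 302
        System.out.println(statusOf(LAZY, line));   // 302
    }
}
```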

Benchmark, benchmark, benchmark

As always in computing, you should never make a change without measuring the result. Regex tips in particular are highly dependent on the input string, or even on whether the matched part is closer to the beginning or the end of it, as we witnessed above.

Given the complexity of regular expressions, there is no way to be sure that a change makes a regular expression faster or that this particular change is the best way to attack the problem. The only solution for this is to benchmark the performance.

If you are benchmarking Java code, I recommend taking a look at the Java Microbenchmark Harness (JMH), which is easy to get started with and a pleasure to use. This blog post gives a good overview.

Here is the JMH code that was used to run a benchmark showcased in this post:


@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class DateRegexBenchmark {

    private static final String MESSAGE = "2015-12-15T07:36:25+00:00 sundaysister ...";

    private static Pattern VERY_DESCRIPTIVE;
    private static Pattern DESCRIPTIVE;
    private static Pattern OPEN;

    @Setup
    public void prepare() {
        VERY_DESCRIPTIVE = Pattern.compile("^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\+\\d{2}:\\d{2}");
        DESCRIPTIVE = Pattern.compile(".{4}-.{2}-.{2}T.{2}:.{2}:.{2}\\+.{2}:.{2}");
        OPEN = Pattern.compile(".*-.*-.*T.*:.*:.*\\+");
    }

    @Benchmark
    public void veryDescriptiveMatch(Blackhole bh) {
        bh.consume(VERY_DESCRIPTIVE.matcher(MESSAGE).matches());
    }

    @Benchmark
    public void descriptiveMatch(Blackhole bh) {
        bh.consume(DESCRIPTIVE.matcher(MESSAGE).matches());
    }

    @Benchmark
    public void openMatch(Blackhole bh) {
        bh.consume(OPEN.matcher(MESSAGE).matches());
    }

}

Benchmark with different inputs

Make sure to benchmark your regular expressions with different strings as inputs. Look at this example of a regular expression that ends in catastrophic backtracking with a certain string input:

Regular expression: (a+)+Z

"unexpectedInput1": aaaaaaaaaaaaaaaaaaaaaa
"unexpectedInput2": aaaaaaaaaaaaaaaaaaaaaaa
"expectedInput":    aaaaaaaaaaaaaaaaaaaaaaaZ

# Run complete. Total time: 00:02:35

Benchmark                                  Mode  Cnt       Score       Error  Units
UnexpectedInputBenchmark.expectedInput     avgt   50       0.186 ±     0.034  us/op
UnexpectedInputBenchmark.unexpectedInput1  avgt   50  246381.032 ± 23649.682  us/op
UnexpectedInputBenchmark.unexpectedInput2  avgt   50  453904.349 ± 23412.333  us/op

The regular expression expects the input string to end with the character 'Z' and performs well under those circumstances. If the string does not end with 'Z', however, the compute time grows exponentially with the length of the input string.


This is catastrophic backtracking in action where unexpected input can ruin your application performance. Attackers can even use this for Denial of Service (DoS) attacks on your systems if your regular expressions are vulnerable. This type of attack is called ReDoS.
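If you find such a pattern, the fix is usually to remove the nested quantifier or to make the inner quantifier possessive, so the engine has nothing to backtrack into. A sketch (the class name is mine; both rewrites match exactly the same strings as the original):

```java
import java.util.regex.Pattern;

public class RedosDemo {
    // Vulnerable: nested quantifiers cause exponential backtracking.
    static final Pattern VULNERABLE = Pattern.compile("(a+)+Z");
    // Equivalent and safe: (a+)+ accepts exactly the same strings as a+.
    static final Pattern SAFE = Pattern.compile("a+Z");
    // Alternative: the possessive quantifier a++ never gives characters
    // back once matched, so there is nothing to backtrack into.
    static final Pattern POSSESSIVE = Pattern.compile("(a++)+Z");

    public static void main(String[] args) {
        String hostile = "aaaaaaaaaaaaaaaaaaaaaaa"; // 23 'a's, no 'Z'
        // SAFE and POSSESSIVE fail in microseconds on this input;
        // VULNERABLE takes roughly half a second (see benchmark above).
        System.out.println(SAFE.matcher(hostile).matches());        // false
        System.out.println(POSSESSIVE.matcher(hostile).matches());  // false
        System.out.println(SAFE.matcher(hostile + "Z").matches());  // true
    }
}
```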

Let this be a reminder that things like this happen, and run your regex benchmarks with unexpected or even potentially malicious payloads whenever possible.

Continuously run the benchmarks

Regular expressions are hard. They are hard to read, and even harder if you did not write them yourself or are not familiar with the problem they are trying to solve. It is very easy to change a regex in a way that makes it fail catastrophically for certain inputs, or that simply ruins its performance.

A system that runs tens of thousands of regular expressions a second can degrade terribly after a change that only makes a difference in the microsecond range for a single run.

Just like unit tests, you should run benchmarks of your regular expressions for every new revision of your code and get notified about the results. Tools like Jenkins or Travis CI are very suitable for this. (In practice you would probably benchmark larger parts of the business logic rather than isolated regular expressions.)

Try out other regex engines

In the end, a regex engine is nothing but a library, and there is a high chance that your favorite programming language has several engines available. For example, this blog post compares Java regex libraries.

But be careful: regex engines may work differently and give you unexpected results. Something that matched before might no longer match, and vice versa. Don't just replace a regex engine without good test coverage of the affected parts.

Do you really need a regex?

Let's not fool ourselves: regular expressions are slow. Evaluate carefully whether there is a reasonable alternative. Classic string operations can be multiple orders of magnitude faster, and Java, for example, has Guava methods that let you build complex string manipulations in a readable, DSL-like structure.
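As an illustration of the point (using plain JDK string operations rather than Guava; the class name is mine), the status-code extraction from the lazy-quantifier benchmark above could be written without a regex at all:

```java
public class NoRegexDemo {
    // Extracts the digits following "status=" with plain string
    // operations instead of the pattern .*status=(\d{3}).*
    static String statusOf(String line) {
        int start = line.indexOf("status=");
        if (start < 0) {
            return null; // key not present
        }
        start += "status=".length();
        int end = start;
        while (end < line.length() && Character.isDigit(line.charAt(end))) {
            end++;
        }
        return end > start ? line.substring(start, end) : null;
    }

    public static void main(String[] args) {
        System.out.println(statusOf("at=info status=302 method=GET")); // 302
        System.out.println(statusOf("at=info method=GET"));            // null
    }
}
```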

Remember that trying any of this only makes sense if you benchmark and compare the results.
