C++ Regular Expression Performance

A previous post gave some examples to illustrate how to use regular expressions as defined by the upcoming C++0x standard. The new regex features are certainly easy to use but what about the performance of some of the current implementations?

To find out, I compared the performance of the Micirosoft Visual Studio 2008 implementation against the regex implementation from the Boost library v1.36, and also against the PCRE v7.7 library. Microsoft introduced C++0x regex support in Visual Studio 2008 feature pack #1 and subsequently improved regex performance in Visual Studio 2008 service pack #1. Both the fp1 and sp1 implementations were included in the comparison analysis. The Boost library was chosen for comparison since this was the initial C++ regex implementation that was the basis for the proposal for inclusion into C++0x. PCRE was chosen for comparison since this is a very popular and in many respects the defacto regular expression library for use in C/C++ applications.

Methodology
The regular expression patterns and data strings to use for comparison were taken from here. Three patterns of various complexity were selected and each of these was run against a set of five content strings. Each pattern/string search combination was run for 10000 iterations and the elapsed time was measured for each combination. The code used to perform the test is available here. Multiple runs were conducted to ensure that the results were consistent across each run. Once all the data was gathered, the results were normalized against libPcre – hence all the graphs show libPcre at a value of 1. By taking this normalization approach it is easier to compare the relative performance of each implementation.

ID Patterns Content Strings
1 ^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*) http://www.linux.com/
2 usd [+-]?[0-9]+.[0-9][0-9] http://www.thelinuxshow.com/main.php3
3 \b(\w+)(\s+\1)+\b usd 1234.00
4   he said she said he said no
5   same same same

Results
The results are shown for each of the three patterns in the graphs below – one per pattern. The numbers along x-axis in each graph corresponds to the strings numbered 1-5 from the table above.

regex_bench_pattern1

Regex Benchmark - Pattern #1

regex_bench_pattern2

RegEx Benchmark - Pattern #2

regex_bench_pattern3

RegEx Benchmark - Pattern #3

Based on the data shown in the graphs, a number of observations can be made:

  1. libPCRE v7.7 has the best performance for all pattern / string combinations.
  2. Performance varies significantly for different patterns. For example, for pattern #1, the boost implementation was faster than either of the Visual Studio 2008 (vc9) versions, but was slower for patterns #2 and #3. This seems to suggest that the Visual Studio 2008 implementations show relatively poor performance for patterns that contain alterations and a number of groups and nested groups.
  3. Based on patterns #1 and #2 the Visual C++ regex implementation performance has improved as claimed between feature pack 1 (fp1) and service pack 1 (sp1) but this is not completely true as can be seen from the pattern #3 results.

Conclusion
While the C++0x regex library makes it easier to use regular expressions in C++ code, there are still some performance gaps between the Visual Studio 2008 implementation and the performance of existing regular expression libraries such as libPCRE.

Note: The data shown in this article is based on a small sample of regular expression patterns and content strings and so the results should be taken in that context.