Regular Expressions for Counting Lines of Code

One of the applications that I implement as a means of gaining experience in different programming languages is an application to generate code metrics. I implement the application in the chosen language and use it to count comments, lines of code etc. in source files of a variety of programming languages. Implementing this application is a useful learning exercise since it provides exposure to a variety of language features such as file handling, regular expressions and also csv and html generation (for presenting the output statistics). Correctly using regular expressions for finding source code comments right can be a tricky thing and this is the focus of this post. A Ruby based implementation of this application, called countloc, is hosted on RubyForge for reference.

A basic set of code metrics include:

  • lines of code 
  • comment lines
  • blank lines
  • total lines

For comment lines, there are three types of comments to worry about:

  • single line comments
  • multi-line comments
  • mixed comments and code in a single line

For multiline comments, it is necessary to detect the beginning and the end of the comment. For a mixture of comments and code on a single line, it is necessary to distinguish between lines containing only comments so that the “lines of code” count can be incremented. The set of regular expressions used by countloc for finding comments in Ruby, Python, Java, C++, C, C# code is shown below. Note that languages such as C, C++, C# and Java all share similar commenting styles and they are covered under the cplusplus category. Note also that escaping is required for encoding the //, /* and */ characters for C++ comments in Ruby regular expressions.

COMMENT_PATTERNS = {
  :ruby => {
    :single_line_full => /^\s*#/,
    :single_line_mixed => /#/,
    :multiline_begin => /=begin(\s|$)/,
    :multiline_end => /=end(\s|$)/,
    :blank_line => /^\s*$/
  },

  :python => {
    :single_line_full => /^\s*#/,
    :single_line_mixed => /#/,
    :multiline_begin => /^\s*"""/,
    :multiline_end => /"""/,
    :blank_line => /^\s*$/
  },

  :cplusplus => {
    :single_line_full => /^\s*\/\//, 
    :single_line_mixed => /\/\//,
    :multiline_begin => /\/\*/,
    :multiline_end => /\*\/\s*$/,
    :multiline_begin_mixed => /^[^\s]+.*\/\*/,
    :multiline_end_mixed => /\*\/\s*[^\s]+$/,
    :blank_line => /^\s*$/
  }
}

As each line of source code is read, the above regular expressions are evaluated in the following order with the convention that as soon as a match is found, no further regular expressions need to be evaluated … unless otherwise noted.

  1. multiline_end_mixed (if within a multiline comment) 
  2. multiline_end (if within a multiline comment)
  3. multiline_begin … then check for multiline_begin_mixed and multiline_end (to check for a complete  multiline comment on a single line!)
  4. single_line_full
  5. single_line_mixed
  6. blank_line

Based on the countloc implementation and the code samples used to test it, this order appears to be sufficient. However, I sometimes find a corner case not covered by the regular expressions or evaluation order given above, so I am interested in any feedback with suggested improvements or optimizations.

C++ Regular Expression Performance

A previous post gave some examples to illustrate how to use regular expressions as defined by the upcoming C++0x standard. The new regex features are certainly easy to use but what about the performance of some of the current implementations?

To find out, I compared the performance of the Micirosoft Visual Studio 2008 implementation against the regex implementation from the Boost library v1.36, and also against the PCRE v7.7 library. Microsoft introduced C++0x regex support in Visual Studio 2008 feature pack #1 and subsequently improved regex performance in Visual Studio 2008 service pack #1. Both the fp1 and sp1 implementations were included in the comparison analysis. The Boost library was chosen for comparison since this was the initial C++ regex implementation that was the basis for the proposal for inclusion into C++0x. PCRE was chosen for comparison since this is a very popular and in many respects the defacto regular expression library for use in C/C++ applications.

Methodology
The regular expression patterns and data strings to use for comparison were taken from here. Three patterns of various complexity were selected and each of these was run against a set of five content strings. Each pattern/string search combination was run for 10000 iterations and the elapsed time was measured for each combination. The code used to perform the test is available here. Multiple runs were conducted to ensure that the results were consistent across each run. Once all the data was gathered, the results were normalized against libPcre – hence all the graphs show libPcre at a value of 1. By taking this normalization approach it is easier to compare the relative performance of each implementation.

ID Patterns Content Strings
1 ^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*) http://www.linux.com/
2 usd [+-]?[0-9]+.[0-9][0-9] http://www.thelinuxshow.com/main.php3
3 \b(\w+)(\s+\1)+\b usd 1234.00
4   he said she said he said no
5   same same same

Results
The results are shown for each of the three patterns in the graphs below – one per pattern. The numbers along x-axis in each graph corresponds to the strings numbered 1-5 from the table above.

regex_bench_pattern1

Regex Benchmark - Pattern #1

regex_bench_pattern2

RegEx Benchmark - Pattern #2

regex_bench_pattern3

RegEx Benchmark - Pattern #3

Based on the data shown in the graphs, a number of observations can be made:

  1. libPCRE v7.7 has the best performance for all pattern / string combinations.
  2. Performance varies significantly for different patterns. For example, for pattern #1, the boost implementation was faster than either of the Visual Studio 2008 (vc9) versions, but was slower for patterns #2 and #3. This seems to suggest that the Visual Studio 2008 implementations show relatively poor performance for patterns that contain alterations and a number of groups and nested groups.
  3. Based on patterns #1 and #2 the Visual C++ regex implementation performance has improved as claimed between feature pack 1 (fp1) and service pack 1 (sp1) but this is not completely true as can be seen from the pattern #3 results.

Conclusion
While the C++0x regex library makes it easier to use regular expressions in C++ code, there are still some performance gaps between the Visual Studio 2008 implementation and the performance of existing regular expression libraries such as libPCRE.

Note: The data shown in this article is based on a small sample of regular expression patterns and content strings and so the results should be taken in that context.

Regular Expressions in C++

Up to now, regular expression support in C/C++ programs was achieved using third party or open source regular expression libraries such as the PCRE library. With the addition of regex support to the C++ standard library as part of the C++0x standard update, using regular expressions in C++ programs has become much simpler. This feature is included in the TR1 draft reportwhich has already been implemented in some popular compilers such as gcc and Visual Studio 2008 (as part of service pack 1).  

Six regular expression grammars will be supported in C++0x. The default is based upon the ECMAScript grammar specified in ECMA-262. This syntax is based upon the PCRE syntax and is used by languages such as Perl, Python and Ruby which also provide built in regular expression support. Other supported grammars include the POSIX regex syntax, and the syntaxes used in tools such as awk, grep and egrep.

Here are some examples that illustrate how to perform some basic tasks with the new C++ regex component of the standard library.

Header files and namespaces:

#include <regex>

using namespace std::tr1;

Finding a match:

regex rgx("ello");
assert(regex_search("Hello World", rgx));

The above example illustrates the construction of a regex object, with the regex pattern being passed as a parameter to the regex constructor. The regex object is a specialization of the basic_regex template for working with regular expressions which are provided using sequences of chars. The regex_search()function template is then used to see if the “Hello world” string contains the “ello” pattern. This function returns true as soon as the first matching substring is found. The regex_search()function is also overloaded to provide versions that take sequence iterators as params (instead of a full string) and also versions that provide additional info on the match results.

Note: The use of assert() in the examples is used to highlight the “contract” provided by the api – e.g. to highlight if a function can be used in a conditional expression and if the function should return true or false for the particular example.

Finding an exact match:
The regex_match() function template is an alternative to regex_search() and is used when the target sequence must exactly match the regular expression.

regex rgx("ello");
assert(regex_match("Hello World", rgx) == false);
assert(regex_match("ello", rgx));

Finding the position of a match:
The sub_match or match_results template is used to receive search results from regex_search(). When searching char data, the library provides a ready made specialization of match_results called cmatch.

regex rgx("llo");
cmatch result;
regex_search("Hello World", result, rgx);
cout << "Matched \"" << result.str()
    << "\" after \"" << result.prefix()
    << "\" at offset: " << result.position()
    << " with length: " << result.length()
    << endl;

Working with capture groups:
Capture groups provide a means for capturing matched regions within a regular expression. Each captured region is represented by a sub_match template object. The smatch specialization of match_results is provided by the library for working with sequences of string sub-matches.

string seq = "foo@helloworld.com";
regex rgx("(.*)@(.*)");
smatch result;
regex_search(seq, result, rgx);
for(size_t i=0; i<result.size(); ++i)
{
    cout << result[i] << endl;
}

Case insensitive searches:

regex rgx("ello", regex_constants::icase);
assert(regex_search("HELLO WORLD", rgx));