Post-Increment vs Pre-Increment

There is a well known C++ BKM for preferring pre-increment (++i) over post-increment (i++). The rationale is that when post-incrementing and object, the object must increment itself and then return a temporary containing its old value. While this makes perfect sense for complex types (e.g. STL iterators) I’ve always wondered if it holds true for built-in types such as int. Let’s find out.

First let’s create a trivial example to examine:

#include <iostream>

using namespace std;

int main()
{
	// post-imcrement
	for(int i=0; i<10; i++)
	{
		cout << "[" << i << "]" << endl;
	}

	// pre-imcrement
	for(int i=0; i<10; ++i)
	{
		cout << "[" << i << "]" << endl;
	}

	return 0;
}

Now using the Visual Studio 2008 C++ compiler, let’s take a look at the generated assembly code. First let’s keep things simple and disable compiler optimizations (/Od):

	// post-imcrement
	for(int i=0; i<10; i++)
004114D3  mov         dword ptr [i],0 
004114D5  jmp         main+30h (4114E0h) 
004114D7  mov         eax,dword ptr [i] 
004114DA  add         eax,1 
004114DD  mov         dword ptr [i],eax 
004114E0  cmp         dword ptr [i],0Ah 
004114E4  jge         main+86h (411536h) 
	{

<... snip ...>

	// pre-imcrement
	for(int i=0; i<10; ++i)
00411536  mov         dword ptr [i],0 
0041153D  jmp         main+98h (411548h) 
0041153F  mov         eax,dword ptr [i] 
00411542  add         eax,1 
00411545  mov         dword ptr [i],eax 
00411548  cmp         dword ptr [i],0Ah 
0041154C  jge         main+0EEh (41159Eh) 
	{

The relevant parts of the assembly code are highlighted. As you can see, there is no difference between the generated code for post-increment and pre-increment.

Now, with compiler optimizations enabled (\O2) and the relevant parts highlighted:

	// post-imcrement
	for(int i=0; i<10; i++)
00401008  xor         esi,esi 
0040100A  lea         ebx,[ebx] 
	{
<... snip ...>
0040104A  inc         esi  
0040104B  cmp         esi,0Ah 
0040104E  jl          main+10h (401010h) 
	}

	// pre-imcrement
	for(int i=0; i<10; ++i)
00401050  xor         esi,esi 
	{
<... snip ...>
0040108C  inc         esi  
0040108D  cmp         esi,0Ah 
00401090  jl          main+52h (401052h) 
	}

Again there is no difference in the generated assembly between post-increment and pre-increment.

So, based on this quick experiment, there is no difference between pre-increment and post-increment for int types when using the Visual Studio 2008 C++ compiler with optimizations enabled or disabled.

Note that this doesn’t change the BKM to prefer pre-increments over post-increments when using integer types. The above experiment was only conducted with a single compiler – other compilers may behave differently and so it is best to err on the side of caution and use pre-increments where possible. Also, the use of pre-increment adds some future proofing – if the type of the variable being incremented changed to a more complex type then pre-increment then becomes the optimal choice.

Update: STL vs Gnulib Performance

Following up on a previous post about the relative performance of STL vs gnulib, this post extends and completes the analysis to include the stl::vector and stl::deque containers. These containers provide a more direct comparison against the Gnulib array-list and carray-list implementations. The results are shown below. Please refer to the previous post on this subject for details on how the results were collected including source code.

DISCLAIMER: Please note that the intent of the post is not to suggest that gnulib should be used instead of STL in C++ programs, but rather to point out that although not part of the C language specification, gnulib offers C programmers a set of containers whose performance is comparable to the STL container performance available to C++ programmers.

Push back (insert at end) performance

Push back (insert at end) performance

Search Performance

Search (unsorted) Performance

Note that the vertical axis represents seconds and the lower the bar for each algorithm, the better.

Based on the above results, it can be seen that the performance of the stl::list is comparable to the gnulib linked_list for both insertions (at the end) and unsorted searching. Gnulib array_list is comparable in implementation to the stl::vector but proves to be almost 2x faster for insertions and ~1.5x slower for searching for this test. The stl::deque is also comparable to the gnulib array_list and proves to be slightly faster for insertions and just under 3x slower for searching.

As a final note, this analysis only performed comparisons based on two tests – push back and unsorted find. This is just an indication of performance and far from complete. The results may well be different if different tests were chosen.

STL vs Gnulib Performance

STL (Standard Template Library) adds containers and algorithms to the C++ language. There is no equivalent library in the C specification but there are a number of available libraries that provide common data structure and algorithm implementations which may be used in C programs. Once such library is gnulib which provides, amongst other things, implementations for common container data structures such as lists and trees. This post provides a quick comparison of the performance of a number of the gnulib list implementations against the performance of the STL list implementation. For this analysis, gcc v4.0.1 using -O2 optimization and the gnulib source from June 13th, 2009 were used.

The Benchmark
A very simple benchmark was chosen for the performance characterization. Two operations were measured. The first was the insertion of a large number of integers into a list, with the integers being added one at a time to the end of the list. The second was the searching for elements within the list. It is acknowledged that these are very simple tests and that any results are somewhat inconclusive in the absence of a larger and more thorough suite of performance tests.

STL list benchmarking source code:


// The insert code
start = clock();
for(uint32_t i=0; i<INSERT_ITERATIONS; ++i)
    l.push_back(i);
stop = clock();

// The search code
start = clock();
for(uint32_t i=0; i<SEARCH_ITERATIONS; ++i)
    find(l.begin(), l.end(), i);
stop = clock();

Gnulib list benchmarking source code:

// The insert code
start = clock();
for(i=0; i<INSERT_ITERATIONS; ++i)
    gl_list_add_last(list, (void*)i);
stop = clock();

// The search code
start = clock();
for(i=0; i<SEARCH_ITERATIONS; ++i)
    gl_list_search(list, (void*)i);
stop = clock();

Both STL and gnulib provide a number of container implementations. For this analysis the focus is on the list container implementations. Although gnulib provides a number of “list” implementations only the gl_linked_list container is a true linked list and the comparison between this and the STL list is the most relevant comparison. However, for illustrating the comparative performance of the other gnulib list implementations, four gnulib list implementations were selected for benchmarking:

  • gl_linked_list
  • gl_linkedhash_list
  • gl_array_list
  • gl_rbtree_list

The comparative performance of the relevant STL containers that are most similar to the gnulib list containers other than the gl_linked_list will be provided in a follow-up post.

Finally, the tests were run on a Macbook running OS X 10 with a 2.4GHz Intel Core 2 Duo processor with 3MB L2 cache and 2GB of RAM.

Results
The results for each algorithm were normalized with respect to the STL list results. So, with this normalization, a number less than one means that this is faster than the STL list implementation and a number greater than one means that this is slower than the STL list implementation.

The normalized results are shown in the graph below.

list_perf

Based on the above results, the STL list implementation and the gnulib gl_linked_list implementation are very similar in terms of performance with the gnulib implementation being very slightly faster. The gl_array_list implementation offers the best insertion performance (>5x faster than the STL implementation) and the gl_linkedhash_list implementation offers the best search performance (~5000x faster than the STL implementation).

Conclusions
The pure linked list implementations of both STL list and the gl_linked_list are almost identical in performance for both tests.

While the above results show the relative performance of the STL list against the other gnulib container implementations, this is not a fair and complete comparison as the STL offers other sequence containers that have similar implementations to some of the gnulib containers analyzed here. A follow up post will fill out the comparison started here and include STL vectors and sets into the comparison.

EDIT: The original intent of this post to compare an STL list against the full suite of gnulib list implementations was deeply flawed. The gnulib “list” interface is closer to the more generic STL sequence interface and a more correct and complete comparison would be to compare the gnulib containers against the STL containers. This will be a focus of a follow-up post. I’ve edited this post to clarify the comparison in order to avoid confusion. While I realized my mistake shortly after publishing the post, I was happy to see that this was also picked up by several readers – so many thanks to those of you who provided comments for keeping me honest!

C++ Regular Expression Performance

A previous post gave some examples to illustrate how to use regular expressions as defined by the upcoming C++0x standard. The new regex features are certainly easy to use but what about the performance of some of the current implementations?

To find out, I compared the performance of the Micirosoft Visual Studio 2008 implementation against the regex implementation from the Boost library v1.36, and also against the PCRE v7.7 library. Microsoft introduced C++0x regex support in Visual Studio 2008 feature pack #1 and subsequently improved regex performance in Visual Studio 2008 service pack #1. Both the fp1 and sp1 implementations were included in the comparison analysis. The Boost library was chosen for comparison since this was the initial C++ regex implementation that was the basis for the proposal for inclusion into C++0x. PCRE was chosen for comparison since this is a very popular and in many respects the defacto regular expression library for use in C/C++ applications.

Methodology
The regular expression patterns and data strings to use for comparison were taken from here. Three patterns of various complexity were selected and each of these was run against a set of five content strings. Each pattern/string search combination was run for 10000 iterations and the elapsed time was measured for each combination. The code used to perform the test is available here. Multiple runs were conducted to ensure that the results were consistent across each run. Once all the data was gathered, the results were normalized against libPcre – hence all the graphs show libPcre at a value of 1. By taking this normalization approach it is easier to compare the relative performance of each implementation.

ID Patterns Content Strings
1 ^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*) http://www.linux.com/
2 usd [+-]?[0-9]+.[0-9][0-9] http://www.thelinuxshow.com/main.php3
3 \b(\w+)(\s+\1)+\b usd 1234.00
4   he said she said he said no
5   same same same

Results
The results are shown for each of the three patterns in the graphs below – one per pattern. The numbers along x-axis in each graph corresponds to the strings numbered 1-5 from the table above.

regex_bench_pattern1

Regex Benchmark - Pattern #1

regex_bench_pattern2

RegEx Benchmark - Pattern #2

regex_bench_pattern3

RegEx Benchmark - Pattern #3

Based on the data shown in the graphs, a number of observations can be made:

  1. libPCRE v7.7 has the best performance for all pattern / string combinations.
  2. Performance varies significantly for different patterns. For example, for pattern #1, the boost implementation was faster than either of the Visual Studio 2008 (vc9) versions, but was slower for patterns #2 and #3. This seems to suggest that the Visual Studio 2008 implementations show relatively poor performance for patterns that contain alterations and a number of groups and nested groups.
  3. Based on patterns #1 and #2 the Visual C++ regex implementation performance has improved as claimed between feature pack 1 (fp1) and service pack 1 (sp1) but this is not completely true as can be seen from the pattern #3 results.

Conclusion
While the C++0x regex library makes it easier to use regular expressions in C++ code, there are still some performance gaps between the Visual Studio 2008 implementation and the performance of existing regular expression libraries such as libPCRE.

Note: The data shown in this article is based on a small sample of regular expression patterns and content strings and so the results should be taken in that context.

Regular Expressions in C++

Up to now, regular expression support in C/C++ programs was achieved using third party or open source regular expression libraries such as the PCRE library. With the addition of regex support to the C++ standard library as part of the C++0x standard update, using regular expressions in C++ programs has become much simpler. This feature is included in the TR1 draft reportwhich has already been implemented in some popular compilers such as gcc and Visual Studio 2008 (as part of service pack 1).  

Six regular expression grammars will be supported in C++0x. The default is based upon the ECMAScript grammar specified in ECMA-262. This syntax is based upon the PCRE syntax and is used by languages such as Perl, Python and Ruby which also provide built in regular expression support. Other supported grammars include the POSIX regex syntax, and the syntaxes used in tools such as awk, grep and egrep.

Here are some examples that illustrate how to perform some basic tasks with the new C++ regex component of the standard library.

Header files and namespaces:

#include <regex>

using namespace std::tr1;

Finding a match:

regex rgx("ello");
assert(regex_search("Hello World", rgx));

The above example illustrates the construction of a regex object, with the regex pattern being passed as a parameter to the regex constructor. The regex object is a specialization of the basic_regex template for working with regular expressions which are provided using sequences of chars. The regex_search()function template is then used to see if the “Hello world” string contains the “ello” pattern. This function returns true as soon as the first matching substring is found. The regex_search()function is also overloaded to provide versions that take sequence iterators as params (instead of a full string) and also versions that provide additional info on the match results.

Note: The use of assert() in the examples is used to highlight the “contract” provided by the api – e.g. to highlight if a function can be used in a conditional expression and if the function should return true or false for the particular example.

Finding an exact match:
The regex_match() function template is an alternative to regex_search() and is used when the target sequence must exactly match the regular expression.

regex rgx("ello");
assert(regex_match("Hello World", rgx) == false);
assert(regex_match("ello", rgx));

Finding the position of a match:
The sub_match or match_results template is used to receive search results from regex_search(). When searching char data, the library provides a ready made specialization of match_results called cmatch.

regex rgx("llo");
cmatch result;
regex_search("Hello World", result, rgx);
cout << "Matched \"" << result.str()
    << "\" after \"" << result.prefix()
    << "\" at offset: " << result.position()
    << " with length: " << result.length()
    << endl;

Working with capture groups:
Capture groups provide a means for capturing matched regions within a regular expression. Each captured region is represented by a sub_match template object. The smatch specialization of match_results is provided by the library for working with sequences of string sub-matches.

string seq = "foo@helloworld.com";
regex rgx("(.*)@(.*)");
smatch result;
regex_search(seq, result, rgx);
for(size_t i=0; i<result.size(); ++i)
{
    cout << result[i] << endl;
}

Case insensitive searches:

regex rgx("ello", regex_constants::icase);
assert(regex_search("HELLO WORLD", rgx));

Demystifying the volatile keyword

Following on from my earlier post about the restrict keyword, I’d like to try and dispell the myth around the volatile keyword. The meaning of volatile is a popular interview question, particularly for embedded development jobs and I’ve heard some programmers describe the properties of this keyword as if it had super powers.

The volatile keyword does a very simple job. When a variable is marked as volatile, the programmer is instructing the compiler not to cache this variable in a register but instead to read the value of the variable from memory each and every time the variable is used. That’s it – simple isn’t it?

To illustrate the use of the keyword, consider the following example:volatile int* vp = SOME_REGISTER_ADDRESS;
for(int i=0; i<100; i++)     foo(*vp);[/sourcecode] In this simple example, the pointer vp points to a volatile int. The value of this int is read from memory for each loop iteration. If volatile was not specified then it is likely that the compiler would generate optimized code which would read the value of the int once, temporarily store this in a register and then use the register copy during each iteration. Examples of where volatile is often used:

  • When accessing hardware registers via pointers. It is necessary for the generated code to always access the hardware registers value and never a potentially out of date copy.
  • When accessing a variable that is shared between two or more threads or between a thread and an ISR.

Common myths about the volatile keyword:

  • A volatile variable will never reside in cache memory – e.g. within the L2 cache of a processor. This is not true. In the case where the volatile variable is shared between two software threads, it is highly likely that the variable will exist in cached memory and the cache coherency policy will ensure that the threads will see the correct and up-to-date value of the variable in the event that the threads are running on seperate cores that don’t share the same cache. When the volatile variable refers to a hardware register, it is likely that the memory map of the system will be setup such that this register is not cacheable in the processor cache, but this is setup and managed by the appropriate device driver and is not something that is provided by the volatile keyword.

Thats’ it. So the next time you hear somebody waxing lyrical about the super powers of the volatile keyword, please feel empowered to enlighten them!

C99 restrict Keyword

I was pleasently surprised to encounter a C keyword that I had never heard of before. The keyword in question is the restrict keyword which is a type qualifier for pointers and is a formal part of the C99 standard. This keyword allows programmers to declare that pointers which share the same type do not alias each other.  This information can then be used by the compiler to make optimizations when using the pointers. If the data is in fact aliased, the results are undefined.

Consider the following example:

memcpy((void* restrict) dst, (void* restrict) src, size);

 

This tells the compiler that neither the dst or src pointer paramters overlap and so the compiler is free to apply any optimizations – including optimizations that may result in out of order reads/writes.

Mainstream compilers have varying support for this feature.

  • GCC supports it in C99 mode  – specified via the “-std=c99” option or for non-C99 code by specifying __restrict to enable the keyword as a GCC extension.
  • Microsoft’s Visual Studio .NET 2005/2008 compiler doesn’t support this feature as specified in the C99 standard but does provide similar support using the __restrict specifier. Micorosft also allows this keyword to be specified for both C and C++ code. See the MSDN documentation for more details on Microsofts implementation and differences between it’s support and the C99 specification of restrict.

Finally, it should be noted that this keyword is specific to C and is not specified in the 1998 C++ specification nor is it currently planned for inclusion in the fothcoming C++ specification update.

References:

Thread Affinity on OS X

My first experience in developing software to run on OS X has been a disappointing one. Previously, I have written software to run on Windows and Linux and in general I have found library or system API calls that provide me with the ability to access the services that I expect to be provided by the operating system. One such service is thread affinity – i.e. the ability to tie a particular thread to a given core. This is achieved using SetThreadAffinityMask() on Windows and sched_setaffinity() on Linux.

However, I was very surprised to find that it does not appear to be possible to do this on OS X! 

OS X Leopard introduced a thread affinity API which provides the ability to provide affinity hints to the scheduler to improve data locality in caches. Unfortunately this API does not provide the ability to tie a thread to a core. This is a major gap, especially now that all new PCs and laptops are multi-core.

So why do I want to be able to tie a thread to a core? I want to benchmark a piece of code and one of the data points of interest is to benchmark this code on a single core. With the current Mac OS X thread affinity API, this is not possible!

I’m finding it hard to believe that this service does not exist on OS X – I mean surely somebody must have wanted to run a single core benchmark on OS X running on a multi-core system. But after a couple of hours of googling and reading through the OS X APIs I have been unable to find out how to do this.

Visual C++ 2008 Feature Pack Released

Previous posts have mentioned the Visual C++ 2008 feature pack which was available as a beta release. The full version has been released and is available for download

The previously reported bug with array::max_size() has now been fixed.

Some TR1 related resources from Microsoft:

C++ Lambda Functions

A previous post showed a code snippet for dumping the contents of an STL container using an ostream_iterator. Things have now gotten a little bit easier with the introduction of lambda functions into the upcoming C++0x standard. Lambda functions allow pieces of code to be passed around as if they were ordinary objects. Applying lambda functions for dumping the contents of an STL container gives:

vector v(10);
generate(v.begin(), v.end(), rand);
for_each(v.begin(), v.end(),
    [](int& x){ cout << x << " "; }) [/sourcecode] OK, so the syntax looks a bit strange at first, but it does add a powerful construct to the language. Lambdas have only just been added to the C++0x standard and support for lambdas are not included in the beta version of the Visual Studio 2008 TR1 feature pack.

For more details and examples of lambda functions, check out Herb Sutter’s recent trip report from the February/March ISO C++ standards meeting.