Minimize Context Switches Through Priority Assignment

A context switch is often one of the most expensive operation in an RTOS (Real Time Operating System). Hence, application developers should aim to minimize the number of context switches in order to minimize the RTOS overhead.

In a pre-emptive RTOS, both the choice of thread priority assignment and the order in which threads become available to run are key to minimizing the number of context switches that occur.

In a series of articles on context switching, William E. Lamie presents data which shows that context switching overhead is minimized when all threads are assigned the same priority. This means that by assigning the same priority to all threads, the system throughput will be maximized – i.e. the fewer the cycles that are spent on context switching overhead then the more that can be spent on application processing.

However, in real-time systems latency is often a concern. Where latency is a concern, the thread that runs the latency sensitive task should be assigned a higher priority.  This will introduce more context switches into the system which will impact throughput.

To achieve the best of both worlds, threads should be assigned an equal priority by default and only latency sensitive tasks should be assigned higher priorities.

UML State Machines

An excellent series of articles on UML state machines and on state machine theory in general …

In particular, I like the practical examples and discussion of implementation options for various aspects of the notation.

This series of articles is taken from the book: Practical UML State Charts In C/C++, Second Edition by Miro Samek.

Regular Expressions for Counting Lines of Code

One of the applications that I implement as a means of gaining experience in different programming languages is an application to generate code metrics. I implement the application in the chosen language and use it to count comments, lines of code etc. in source files of a variety of programming languages. Implementing this application is a useful learning exercise since it provides exposure to a variety of language features such as file handling, regular expressions and also csv and html generation (for presenting the output statistics). Correctly using regular expressions for finding source code comments right can be a tricky thing and this is the focus of this post. A Ruby based implementation of this application, called countloc, is hosted on RubyForge for reference.

A basic set of code metrics include:

  • lines of code 
  • comment lines
  • blank lines
  • total lines

For comment lines, there are three types of comments to worry about:

  • single line comments
  • multi-line comments
  • mixed comments and code in a single line

For multiline comments, it is necessary to detect the beginning and the end of the comment. For a mixture of comments and code on a single line, it is necessary to distinguish between lines containing only comments so that the “lines of code” count can be incremented. The set of regular expressions used by countloc for finding comments in Ruby, Python, Java, C++, C, C# code is shown below. Note that languages such as C, C++, C# and Java all share similar commenting styles and they are covered under the cplusplus category. Note also that escaping is required for encoding the //, /* and */ characters for C++ comments in Ruby regular expressions.

  :ruby => {
    :single_line_full => /^\s*#/,
    :single_line_mixed => /#/,
    :multiline_begin => /=begin(\s|$)/,
    :multiline_end => /=end(\s|$)/,
    :blank_line => /^\s*$/

  :python => {
    :single_line_full => /^\s*#/,
    :single_line_mixed => /#/,
    :multiline_begin => /^\s*"""/,
    :multiline_end => /"""/,
    :blank_line => /^\s*$/

  :cplusplus => {
    :single_line_full => /^\s*\/\//, 
    :single_line_mixed => /\/\//,
    :multiline_begin => /\/\*/,
    :multiline_end => /\*\/\s*$/,
    :multiline_begin_mixed => /^[^\s]+.*\/\*/,
    :multiline_end_mixed => /\*\/\s*[^\s]+$/,
    :blank_line => /^\s*$/

As each line of source code is read, the above regular expressions are evaluated in the following order with the convention that as soon as a match is found, no further regular expressions need to be evaluated … unless otherwise noted.

  1. multiline_end_mixed (if within a multiline comment) 
  2. multiline_end (if within a multiline comment)
  3. multiline_begin … then check for multiline_begin_mixed and multiline_end (to check for a complete  multiline comment on a single line!)
  4. single_line_full
  5. single_line_mixed
  6. blank_line

Based on the countloc implementation and the code samples used to test it, this order appears to be sufficient. However, I sometimes find a corner case not covered by the regular expressions or evaluation order given above, so I am interested in any feedback with suggested improvements or optimizations.

Graphviz and the Visitor Pattern

Graphviz is an excellent and free graph visualization program that provides a simple way to generate graphs for pretty much anything with a variety of layouts. The Graphviz layout programs take descriptions of graphs in a simple text language. These descriptions can be created and edited manually but are more usually generated programatically – especially for large data sets. 

This post provides an example of using the Visitor pattern to generate a directed graph visualization of a trie using the Graphviz dot layout manager. Full python source code listing of this example is available here.

A Trie class is used to represent the trie data structure. This class is composed of a set of nodes representing the trie nodes. These nodes are anchored via a root node. The code for adding data to the trie is omitted for clarity but is available in the full source listing

class Trie:
    def __init__(self):
        self.root = Node()

    def walk(self, visitor):

From a visitor pattern perspective, the Node.walk() method is used to traverse the nodes and to pass each node encountered to the visitor object.

class Node:
    def walk(self, visitor):
        for nextNode in self.children.values():

The GraphvizVisitor class is a Visitor subclass that generates the appropriate dot language representation for each visited node.

# Visitor interface
class Visitor:
    def visitNode(self, node):

# Visitor subclass that generates Graphviz dot file contents.
class GraphvizVisitor:
    def __init__(self, dotGen):
        self.dotGen = dotGen

    def visitNode(self, node):
        if node.isOutputNode:
            dotGen.setNodeProperty(, 'shape', 'doublecircle')
        for edge, nextNode in node.children.items():
            dotGen.addLink(,, edge)

The dotGen object is an instance of the DotFileGenerator class which is used as a simple utility class to generate the dot file.

class DotFileGenerator:
    def __init__(self, name): = name
        self.outf = open(name+'.dot', 'w')
        print >> self.outf, 'digraph %s {' % (
        print >> self.outf, 'trankdir="LR";'
        print >> self.outf, 'tnode [shape = "circle"];'

    def setNodeProperty(self, nodeName, property, value):
        print >> self.outf, 't%s [%s = %s];' %(nodeName, property, value)

    def addLink(self, lhs, rhs, label):
        print >> self.outf, 't%s -> %s [label = %s];' % (lhs, rhs, label)

    def close(self):

Consider the following example to illustrate the generated output:

# Build a trie
trie = Trie()

# Generate the dot file contents by walking the trie
dotGen = DotFileGenerator('trie')
visitor = GraphvizVisitor(dotGen)

This produces the following dot file contents:

digraph trie {
	node [shape = "circle"];
	0 -> 1 [label = a];
	1 -> 2 [label = b];
	2 -> 3 [label = c];
	2 -> 10 [label = d];
	3 [shape = doublecircle];
	3 -> 7 [label = x];
	3 -> 4 [label = d];
	7 -> 8 [label = y];
	8 -> 9 [label = z];
	9 [shape = doublecircle];
	4 -> 5 [label = e];
	5 -> 6 [label = f];
	6 [shape = doublecircle];
	10 -> 11 [label = e];
	11 -> 12 [label = f];
	12 [shape = doublecircle];

After passing this dot file through Graphviz, we get:

Creating a RubyGem Package

Ruby Gems are the standard way of distributing Ruby applications and library classes. Information on RubyGems can be found at

Packaging a Ruby application or library class into a Gem package is fairly straightforward but finding documentation on it wasn’t. The Gem Spec Reference was useful, but didn’t really provide everything that was needed to build a gem.

After some searching, I came across the a more concise guide.

C++ Regular Expression Performance

A previous post gave some examples to illustrate how to use regular expressions as defined by the upcoming C++0x standard. The new regex features are certainly easy to use but what about the performance of some of the current implementations?

To find out, I compared the performance of the Micirosoft Visual Studio 2008 implementation against the regex implementation from the Boost library v1.36, and also against the PCRE v7.7 library. Microsoft introduced C++0x regex support in Visual Studio 2008 feature pack #1 and subsequently improved regex performance in Visual Studio 2008 service pack #1. Both the fp1 and sp1 implementations were included in the comparison analysis. The Boost library was chosen for comparison since this was the initial C++ regex implementation that was the basis for the proposal for inclusion into C++0x. PCRE was chosen for comparison since this is a very popular and in many respects the defacto regular expression library for use in C/C++ applications.

The regular expression patterns and data strings to use for comparison were taken from here. Three patterns of various complexity were selected and each of these was run against a set of five content strings. Each pattern/string search combination was run for 10000 iterations and the elapsed time was measured for each combination. The code used to perform the test is available here. Multiple runs were conducted to ensure that the results were consistent across each run. Once all the data was gathered, the results were normalized against libPcre – hence all the graphs show libPcre at a value of 1. By taking this normalization approach it is easier to compare the relative performance of each implementation.

ID Patterns Content Strings
1 ^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)
2 usd [+-]?[0-9]+.[0-9][0-9]
3 \b(\w+)(\s+\1)+\b usd 1234.00
4   he said she said he said no
5   same same same

The results are shown for each of the three patterns in the graphs below – one per pattern. The numbers along x-axis in each graph corresponds to the strings numbered 1-5 from the table above.


Regex Benchmark - Pattern #1


RegEx Benchmark - Pattern #2


RegEx Benchmark - Pattern #3

Based on the data shown in the graphs, a number of observations can be made:

  1. libPCRE v7.7 has the best performance for all pattern / string combinations.
  2. Performance varies significantly for different patterns. For example, for pattern #1, the boost implementation was faster than either of the Visual Studio 2008 (vc9) versions, but was slower for patterns #2 and #3. This seems to suggest that the Visual Studio 2008 implementations show relatively poor performance for patterns that contain alterations and a number of groups and nested groups.
  3. Based on patterns #1 and #2 the Visual C++ regex implementation performance has improved as claimed between feature pack 1 (fp1) and service pack 1 (sp1) but this is not completely true as can be seen from the pattern #3 results.

While the C++0x regex library makes it easier to use regular expressions in C++ code, there are still some performance gaps between the Visual Studio 2008 implementation and the performance of existing regular expression libraries such as libPCRE.

Note: The data shown in this article is based on a small sample of regular expression patterns and content strings and so the results should be taken in that context.

Multi-thread scaling issues with Python and Ruby

With the advent of multi-core processors, CPU bound applications need to use multi-threading in order to be able to scale their performance beyond that offered by a single core. This provides many challenges, but an interesting aspect of this problem is to consider how the threading modules in modern programming languages such as Python and Rubycan either help or hinder this scalability. Yes, there are plenty of other programming languages in use today, but Python and especially Ruby are rapidly rising in popularity and there are some surprising limitations to be aware of when using their threading packages.

The standard C implementation of Ruby 1.x (current version: 1.9) implements threading as green threading, where all threads are serviced by a single OS level thread and the Ruby runtime has full control over the thread life cycle. As described in the Ruby wiki, Ruby’s thread scheduler is a simple cooperative timeslicing scheduler with control switching to another thread if certain well defined keywords or events are encountered. There is also a 10ms timeout period to prevent too many context switches occurring (i.e. in general a max of 1 context switch every 10ms).

This use of green threads imposes severe scaling restrictions for Ruby applications that are CPU bound since the use of a single native OS thread limits the Ruby application to run on a single CPU core. IO bound Ruby applications can employ threading to a certain extent to parallelize waiting on IO operations but even this is limited by the 10ms minimum context switch time which has the effect of limiting the number of threads that can run within a Ruby application. Due to this limitation, scalability of Ruby applications appears to be solved today by splitting the application and running it in multiple processes which can then be run on different cores.

There is some hope in store though in that using native OS threads instead of green threads is being considered for Ruby 2.0 and there are some implementations of Ruby such as JRubywhich currently implement Ruby threads using native OS threads (via Java though for JRuby).

In contrast to Ruby, Python threads are implemented using native OS threads and so it is possible for different Python threads within a single application to run on different cores on a multi-core processor under the control of the OS scheduler. However, Python threading has a serious limitation in the form of the Global Interpreter Lock(GIL). This is a global lock that must be held by the current thread before it can safely access Python objects and only the thread that has acquired the global interpreter lock may operate on Python objects or call Python/C API functions. In order to support multi-threaded python programs, the interpreter regularly releases and reacquires the lock – by default, every 100 bytecode instructions. C extensions can release and reacquire the lock using the Python API and so this offers some relief, but the lock must be acquired before the state of any Python object is accessed.

Similar to Ruby, this GIL effectively limits the performance of CPU bound Python applications to that of a single CPU core (since only one Python thread can run at a time). Scalability is available for IO bound applications as these can easily scale across cores and the “one at a time” model of the GIL does not significantly restrict the performance of threads that are highly IO bound. Some relief is available by being able to implement performance and lock optimized C extensions but this is very restrictive and cumbersome – certainly a lot harder than writing some Python code.

Given this serious restriction of the Python threading model, you would expect it to be possible to replace the GIL with more fine grained locking, but apparently it has been tried and there are some reasons why we can’t get rid of the global interpreter lock. When fine grained locking was tried as a patch to Python 1.5, a 2x slowdown was observed. The slowdown was attributed to the overhead of the acquiring/releasing the OS locks. This patch hasn’t been maintained for subsequent versions of Python. Another patch that is gaining popularity and actively being maintained is python-safethread. This is a set of Python extensions that is “intended to provide safe, easy, and scalable concurrency mechanisms. It focuses on local concurrency, not distributed or parallel programs.” While it is not yet part of the Python mainline but it is certainly a promising solution to the GIL issue.

Update: Thanks to Adam Olsen for pointing me towards python-safethread as a possible solution to the GIL.

Regular Expressions in C++

Up to now, regular expression support in C/C++ programs was achieved using third party or open source regular expression libraries such as the PCRE library. With the addition of regex support to the C++ standard library as part of the C++0x standard update, using regular expressions in C++ programs has become much simpler. This feature is included in the TR1 draft reportwhich has already been implemented in some popular compilers such as gcc and Visual Studio 2008 (as part of service pack 1).  

Six regular expression grammars will be supported in C++0x. The default is based upon the ECMAScript grammar specified in ECMA-262. This syntax is based upon the PCRE syntax and is used by languages such as Perl, Python and Ruby which also provide built in regular expression support. Other supported grammars include the POSIX regex syntax, and the syntaxes used in tools such as awk, grep and egrep.

Here are some examples that illustrate how to perform some basic tasks with the new C++ regex component of the standard library.

Header files and namespaces:

#include <regex>

using namespace std::tr1;

Finding a match:

regex rgx("ello");
assert(regex_search("Hello World", rgx));

The above example illustrates the construction of a regex object, with the regex pattern being passed as a parameter to the regex constructor. The regex object is a specialization of the basic_regex template for working with regular expressions which are provided using sequences of chars. The regex_search()function template is then used to see if the “Hello world” string contains the “ello” pattern. This function returns true as soon as the first matching substring is found. The regex_search()function is also overloaded to provide versions that take sequence iterators as params (instead of a full string) and also versions that provide additional info on the match results.

Note: The use of assert() in the examples is used to highlight the “contract” provided by the api – e.g. to highlight if a function can be used in a conditional expression and if the function should return true or false for the particular example.

Finding an exact match:
The regex_match() function template is an alternative to regex_search() and is used when the target sequence must exactly match the regular expression.

regex rgx("ello");
assert(regex_match("Hello World", rgx) == false);
assert(regex_match("ello", rgx));

Finding the position of a match:
The sub_match or match_results template is used to receive search results from regex_search(). When searching char data, the library provides a ready made specialization of match_results called cmatch.

regex rgx("llo");
cmatch result;
regex_search("Hello World", result, rgx);
cout << "Matched \"" << result.str()
    << "\" after \"" << result.prefix()
    << "\" at offset: " << result.position()
    << " with length: " << result.length()
    << endl;

Working with capture groups:
Capture groups provide a means for capturing matched regions within a regular expression. Each captured region is represented by a sub_match template object. The smatch specialization of match_results is provided by the library for working with sequences of string sub-matches.

string seq = "";
regex rgx("(.*)@(.*)");
smatch result;
regex_search(seq, result, rgx);
for(size_t i=0; i<result.size(); ++i)
    cout << result[i] << endl;

Case insensitive searches:

regex rgx("ello", regex_constants::icase);
assert(regex_search("HELLO WORLD", rgx));

Running Functions as Threads in Python

Suppose you have a function in some Python code that you want to run as a thread. How do you do it? The simplest way is via the thread module and its start_new_thread() method. This is illustrated in the following example.

import thread

def someFunc():
    print "someFunc was called"

thread.start_new_thread(someFunc, ())

This approach has a limitation in that once the start_new_thread() function is called, it is not possible to find out when the thread has finished or to wait for completion of the thread. This may be acceptable for many applications but may be too restrictive for others. Note that function parameters can also be passed in a tuple (use an empty tuple if there are no parameters) as the second argument to start_new_thread().

Python also provides the threading module which implements a layer on top of the thread module. The threading module provides, among other things, a Thread class which contains a run() method. Typical usage is to subclass the Thread class and override the run() method in the subclass to implement the desired functionality. The Thread class also provides start() and join() methods to control the starting of a thread and to provide a mechanism for waiting until the thread has finished execution (i.e. the end of run() method is reached).

Without sub-classing, it is possible to pass a function or other callable object to the Thread class constructor to specify the target that the run() method will call. This is illustrated below.

import threading

t1 = threading.Thread(target=someFunc)

This approach works well for providing a mechanism for waiting for the thread to complete. A drawback to this approach though is that it is not possible to pass any arguments to the function supplied as the thread’s target.

If waiting for thread completion and argument passing is required it is necessary to provide a subclass of Thread. The function to execute and its arguments can be passed to the subclass constructor. This is illustrated below:

import threading

class FuncThread(threading.Thread):
    def __init__(self, target, *args):
        self._target = target
        self._args = args

    def run(self):

# Example usage
def someOtherFunc(data, key):
    print "someOtherFunc was called : data=%s; key=%s" % (str(data), str(key))

t1 = FuncThread(someOtherFunc, [1,2], 6)

Demystifying the volatile keyword

Following on from my earlier post about the restrict keyword, I’d like to try and dispell the myth around the volatile keyword. The meaning of volatile is a popular interview question, particularly for embedded development jobs and I’ve heard some programmers describe the properties of this keyword as if it had super powers.

The volatile keyword does a very simple job. When a variable is marked as volatile, the programmer is instructing the compiler not to cache this variable in a register but instead to read the value of the variable from memory each and every time the variable is used. That’s it – simple isn’t it?

To illustrate the use of the keyword, consider the following example:volatile int* vp = SOME_REGISTER_ADDRESS;
for(int i=0; i<100; i++)     foo(*vp);[/sourcecode] In this simple example, the pointer vp points to a volatile int. The value of this int is read from memory for each loop iteration. If volatile was not specified then it is likely that the compiler would generate optimized code which would read the value of the int once, temporarily store this in a register and then use the register copy during each iteration. Examples of where volatile is often used:

  • When accessing hardware registers via pointers. It is necessary for the generated code to always access the hardware registers value and never a potentially out of date copy.
  • When accessing a variable that is shared between two or more threads or between a thread and an ISR.

Common myths about the volatile keyword:

  • A volatile variable will never reside in cache memory – e.g. within the L2 cache of a processor. This is not true. In the case where the volatile variable is shared between two software threads, it is highly likely that the variable will exist in cached memory and the cache coherency policy will ensure that the threads will see the correct and up-to-date value of the variable in the event that the threads are running on seperate cores that don’t share the same cache. When the volatile variable refers to a hardware register, it is likely that the memory map of the system will be setup such that this register is not cacheable in the processor cache, but this is setup and managed by the appropriate device driver and is not something that is provided by the volatile keyword.

Thats’ it. So the next time you hear somebody waxing lyrical about the super powers of the volatile keyword, please feel empowered to enlighten them!