Minimize Context Switches Through Priority Assignment

A context switch is often one of the most expensive operations in an RTOS (Real-Time Operating System). Application developers should therefore aim to keep the number of context switches low in order to reduce RTOS overhead.

In a pre-emptive RTOS, both the choice of thread priority assignment and the order in which threads become available to run are key to minimizing the number of context switches that occur.

In a series of articles on context switching, William E. Lamie presents data showing that context-switching overhead is minimized when all threads are assigned the same priority. In other words, assigning the same priority to all threads maximizes system throughput: the fewer cycles spent on context-switching overhead, the more are available for application processing.

However, in real-time systems latency is often a concern. Where latency matters, the thread that runs the latency-sensitive task should be assigned a higher priority. This introduces additional context switches into the system, which reduces throughput.

To achieve the best of both worlds, assign all threads an equal priority by default and raise the priority only of latency-sensitive tasks.

Multi-thread scaling issues with Python and Ruby

With the advent of multi-core processors, CPU bound applications need to use multi-threading in order to scale their performance beyond that offered by a single core. This presents many challenges, but an interesting aspect of the problem is how the threading modules in modern programming languages such as Python and Ruby can either help or hinder this scalability. Yes, there are plenty of other programming languages in use today, but Python and especially Ruby are rapidly rising in popularity and there are some surprising limitations to be aware of when using their threading packages.

RUBY
The standard C implementation of Ruby 1.x (current version: 1.9) implements threads as green threads, where all threads are serviced by a single OS-level thread and the Ruby runtime has full control over the thread life cycle. As described in the Ruby wiki, Ruby’s thread scheduler is a simple cooperative time-slicing scheduler, with control switching to another thread when certain well-defined keywords or events are encountered. There is also a 10ms timeout period to prevent too many context switches from occurring (i.e. in general a maximum of one context switch every 10ms).

This use of green threads imposes severe scaling restrictions on Ruby applications that are CPU bound, since the use of a single native OS thread limits the Ruby application to a single CPU core. IO bound Ruby applications can employ threading to a certain extent to parallelize waiting on IO operations, but even this is constrained by the 10ms minimum context-switch interval, which effectively limits the number of threads that can usefully run within a Ruby application. Due to this limitation, scalability of Ruby applications today is typically achieved by splitting the application across multiple processes, which can then run on different cores.

There is some hope in store, though: using native OS threads instead of green threads is being considered for Ruby 2.0, and there are implementations of Ruby, such as JRuby, which already implement Ruby threads using native OS threads (via Java, in JRuby’s case).

PYTHON
In contrast to Ruby, Python threads are implemented using native OS threads, so it is possible for different Python threads within a single application to run on different cores of a multi-core processor under the control of the OS scheduler. However, Python threading has a serious limitation in the form of the Global Interpreter Lock (GIL). This is a global lock that a thread must hold before it can safely access Python objects or call Python/C API functions. In order to support multi-threaded Python programs, the interpreter regularly releases and reacquires the lock – by default, every 100 bytecode instructions. C extensions can release and reacquire the lock using the Python API, which offers some relief, but the lock must still be held whenever the state of any Python object is accessed.

Similar to Ruby, the GIL effectively limits the performance of CPU bound Python applications to that of a single CPU core (since only one Python thread can execute Python bytecode at a time). Scalability is still available for IO bound applications, as these can easily scale across cores and the “one at a time” model of the GIL does not significantly restrict the performance of threads that are highly IO bound. Some relief is available by implementing performance-critical, lock-optimized C extensions, but this is restrictive and cumbersome – certainly a lot harder than writing some Python code.
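
As a rough illustration of the effect on CPU bound code, here is a minimal sketch (timings are indicative only and will vary with the machine and Python version) that times a CPU bound function run twice in sequence and then in two threads; under the GIL the threaded run typically shows little or no speedup.

import threading
import time

def count():
    # Simple CPU bound work: count down from ten million
    n = 10000000
    while n > 0:
        n -= 1

# Run the work twice sequentially
start = time.time()
count()
count()
print "sequential: %.2f seconds" % (time.time() - start)

# Run the same work in two threads -- with the GIL in place,
# expect little or no improvement over the sequential run
start = time.time()
t1 = threading.Thread(target=count)
t2 = threading.Thread(target=count)
t1.start()
t2.start()
t1.join()
t2.join()
print "threaded  : %.2f seconds" % (time.time() - start)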

Given this serious restriction of the Python threading model, you might expect it to be possible to replace the GIL with more fine-grained locking, but this has been tried and there are some reasons why we can’t get rid of the global interpreter lock. When fine-grained locking was tried as a patch to Python 1.5, a 2x slowdown was observed. The slowdown was attributed to the overhead of acquiring and releasing the OS locks. This patch has not been maintained for subsequent versions of Python. Another patch that is gaining popularity and is actively maintained is python-safethread. This is a set of Python extensions that is “intended to provide safe, easy, and scalable concurrency mechanisms. It focuses on local concurrency, not distributed or parallel programs.” It is not yet part of the Python mainline, but it is certainly a promising solution to the GIL issue.

Update: Thanks to Adam Olsen for pointing me towards python-safethread as a possible solution to the GIL.

Running Functions as Threads in Python

Suppose you have a function in some Python code that you want to run as a thread. How do you do it? The simplest way is via the thread module and its start_new_thread() function. This is illustrated in the following example.

import thread

def someFunc():
    print "someFunc was called"

# Start someFunc in a new thread; the empty tuple means no arguments
thread.start_new_thread(someFunc, ())

This approach has a limitation: once start_new_thread() is called, there is no way to find out when the thread has finished or to wait for its completion. This may be acceptable for many applications but too restrictive for others. Note that function arguments are passed in a tuple as the second argument to start_new_thread() (use an empty tuple if there are no arguments).
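
For example, a function that takes arguments could be started as follows (a minimal sketch; the greet function and its argument values are illustrative only):

import thread
import time

def greet(name, count):
    for i in range(count):
        print "hello %s (%d)" % (name, i)

# The tuple passed as the second argument supplies greet's arguments
thread.start_new_thread(greet, ("world", 3))

# Give the new thread a moment to run before the main program exits
time.sleep(1)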

Python also provides the threading module, which implements a layer on top of the thread module. The threading module provides, among other things, a Thread class which contains a run() method. Typical usage is to subclass the Thread class and override the run() method in the subclass to implement the desired functionality. The Thread class also provides start() and join() methods to control the starting of a thread and to provide a mechanism for waiting until the thread has finished execution (i.e. until the end of the run() method is reached).

Without sub-classing, it is possible to pass a function or other callable object to the Thread class constructor to specify the target that the run() method will call. This is illustrated below.

import threading

t1 = threading.Thread(target=someFunc)
t1.start()
t1.join()

This approach works well for waiting on the thread to complete. As written, though, no arguments are passed to the target function; the Thread constructor also accepts args and kwargs parameters for this purpose.
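
For example, arguments can be supplied through the constructor’s args tuple (a minimal sketch; worker and its argument values are illustrative only):

import threading

def worker(data, key):
    print "worker called : data=%s; key=%s" % (str(data), str(key))

# args supplies the positional arguments passed to the target
t1 = threading.Thread(target=worker, args=([1, 2], 6))
t1.start()
t1.join()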

An alternative, which bundles the function to execute together with its arguments and makes it easy to extend the thread’s behaviour later, is to subclass Thread and pass them to the subclass constructor. This is illustrated below:

import threading

class FuncThread(threading.Thread):
    def __init__(self, target, *args):
        # Initialise the base Thread first, then store the target
        # function and its arguments for use in run()
        threading.Thread.__init__(self)
        self._target = target
        self._args = args

    def run(self):
        # Invoked in the new thread when start() is called
        self._target(*self._args)

# Example usage
def someOtherFunc(data, key):
    print "someOtherFunc was called : data=%s; key=%s" % (str(data), str(key))

t1 = FuncThread(someOtherFunc, [1,2], 6)
t1.start()
t1.join()

Effective Concurrency … so far

Following on from a previous post about The Pillars of Concurrency, here are links to the full set of Herb Sutter’s “Effective Concurrency” column topics (as of Feb 2008) …

The Pillars of Concurrency
How Much Scalability Do You Have or Need?
Use Critical Sections (Preferably Locks) to Eliminate Races
Apply Critical Sections Consistently
Avoid Calling Unknown Code While Inside a Critical Section
Use Lock Hierarchies to Avoid Deadlock
Break Amdahl’s Law!
Going Superlinear

This is an excellent set of articles on the subject of concurrency and programming for multi-core, analyzing the many facets of the subject – from background theory, to locking, and on to achieving scalability and performance.

Pillars of Concurrency

An excellent article by Herb Sutter on decomposing and categorizing concurrency traits … http://www.ddj.com/dept/cpp/200001985

The three “pillars” identified in the article:

  1. Responsiveness and Isolation Via Asynchronous Agents
  2. Throughput and Scalability Via Concurrent Collections
  3. Consistency Via Safely Shared Resources

An interesting reference from this article is to an earlier article by Herb illustrating why lock-based programming is hard and insufficient: http://www.ddj.com/dept/cpp/184401930