Cryptography with Ruby

When looking at cryptographic support in Ruby I was pleased to see the breadth of cryptographic features provided by the openssl module. However, I was also surprised to see that performing simple operations such as encrypting or hashing a string required a number of interactions with the openssl module. Furthermore, using the API correctly and securely to perform these simple operations requires some care.

The Gibberish gem provides a nice wrapper around openssl and presents a very clean and simple API for performing simple operations such as AES encryption/decryption, digest generation and HMAC generation.

For example, encrypting some text with Gibberish:

# AES encryption (defaults to AES-256-CBC
cipher = Gibberish::AES.new("p4ssw0rd")
cipher.enc("Some top secret data")

Compared to the same operation with openssl:

c = OpenSSL::Cipher::Cipher.new("aes-256-cbc")
c.encrypt
salt = generate_salt
c.pkcs5_keyivgen("p4ssw0rd", salt, 1)
e = c.update("Some top secret data") + c.final
e = "Salted__#{salt}#{e}" #OpenSSL compatible

Some of the problems related to correctly using the openssl library has arisen from poor documentation in earlier versions of Ruby. Eric Hodel has put a lot of effort into improving this documentation in the latest versions of Ruby.

Further searching for Ruby cryptographic libraries yields a pure ruby implementation called crypt and another wrapper around openssl called ezcrypto. Crypt performance is lower than openssl due to the pure ruby implementation and the last update was a good while ago. Ezcrypto also seems functional but I prefer the style of the Gibberish API.

In summary, Gibberish seems to be the best option for common cryptographic operations with openssl as the fallback for any features not exposed through the Gibberish wrapper.

Passing options in Ruby

A neat idiom for passing options to a class in Ruby …

class WithOptions
  def initialize(opts = {})
    @options = {
      :debug => false
    }.merge(opts)
  end
end
foo = WithOptions.new(:debug => true)

# or ...

other_opts = {
  :debug => true,
  :debug_level => :WARN
}
foo = WithOptions.new(other_opts)

Any options passed will be merged with the default options setup in the class. Options with the same name will override the default options. The same technique can be used with methods.

An alternative approach is to specify the defaults directly in the default options hash parameter.

def some_method(opts = {:debug => false})
  puts opts[:debug]
end

CodeKata – Kata Eighteen Solution

Kata Eighteen requires some code to calculate transitive dependencies, i.e. how dependencies propagate between things. Here is my solution.

class Dependencies

def initialize()
@direct = Hash.new
end

def add_direct(item, dependants)
@direct[item] = dependants
end

def dependencies_for(item)
dependants = []
toBeProcessed = @direct[item].clone
while not toBeProcessed.empty?
x = toBeProcessed.shift
if x != item and not dependants.include?(x)
dependants << x toBeProcessed.concat(@direct[x]) if @direct.include?(x) end end return dependants.sort! end end [/sourcecode]

CodeKata – Kata Five Solution

Kata five requires the implementation of a Bloom filter. Here is my solution in Ruby using a non-optimized bitmap implementation and a reduced MD5 digest for the hashing functions.

class BloomFilter

def initialize(dictionary, num_bits, num_hashes)
@bitmap = Bitmap.new(num_bits)
@hashes = Array.new(num_hashes) { |i| BloomHash.new(i, num_bits-1) }

# compute the hash values for each word in the dictionary
# and set the appropriate bits in the bitmap
File.new(dictionary, ‘r’).each_line do |word|
@hashes.each { |h| @bitmap[h.compute(word.chomp)] = 1 }
end
end

def search(word)
@hashes.each { |h| return false if @bitmap[h.compute(word)] == 0 }
return true
end

end

class BloomHash

def initialize(seed, max_value)
@offset = seed
@num_digits = 0
@max_value = max_value
while(max_value > 0) do
@num_digits += 1
max_value >>= 4
end
end

def compute(word)
digest = Digest::MD5.hexdigest(word)
return digest[@offset, @num_digits].to_i(16) % @max_value
end

end

class Bitmap < Array def initialize(size) super(size) self.fill(0) end end [/sourcecode]

CodeKata – Kata Two Solution

Dave Thomas maintains a blog of “Code Kata’s”. He describes a code kata as being a practice session for practicing coding skills:

Each is a short exercise (perhaps 30 minutes to an hour long). Some involve programming, and can be coded in many different ways. Some are open ended, and involve thinking about the issues behind programming. These are unlikely to have a single correct answer.

Here are two of my solutions for kata two using Ruby. The first uses recursion and the second uses an iterative approach.

Recursive solution …

# Recursive implementation
def chop(target, values)
  # Special handling for zero and single element arrays
  return -1 if values.empty?

  return ((target == values[0]) ? 0 : -1) if values.length == 1

  # Try the bottom half first
  pos = chop(target, values[0, values.length/2])
  return pos if pos != -1

  # Then the upper half ... remember that the returned 
  # position is relative to the middle of the array.
  pos = chop(target, values[values.length/2, values.length-1])
  return pos + (values.length/2) if pos != -1

  # Didn't find what we were looking for
  return -1
end

Iterative solution …

# Iterative implementation
def chop(target, values)
start = 0
stop = values.length # stop indexes one past the end of the array

# loop until something is found or until we
# run out of elements to search
while start != stop
# Find the middle
mid = start + (stop – start) / 2

# Check to see if we got lucky
return mid if values[mid] == target

# Not this time … so lets see which half it might be in
if target < values[pos] stop = mid else start = mid + 1 end end # Didn't find what we were looking for return -1 end [/sourcecode]

Regular Expressions for Counting Lines of Code

One of the applications that I implement as a means of gaining experience in different programming languages is an application to generate code metrics. I implement the application in the chosen language and use it to count comments, lines of code etc. in source files of a variety of programming languages. Implementing this application is a useful learning exercise since it provides exposure to a variety of language features such as file handling, regular expressions and also csv and html generation (for presenting the output statistics). Correctly using regular expressions for finding source code comments right can be a tricky thing and this is the focus of this post. A Ruby based implementation of this application, called countloc, is hosted on RubyForge for reference.

A basic set of code metrics include:

  • lines of code 
  • comment lines
  • blank lines
  • total lines

For comment lines, there are three types of comments to worry about:

  • single line comments
  • multi-line comments
  • mixed comments and code in a single line

For multiline comments, it is necessary to detect the beginning and the end of the comment. For a mixture of comments and code on a single line, it is necessary to distinguish between lines containing only comments so that the “lines of code” count can be incremented. The set of regular expressions used by countloc for finding comments in Ruby, Python, Java, C++, C, C# code is shown below. Note that languages such as C, C++, C# and Java all share similar commenting styles and they are covered under the cplusplus category. Note also that escaping is required for encoding the //, /* and */ characters for C++ comments in Ruby regular expressions.

COMMENT_PATTERNS = {
  :ruby => {
    :single_line_full => /^\s*#/,
    :single_line_mixed => /#/,
    :multiline_begin => /=begin(\s|$)/,
    :multiline_end => /=end(\s|$)/,
    :blank_line => /^\s*$/
  },

  :python => {
    :single_line_full => /^\s*#/,
    :single_line_mixed => /#/,
    :multiline_begin => /^\s*"""/,
    :multiline_end => /"""/,
    :blank_line => /^\s*$/
  },

  :cplusplus => {
    :single_line_full => /^\s*\/\//, 
    :single_line_mixed => /\/\//,
    :multiline_begin => /\/\*/,
    :multiline_end => /\*\/\s*$/,
    :multiline_begin_mixed => /^[^\s]+.*\/\*/,
    :multiline_end_mixed => /\*\/\s*[^\s]+$/,
    :blank_line => /^\s*$/
  }
}

As each line of source code is read, the above regular expressions are evaluated in the following order with the convention that as soon as a match is found, no further regular expressions need to be evaluated … unless otherwise noted.

  1. multiline_end_mixed (if within a multiline comment) 
  2. multiline_end (if within a multiline comment)
  3. multiline_begin … then check for multiline_begin_mixed and multiline_end (to check for a complete  multiline comment on a single line!)
  4. single_line_full
  5. single_line_mixed
  6. blank_line

Based on the countloc implementation and the code samples used to test it, this order appears to be sufficient. However, I sometimes find a corner case not covered by the regular expressions or evaluation order given above, so I am interested in any feedback with suggested improvements or optimizations.

Creating a RubyGem Package

Ruby Gems are the standard way of distributing Ruby applications and library classes. Information on RubyGems can be found at http://www.rubygems.org.

Packaging a Ruby application or library class into a Gem package is fairly straightforward but finding documentation on it wasn’t. The Gem Spec Reference was useful, but didn’t really provide everything that was needed to build a gem.

After some searching, I came across the a more concise guide.

Multi-thread scaling issues with Python and Ruby

With the advent of multi-core processors, CPU bound applications need to use multi-threading in order to be able to scale their performance beyond that offered by a single core. This provides many challenges, but an interesting aspect of this problem is to consider how the threading modules in modern programming languages such as Python and Rubycan either help or hinder this scalability. Yes, there are plenty of other programming languages in use today, but Python and especially Ruby are rapidly rising in popularity and there are some surprising limitations to be aware of when using their threading packages.

RUBY
The standard C implementation of Ruby 1.x (current version: 1.9) implements threading as green threading, where all threads are serviced by a single OS level thread and the Ruby runtime has full control over the thread life cycle. As described in the Ruby wiki, Ruby’s thread scheduler is a simple cooperative timeslicing scheduler with control switching to another thread if certain well defined keywords or events are encountered. There is also a 10ms timeout period to prevent too many context switches occurring (i.e. in general a max of 1 context switch every 10ms).

This use of green threads imposes severe scaling restrictions for Ruby applications that are CPU bound since the use of a single native OS thread limits the Ruby application to run on a single CPU core. IO bound Ruby applications can employ threading to a certain extent to parallelize waiting on IO operations but even this is limited by the 10ms minimum context switch time which has the effect of limiting the number of threads that can run within a Ruby application. Due to this limitation, scalability of Ruby applications appears to be solved today by splitting the application and running it in multiple processes which can then be run on different cores.

There is some hope in store though in that using native OS threads instead of green threads is being considered for Ruby 2.0 and there are some implementations of Ruby such as JRubywhich currently implement Ruby threads using native OS threads (via Java though for JRuby).

PYTHON
In contrast to Ruby, Python threads are implemented using native OS threads and so it is possible for different Python threads within a single application to run on different cores on a multi-core processor under the control of the OS scheduler. However, Python threading has a serious limitation in the form of the Global Interpreter Lock(GIL). This is a global lock that must be held by the current thread before it can safely access Python objects and only the thread that has acquired the global interpreter lock may operate on Python objects or call Python/C API functions. In order to support multi-threaded python programs, the interpreter regularly releases and reacquires the lock – by default, every 100 bytecode instructions. C extensions can release and reacquire the lock using the Python API and so this offers some relief, but the lock must be acquired before the state of any Python object is accessed.

Similar to Ruby, this GIL effectively limits the performance of CPU bound Python applications to that of a single CPU core (since only one Python thread can run at a time). Scalability is available for IO bound applications as these can easily scale across cores and the “one at a time” model of the GIL does not significantly restrict the performance of threads that are highly IO bound. Some relief is available by being able to implement performance and lock optimized C extensions but this is very restrictive and cumbersome – certainly a lot harder than writing some Python code.

Given this serious restriction of the Python threading model, you would expect it to be possible to replace the GIL with more fine grained locking, but apparently it has been tried and there are some reasons why we can’t get rid of the global interpreter lock. When fine grained locking was tried as a patch to Python 1.5, a 2x slowdown was observed. The slowdown was attributed to the overhead of the acquiring/releasing the OS locks. This patch hasn’t been maintained for subsequent versions of Python. Another patch that is gaining popularity and actively being maintained is python-safethread. This is a set of Python extensions that is “intended to provide safe, easy, and scalable concurrency mechanisms. It focuses on local concurrency, not distributed or parallel programs.” While it is not yet part of the Python mainline but it is certainly a promising solution to the GIL issue.

Update: Thanks to Adam Olsen for pointing me towards python-safethread as a possible solution to the GIL.