For example, Intel quad-core processors are now commonplace, and the company promises an 80-core one in four years' time.
What does this mean to us, software engineers?
Until now, we enjoyed a "free lunch" of more speed whenever a new CPU with a higher clock speed was released, without changing a single line of code.
But this is no longer true with multi-core: upgrading to an N-core CPU doesn't automatically mean that the same application will run N times faster.
The application has to be split explicitly into several threads, which could be launched simultaneously on different cores to take advantage of the parallel computation power.
I was looking into how this will affect the way we design our applications in TID.
The most common way to design applications to take advantage of parallel computing power is to use threads.
So we pick a threading library, such as pthreads. Then we redesign the application from a serial to a parallel work-flow using these threads. Sounds good.
But there are problems with this approach: concurrency issues, synchronization issues, and scalability issues.
While concurrency and synchronization can be handled by careful design, scalability cannot. If we get a new CPU with more cores, we need to increase the number of threads
to take advantage of them. That means re-writing the code MANUALLY, with ASSUMPTIONS baked in about the underlying hardware.
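To make the scalability problem concrete, here is a minimal sketch of the manual approach, using std::thread as a portable stand-in for pthreads. The names (sum_slice, parallel_sum) and the array-summing workload are my own, purely for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// NUM_THREADS is baked in at build time -- this is the scalability
// problem: a CPU with more cores sits idle unless we change this
// constant and rebuild.
const int NUM_THREADS = 2;

// Each thread sums one contiguous slice of the input.
void sum_slice(const std::vector<int>& data, size_t begin, size_t end,
               long long& result) {
    long long local = 0;
    for (size_t i = begin; i < end; ++i) local += data[i];
    result = local;
}

long long parallel_sum(const std::vector<int>& data) {
    std::vector<std::thread> threads;
    std::vector<long long> partial(NUM_THREADS, 0);
    const size_t chunk = data.size() / NUM_THREADS;
    for (int t = 0; t < NUM_THREADS; ++t) {
        size_t begin = t * chunk;
        size_t end = (t == NUM_THREADS - 1) ? data.size() : begin + chunk;
        threads.emplace_back([&data, begin, end, &partial, t] {
            sum_slice(data, begin, end, partial[t]);
        });
    }
    for (auto& th : threads) th.join();
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```

The decomposition is wired to NUM_THREADS by hand; on a machine with more cores, nothing improves until that constant is changed and the program rebuilt.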
While playing with this, I came across the open-source Threading Building Blocks from Intel. ( http://www.threadingbuildingblocks.org/ )
This library (known as TBB) is based on C++ templates and was developed by Intel for writing programs that take advantage of multi-core processors.
It consists of data structures and algorithms that let a programmer avoid some of the complications arising from the use of native threading packages, such as the POSIX threads and Windows threads mentioned above.
TBB abstracts access to the multiple processors by allowing the operations to be treated as "tasks," which are allocated to individual cores dynamically by the library's run-time engine, and by automating efficient use of the cache. This approach decouples the programming from the particulars of the underlying machine.
The library provides "template methods" that you call; it takes care of creating the optimum number of threads based on the available CPU cores, and balances the load among them.
To test this claim by Intel, I wrote a simple image-addition application and ran it on my Pentium Dual Core.
The idea was to add two large images (1200x1200) pixel by pixel, producing a third image.
First I did it in one thread, using a nested for loop, and then I did it by using TBB, which I downloaded from the site above.
(Refer to the code in main.C, enclosed)
The Serial way :
for( int j = 0; j < height; j++ ) {
    for( int i = 0; i < width; i++ ) {
        // sum pixels from ImageA and ImageB, store to ImageC
    }
}
Now the Parallel way, using TBB:
parallel_for( blocked_range<size_t>( 0, height, 100 ),
              ImageSummer( imA, imB, imD ) );
...
Here,
- parallel_for is a template function that launches parallel tasks over a range (0 up to height), dividing it into sub-ranges with a granularity of 100.
- ImageSummer is a C++ class that you implement with an overloaded () operator, which contains the meat of each task's work.
void ImageSummer::operator() ( const blocked_range<size_t> &r ) const {
    for ( size_t j = r.begin(); j != r.end(); j++ ) { // iterates over the entire chunk
        for( int i = 0; i < width; i++ ) {
            // sum pixels from ImageA and ImageB, store to ImageC
        }
    }
}
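To see what the parallel_for call does conceptually, here is a simplified, self-contained stand-in I wrote for illustration: it splits a range into chunks of at most `grain` elements and hands each chunk to a body object, the way TBB hands blocked_range sub-ranges to ImageSummer. (Real TBB schedules the chunks as tasks on a fixed worker pool, not one thread per chunk as here.)

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stripped-down stand-in for tbb::blocked_range<size_t>.
struct BlockedRange {
    size_t begin_, end_;
    size_t begin() const { return begin_; }
    size_t end() const { return end_; }
};

// Stripped-down stand-in for tbb::parallel_for: split [first, last)
// into chunks of at most `grain` and run the body over each chunk in
// its own thread.
template <typename Body>
void simple_parallel_for(size_t first, size_t last, size_t grain,
                         const Body& body) {
    std::vector<std::thread> workers;
    for (size_t lo = first; lo < last; lo += grain) {
        size_t hi = std::min(lo + grain, last);
        workers.emplace_back([&body, lo, hi] { body(BlockedRange{lo, hi}); });
    }
    for (auto& w : workers) w.join();
}

// A body with the same shape as ImageSummer, summing two flat
// arrays into a third (imA + imB -> imC).
struct ArraySummer {
    const int *imA, *imB;
    int* imC;
    void operator()(const BlockedRange& r) const {
        for (size_t i = r.begin(); i != r.end(); ++i)
            imC[i] = imA[i] + imB[i];
    }
};
```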
The timing results were :
serial : 1502665626 nanoseconds
TBB : 829685710 nanoseconds, i.e., faster by about 670 ms (0.67 seconds), roughly a 1.8x speedup on two cores.
I verified with gnome-system-monitor that these tasks are indeed launched on two cores: in the parallel case I saw activity peaks on both cores, while in the serial case only one CPU peaked.
parallel_for is just one of the many such abstractions available in this library. The templates on offer include:
parallel_for
parallel_while
parallel_reduce
pipeline
parallel_sort
parallel_scan
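As a taste of another pattern: parallel_reduce combines a body that accumulates over a sub-range with a join() step that merges partial results. The sketch below is my own toy, single-threaded runner, showing only the shape of the pattern; TBB would execute the split halves as concurrent tasks:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A body in the shape tbb::parallel_reduce expects: operator()
// accumulates over a sub-range; join() merges another body's
// partial result into this one.
struct SumBody {
    const int* data;
    long long total;
    void operator()(size_t begin, size_t end) {
        for (size_t i = begin; i < end; ++i) total += data[i];
    }
    void join(const SumBody& rhs) { total += rhs.total; }
};

// Toy runner: split the range in two, reduce each half, then join.
// TBB would run the two halves as concurrent tasks on separate cores.
void simple_reduce(SumBody& body, size_t begin, size_t end, size_t grain) {
    if (end - begin <= grain) {
        body(begin, end);
        return;
    }
    size_t mid = begin + (end - begin) / 2;
    SumBody right{body.data, 0};  // mirrors TBB's "splitting constructor" idea
    simple_reduce(body, begin, mid, grain);
    simple_reduce(right, mid, end, grain);
    body.join(right);
}
```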
More details of these can be found in the website mentioned above.
Under the hood, TBB implements "task stealing" to balance a parallel workload across available processing cores in order to increase core utilization and therefore scaling. The TBB task stealing model is similar to the work stealing model applied in Cilk. Initially, the workload is evenly divided among the available processor cores. If one core completes its work while other cores still have a significant amount of work in their queue, TBB reassigns some of the work from one of the busy cores to the idle core. This dynamic capability DECOUPLES the programmer from the machine, allowing applications written using the library to scale to utilize the available processing cores with no changes to the source code or the executable program file.
I am still trying to get to grips with the other features provided. But so far it sounds interesting.
In summary, my take on this is as follows:
Pros:
- Excellent for abstracting multi-threading for multi-core CPUs and avoiding pitfalls at the same time
- Comes from Intel, and hence is fine-tuned for Intel processors.
- code automatically scales to number of cores, without recompiling.
- Is open source, with licensing similar to the C++ STL; a commercial version with no functional difference is also available.
Cons:
- Template-based and C++-based, though you don't have to be a C++ expert to use it.
- Executables might therefore be bulky, and debugging might not be as straightforward as with plain C.
But without some pain, there is no gain, no?