Monday, 10 October 2011

Multi-core software and Thread Building Blocks from Intel

As processor clock speeds near their physical limits, CPU vendors have turned to increasing the number of cores.
For example, Intel quad-core processors are now commonplace, and the company has promised an 80-core chip within four years.

What does this mean to us, software engineers?
Until now, we enjoyed a "free lunch" of more speed whenever a new CPU with a higher clock speed was released, without changing a single line of code.
But this is no longer true for multi-core: upgrading to an N-core CPU does not automatically make the same application run N times faster.
The application has to be split explicitly into several threads, which could be launched simultaneously on different cores to take advantage of the parallel computation power.

I was looking into how this will affect the way we design our applications in TID.

The most common way to design applications that exploit parallel computing power is to use threads.
So we pick a threading library, such as pthreads, and redesign the application from a serial work-flow into parallel ones using these threads. Sounds good.

But this approach brings concurrency, synchronization, and scalability problems.
While concurrency and synchronization issues can be solved by clever design, scalability cannot. If we get a new CPU with more cores, we need to increase the number of threads
to take advantage of those cores. This means re-writing the code MANUALLY, with ASSUMPTIONS about the underlying hardware.

While playing with this, I came across the open-source Threading Building Blocks from Intel. ( http://www.threadingbuildingblocks.org/ )

This is a C++ template library (also known as TBB) developed by Intel for writing software that takes advantage of multi-core processors.
The library consists of data structures and algorithms that let a programmer avoid some of the complications that arise from using native threading packages such as POSIX threads or Windows threads directly, as described above.

TBB abstracts access to the multiple processors by allowing the operations to be treated as "tasks," which are allocated to individual cores dynamically by the library's run-time engine, and by automating efficient use of the cache. This approach decouples the programming from the particulars of the underlying machine.

The library provides template functions that you call; it takes care of creating the optimum number of threads for the available CPU cores and balances the load between them.

To test this claim by Intel, I made a simple application of Image Addition and ran it on my Pentium Dual Core.

The idea was to add two large images (1200×1200) pixel by pixel, producing a third image.
First I did it in one thread, using a nested for loop, and then I did it by using TBB, which I downloaded from the site above.
(Refer to the code in main.C, enclosed)

The serial way:

for ( int j = 0; j < height; j++ ) {
    for ( int i = 0; i < width; i++ ) {
        // sum pixels from ImageA and ImageB, store to ImageC
    }
}

Now the parallel way, using TBB:

parallel_for( blocked_range<int>(0, height, 100),
              ImageSummer( imA, imB, imC ) );
...

Here,
- parallel_for is a template function that launches parallel tasks over a range (0 up to height), dividing the range into sub-ranges with a grain size of 100.
- ImageSummer is a C++ class that you implement with an overloaded () operator, which contains the meat of the per-task code.

void ImageSummer::operator() ( const blocked_range<int>& r ) const {
    for ( int j = r.begin(); j != r.end(); j++ ) {   // iterates over this task's chunk
        for ( int i = 0; i < width; i++ ) {
            // sum pixels from ImageA and ImageB, store to ImageC
        }
    }
}

The timing results were:
serial : 1502665626 nanoseconds (about 1.50 s)
TBB : 829685710 nanoseconds (about 0.83 s), i.e. faster by roughly 673 ms, a speedup of about 1.8×.

I verified that the tasks are indeed launched on two cores using gnome-system-monitor: in the parallel case I saw two peaks (one per core), while in the serial case only one CPU peaked.


parallel_for is just one of the many such abstractions available in this library. Others include:
parallel_while
parallel_reduce
parallel_scan
parallel_sort
pipeline

More details of these can be found in the website mentioned above.

Under the hood, TBB implements "task stealing" to balance a parallel workload across available processing cores in order to increase core utilization and therefore scaling. The TBB task stealing model is similar to the work stealing model applied in Cilk. Initially, the workload is evenly divided among the available processor cores. If one core completes its work while other cores still have a significant amount of work in their queue, TBB reassigns some of the work from one of the busy cores to the idle core. This dynamic capability DECOUPLES the programmer from the machine, allowing applications written using the library to scale to utilize the available processing cores with no changes to the source code or the executable program file.


I am still trying to get to grips with the other features provided, but so far it looks interesting.
In summary, my take on this is as follows:

Pros:
- Excellent for abstracting multi-threading for multi-core CPUs and avoiding pitfalls at the same time
- Comes from Intel, and hence, fine-tuned for Intel processors.
- Code automatically scales to the number of cores, without recompiling.
- Open source, with licensing similar to the C++ STL's; a commercial version with no functional difference is also available.

Cons:
- Template based and C++ based (though you don't have to be a C++ expert to use it), so the executables might be bulky.
- Debugging might not be as straightforward as with plain C.

But no pain, no gain, no?





Monday, 6 June 2011

Latencytop : a tool for measuring latency at kernel level

Linux kernel 2.6.25 has a new feature called "support for latency measuring". (http://kernelnewbies.org/Linux_2_6_25)

Basically this means that, with a tool called Latencytop, we can measure process latencies (the time a process spends waiting
for access to different resources) and get a latency report per process and for the whole system. (http://latencytop.org/announce.php)

A sample output is pasted below:
Cause                          Maximum       Percentage
fsync() on a file              1351.0 msec   14.5 %
Deleting an inode                79.9 msec    0.9 %
Creating directory               30.5 msec    0.5 %
Scheduler: waiting for cpu       28.5 msec   55.8 %
Waiting for event (select)        5.0 msec   19.3 %
Waiting for event (poll)          5.0 msec    7.8 %
Userspace lock contention         4.4 msec    0.3 %
SCSI ioctl command                1.8 msec    0.1 %
Receiving TCP/IP data             1.1 msec    0.8 %


This can be used to identify bottlenecks in a system, and tune it appropriately.

I evaluated this tool and found it interesting.
Here are the steps I followed to get it working:

1. Download the vanilla kernel sources (2.6.25).
2. Enable the CONFIG_LATENCYTOP option in menuconfig (so that the kernel collects these statistics).
3. Build and boot the new kernel.
4. Download the latencytop source from http://latencytop.org/announce.php
5. make and make install latencytop.
6. Run sudo ./latencytop, and you get a report like the one above.

There is very little documentation for this tool, as it is very recent, so reading the program's source is the only way to understand some options.
I figured out that it basically keeps a list of all running processes and, in a round-robin fashion, periodically polls their latencies.
The default period is 30 seconds, but you can change the function update_display() in latencytop.c to get the polling period you want.

I guess this, together with the "latency queries" of GStreamer pipelines, could be useful in some way to trim down latencies in media pipelines. Any ideas?

Friday, 27 May 2011

Science in School

Science Day at Gerbert d'Orlhac school:

The desire to ask 'what would happen if?' is an innate quality that we all share. In many instances, satisfying this curiosity will require experiments to be designed and carried out, a process that is fundamental to almost all fields of research and innovation.

Experimentation is very much a hands-on activity that at school is usually associated with lessons in science and technology. Indeed, of the few memories we have of our lessons at school, many would involve some kind of science experiment such as growing crystals, building a radio, or looking at plants through a microscope.

We are "the innovation partners" - a group of people passionate about kindling the passion for science in schools.







Thursday, 5 May 2011

Gstreamer for Real Time priority tasks..

Since real-time support has been getting better and better in
Linux, I propose that GStreamer should take advantage of it.
This would help pipelines with time-critical or heavily loaded
elements that demand guaranteed CPU access.
It could be achieved by giving such plugins higher scheduling
priority than the others.

Hence, my proposal is to have a property called "priority" for all
elements. Then, pipelines like the following one could be launched:

gst-launch videotestsrc ! ffmpegcolorspace prio=90 ! xvimagesink

Here, the thread containing ffmpegcolorspace's buffer processing
method will get a higher priority than the rest.

Another example would be a camera capture plugin, where there is a
demand to capture frames with high accuracy.

A layered priority is also possible, as below.
gst-launch cameracapture prio=90 ! queue ! ffmpegcolorspace
prio=50 ! queue ! ffenc_h264 prio=50 ! ...


Implementation (I am sure there could be better ways, but this is what I tried):
----------------------------------------------------------------------------------------------------------------
The files modified are enclosed.

1. I installed a new property called prio in gstobject.c, after the
usual "name" property, in the method gst_object_class_init.
I also added a new field called _prio to the GstObject class.
Now all elements have a prio property.


2. In gstpad.c, in the method gst_pad_push_event (GstPad * pad,
GstEvent * event), I added:

GstObject *objPar = (GstObject *) GST_OBJECT_PARENT (pad);
if (objPar->_prio != 0) {
    printf ("########### Setting priority of '%s' running in thread %lx to %d ######## \n",
        objPar->name,
        (unsigned long) pthread_self (),
        objPar->_prio);

    struct sched_param param;
    memset (&param, 0, sizeof (param));
    param.sched_priority = objPar->_prio;
    if (sched_setscheduler (0, SCHED_FIFO, &param) == -1) {
        printf ("error in setting priority >>>>>>>>>>> \n");
    }
}

This applies the priority to the thread in which the element's
processing function runs.


Questions
---------------
- Is such a mechanism of different priorities useful?
I have been testing some pipelines on heavily loaded machines and see
a good improvement in the overall speed of the pipeline, compared to the
no-priority case.
However, it remains to be seen whether it introduces additional complexity.
Has anybody tested something like this?
Please give some suggestions...

-Rosh

Tuesday, 11 January 2011

CPU affinity Testing experiment





Summary:
A CPU affinity property has been added to the GStreamer plugins.
The objective of this test is to find out whether the plugins respect the affinity they are assigned.


Consider this simple pipeline:
gst-launch fakesrc ! sleep ! fakesink

          (sleep is a plugin with a busy wait inside its chain function.)
This gives the following CPU load distribution (red is one CPU, blue is the other):
http://rosh/images/cpu_simple_1.jpg

Now, consider two sources instead of one:
gst-launch fakesrc ! sleep ! fakesink  fakesrc ! sleep ! fakesink 



The graph looks like the following:
http://rosh/images/cpu_simple_2.jpg
Obviously, the kernel was intelligent enough to schedule the work onto two cores, and both CPUs were busy.

But now, with our CPU affinity setting, we can "force" the kernel to schedule both these plugins onto the same core,
as below:
gst-launch fakesrc ! sleep cpu_affinity=1 ! fakesink  fakesrc ! sleep cpu_affinity=1 ! fakesink 

Now the picture looks like this:
http://rosh/images/cpu_simple_3.jpg
http://rosh/images/cpu_simple_4.jpg

This means that we are able to schedule the plugins onto the core we want, using the cpu_affinity property we have added.
"sleep" could be replaced with any other plugin we may be interested in.

- Rosh
31 Oct 2008