2 votes

While writing demonstration code (matrix multiplication) for my students to show that one MUST use the cache correctly even in parallel code, I found that C++11 threads (via boost::thread) outperform OpenMP (parallel for) threads by more than a factor of 2!

The only explanation I can imagine is that with C++11 threads each thread always runs on the same core, so there is the possibility of keeping its data in that core's cache.

Could this be real?
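
One way to check the affinity hypothesis directly (a sketch of my own, not part of the linked demo; the worker body is a placeholder) is to pin each Boost thread to a fixed logical core with the Win32 call SetThreadAffinityMask and see whether the timings move:

    #include <windows.h>
    #include <boost/thread.hpp>
    #include <boost/bind.hpp>

    // Placeholder worker: the real body would compute one row block of C = A*B.
    void worker(int id)
    {
        // ... multiply rows [id*chunk, (id+1)*chunk) ...
    }

    int main()
    {
        const int nthreads = 4;
        boost::thread_group pool;
        for (int i = 0; i < nthreads; ++i) {
            boost::thread* t = pool.create_thread(boost::bind(&worker, i));
            // Pin thread i to logical core i so its working set can stay
            // in that core's private caches for the whole run.
            SetThreadAffinityMask(t->native_handle(), DWORD_PTR(1) << i);
        }
        pool.join_all();
        return 0;
    }

If pinning makes no difference, the gap between the two versions is probably not an affinity effect.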

The demo code is quite long (and boring: the same matrix initialization and multiplication repeated four times, in the correct, wrong (bad cache access), scalar, and OpenMP variants) and can be found at:

http://www.giuseppelevi.com/uploads/3/2/9/8/3298932/matrix_mul_boost_thread.cpp
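
The linked file is not reproduced here, but the cache effect being demonstrated is the classic loop-order one. A minimal sketch (my own illustration with placeholder names N, A, B, C, not the actual demo code) of the "wrong" versus "correct" access patterns for row-major storage:

    #include <cstddef>

    const std::size_t N = 512;
    static double A[N][N], B[N][N], C[N][N]; // row-major, zero-initialized

    int main()
    {
        // "Wrong" order (i-j-k): B[k][j] walks down a column, so successive
        // accesses jump N*sizeof(double) bytes and keep missing the cache.
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                for (std::size_t k = 0; k < N; ++k)
                    C[i][j] += A[i][k] * B[k][j];

        // "Correct" order (i-k-j): all three matrices are walked row-wise,
        // so consecutive iterations touch consecutive cache lines.
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t k = 0; k < N; ++k)
                for (std::size_t j = 0; j < N; ++j)
                    C[i][j] += A[i][k] * B[k][j];
        return 0;
    }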

The compiler I'm using is VS2010, with Boost 1.45, on an Intel Core i5 M430 processor. omp_get_wtime() was used to profile each part of the code. To give some numbers, these are the figures I obtained in one run:

Scalar execution time: 14.42 sec

Boost thread: 2.28 sec

OpenMP thread: 5.10 sec

Because there are only 2 physical cores (with hyperthreading), the roughly 6.3x speedup over scalar (14.42 / 2.28) obtained by the Boost thread version is quite surprising and "anomalous".
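
For completeness, the timings above come from the usual omp_get_wtime() bracket; a sketch of that pattern (not the exact code from the file):

    #include <omp.h>
    #include <cstdio>

    int main()
    {
        double t0 = omp_get_wtime();
        // ... run one of the four multiply variants here ...
        double t1 = omp_get_wtime();
        std::printf("elapsed: %.2f sec\n", t1 - t0);
        return 0;
    }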

1
I'm going to downvote just because there isn't much relevant information here. Post code, profiling information, etc. – Pubby
How does the boost::thread version run vs. the OMP version with 1 thread only? Try changing the OMP schedule to dynamic/static with various chunking. – Anycorn
Comparing two different implementations, both using threads, does not show superlinear acceleration. Superlinear acceleration means that when we double the number of threads/cores used (but the program is otherwise the same) we get more than double the performance. I don't think this is ever seen; approaching a linear improvement is the ideal that designers strive for. Even if the threads have completely independent work that is integrated into the final result at a very low cost, how can it be superlinear? – Kaz
I believe it is true that each Windows thread is pegged to a particular core. When a core has a context switch, this does not force a dump and reload of the L3 cache; L3 is reloaded as needed. So if a core has a quick switch out of context and then back again, the L3 cache stays mostly intact. – ThomasMcLeod
Something is very strange here... I would be very surprised if you even got 3x speedup with hyperthreading, but 6x with 2 cores + HT is just impossible. Either your serial version is somehow unnaturally slow, or the parallel version does not work correctly. Have you tried to verify at the end whether the parallel implementations return correct results? – Tudor
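
Acting on Tudor's suggestion is cheap; a sketch (my own, assuming the scalar result is kept around as a reference array) of a correctness check:

    #include <cmath>
    #include <cstddef>

    // Compare the parallel result C_par against the scalar reference C_ref,
    // element by element, with a small relative/absolute tolerance.
    bool results_match(const double* C_par, const double* C_ref,
                       std::size_t n, double tol = 1e-9)
    {
        for (std::size_t i = 0; i < n; ++i)
            if (std::fabs(C_par[i] - C_ref[i]) > tol * std::fabs(C_ref[i]) + tol)
                return false;
        return true;
    }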

1 Answer

0 votes

By default, schedule(static,1) will be used, which often results in false sharing and can result in suboptimal scheduling. Try schedule(guided) instead; this often gives much better scaling.
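
A sketch of how the clause would sit on the multiply loop (placeholder names, not the code from the question):

    #include <omp.h>

    void multiply(const double* A, const double* B, double* C, int N)
    {
        // guided: chunks start large and shrink, so threads are not handed
        // single rows, and each thread's iteration range stays contiguous.
        #pragma omp parallel for schedule(guided)
        for (int i = 0; i < N; ++i)
            for (int k = 0; k < N; ++k)
                for (int j = 0; j < N; ++j)
                    C[i*N + j] += A[i*N + k] * B[k*N + j];
    }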