While writing demonstration code (matrix multiplication) to show my students that one MUST use the cache correctly even in parallel code, I found that C++2011 threads (via boost::thread) outperform OpenMP (parallel for) threads by more than a factor of 2!
The only explanation I can imagine is that with C++2011 threads each thread always runs on the same core, so its data can stay in cache.
Could this be real?
The demo code is quite long and boring (the same matrix initialization and multiplication repeated four times: in the correct, wrong (bad cache access), scalar, and OpenMP versions) and can be found at:
http://www.giuseppelevi.com/uploads/3/2/9/8/3298932/matrix_mul_boost_thread.cpp
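To make the "correct" versus "bad cache access" distinction concrete, here is a minimal sketch of two loop orders for a row-major matrix product. It is not taken from the linked file: the function names `multiply_ijk` and `multiply_ikj`, the flat `std::vector<double>` storage, and the square `n x n` shape are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers, not from the linked demo.
// Matrices are n x n, row-major, stored in flat vectors.

// i-j-k order: the inner loop reads b with stride n (down a column),
// so consecutive iterations usually touch different cache lines.
std::vector<double> multiply_ijk(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}

// i-k-j order: the inner loop reads b and writes c with stride 1
// (along a row), which is the cache-friendly access pattern.
std::vector<double> multiply_ikj(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}
```

Both functions compute the same product; only the memory access pattern of the inner loop differs, so any timing gap between them on large matrices comes from the cache behaviour rather than from the arithmetic.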
The compiler I'm using is VS2010, with Boost 1.45. Processor: Intel Core i5 M430.
omp_get_wtime() was used to time each part of it.
To give some numbers, in one run I obtained these figures:
Scalar execution time: 14.42 sec
BOOST Thread: 2.28 sec
OpenMP Thread: 5.10 sec
Because there are only 2 physical cores with hyperthreading, the 6.31 speedup (vs. scalar) obtained by BOOST Thread is quite surprising and "anomalous".
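The arithmetic behind that remark can be sketched as follows. The helper names `speedup` and `is_superlinear` are hypothetical, and the count of 4 hardware threads is an assumption (2 physical cores times 2 with hyperthreading).

```cpp
#include <cassert>

// Hypothetical helpers for checking the figures quoted above.

// Speedup of a parallel run relative to the scalar baseline.
double speedup(double scalar_seconds, double parallel_seconds) {
    assert(scalar_seconds > 0.0 && parallel_seconds > 0.0);
    return scalar_seconds / parallel_seconds;
}

// True when the speedup exceeds the number of hardware threads,
// i.e. more than ideal linear scaling would allow.
bool is_superlinear(double measured_speedup, unsigned hardware_threads) {
    return measured_speedup > static_cast<double>(hardware_threads);
}
```

With the timings above, `speedup(14.42, 2.28)` is roughly 6.3 and `speedup(14.42, 5.10)` is roughly 2.8. Only the first exceeds the assumed 4 hardware threads, so the BOOST Thread figure is the one that cannot be explained by parallelism alone.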