I am trying to increase performance of a rather complex iteration algorithm by parallelizing matrix multiplication, which is being called on each iteration. The algorithm takes 500 iterations and approximately 10 seconds. But after parallelizing matrix multiplication it slows down to 13 seconds. However, when I tested matrix multiplication of the same dimension alone, there was an increase in speed. (I am talking about 100x100 matrices.)
Finally, I switched off any parallelizing inside the algorithm and added on each iteration the following piece of code, which does absolutely nothing and presumably shouldn't take long:
int j;
#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
j = i;
And again, there is a 30% slowdown comparing to the same algorithm without this piece of code.
Thus, calling any parallelization using openmp 500 times inside the main algorithm somehow slows things down. This behavior looks very strange to me, anybody has any clues what the problem is?
The main algorithm is being called by a desktop application, compiled by VS2010, Win32 Release. I work on Intel Core i3 (parallelization creates 4 threads), 64 bit Windows 7.
Here is a structure of a program:
int internal_method(..)
{
...//no openmp here
// the following code does nothing, has nothing to do with the rest of the program and shouldn't take long,
// but somehow adding of this code caused a 3 sec slowdown of the Huge_algorithm()
double sum;
#pragma omp parallel for private(sum)
for (int i = 0; i < 10; i++)
sum = i*i*i / (1.0 + i*i*i*i);
...//no openmp here
}
int Huge_algorithm(..)
{
...//no openmp here
for (int i = 0; i < 500; i++)
{
.....// no openmp
internal_method(..);
......//no openmp
}
...//no openmp here
}
So, the final point is: calling the parallel piece of code 500 times alone (when the rest of the algorithm is omitted) takes less than 0.01 sec, but when you call it 500 times inside a huge algorithm it causes 3 sec delay of the entire algorithm. And what I don't understand is how the small parallel part affects the rest of the algorithm?