
I was trying to parallelize the following loop in my code with OpenMP:

 double pottemp,pot2body;
 pot2body=0.0;
 pottemp=0.0;

 #pragma omp parallel for reduction(+:pot2body) private(pottemp) schedule(dynamic)
 for(int i=0;i<nc2;i++)
    {
     pottemp=ener2body[i]->calculatePot(ener2body[i]->m_mols);
     pot2body+=pottemp;
    }

Inside the function 'calculatePot', a very important loop has also been parallelized with OpenMP:

   double CEnergymulti::calculatePot(vector<CMolecule*> m_mols)
   {
        ...

        #pragma omp parallel for reduction(+:dev) schedule(dynamic)
        for (int i = 0; i < i_max; i++)
        {
         ...
        }
   }

So my parallelization involves nested loops. When I removed the parallelization of the outermost loop, the program ran much faster than the version with the outermost loop parallelized. The test was performed on 8 cores.

I think this low parallel efficiency might be related to the nested loops. Someone suggested using 'collapse' when parallelizing the outermost loop. However, since there is still code between the outermost loop and the inner loop, I was told 'collapse' cannot be used in this situation. Are there any other ways I could make this parallelization more efficient while still using OpenMP?

Thanks a lot.

Why do you need to parallelize both loops? If calculatePot is long-running enough to warrant parallelizing the loop it contains, that loop alone should offer enough parallelism to use up all available parallel resources. If it doesn't always do so, you could use an OpenMP if clause to skip parallelizing the inner loop when it isn't useful. – Grizzly

1 Answer


If i_max is independent of i in the outer loop, you can try fusing the two loops (essentially what collapse does). This is something I do often, and it usually gives me a small boost. I also prefer fusing the loops "by hand" rather than with OpenMP, because Visual Studio only supports OpenMP 2.0, which does not have collapse, and I want my code to work on both Windows and Linux.

#pragma omp parallel for reduction(+:pot2body) schedule(dynamic)
for(int n=0; n<(nc2*i_max); n++) {
    int i = n/i_max; // i from the outer loop
    int j = n%i_max; // j from the inner loop
    double pottmp_j = ...
    pot2body += pottmp_j;
}

If i_max depends on the outer index i, then this won't work; in that case, follow Grizzly's advice. But there is one more thing you can try. OpenMP has an overhead: if i_max is too small, using OpenMP can actually make the loop slower. If you add an if clause at the end of the pragma, OpenMP will only parallelize the loop when the condition is true. Like this:

const int threshold = ... // smallest value for which OpenMP gives a speedup.
#pragma omp parallel for reduction(+:dev) schedule(dynamic) if(i_max > threshold)