0 votes

Please consider the following simple code for summing up values in a parallel for loop:

#include <iostream>
#include <omp.h>
using namespace std;

int nMaxThreads = omp_get_max_threads();
int nTotalSum = 0;
#pragma omp parallel for num_threads(nMaxThreads) \
    reduction(+:nTotalSum)
for (int i = 0; i < 4; i++)
{
    nTotalSum += i;
    cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
}

When I run this on a two-core machine, the output I get is

0: nTotalSum is 0
0: nTotalSum is 1
1: nTotalSum is 2
1: nTotalSum is 5

This suggests to me that the critical section, i.e. the update of nTotalSum, is being executed on every loop iteration. This seems like a waste, when all each thread has to do is calculate a 'local' sum of the values it is adding, then update nTotalSum with this local sum once it is done.

Is my interpretation of the output correct, and if so, how can I make it more efficient? Note I tried the following:

#pragma omp parallel for num_threads(nMaxThreads) \
    reduction(+:nTotalSum)
    int nLocalSum = 0;
    for (int i = 0; i < 4; i++)
    {
        nLocalSum += i;
    }
    nTotalSum += nLocalSum;

...but the compiler complained, stating that it expected a for loop to follow the #pragma omp parallel for statement...


3 Answers

2 votes

Your output does not, in fact, indicate a critical section during the loop. With the reduction clause, each thread gets its own zero-initialized copy of nTotalSum: thread 0 works on i = 0,1 and thread 1 on i = 2,3. At the end, OpenMP takes care of adding the local copies back into the original variable.

You should not try to implement it yourself unless you have specific evidence that you can do it more efficiently. See for example this question / answer.

Your manual version would work if you split the parallel / for into two directives:

int nTotalSum = 0;
#pragma omp parallel
{
  // Declare the local variable here!
  // It is then implicitly private and properly initialized
  int localSum = 0;
  #pragma omp for
  for (int i = 0; i < 4; i++) {
    localSum += i;
    cout << omp_get_thread_num() << ": localSum is " << localSum << endl;
  }
  // Do not forget the atomic, or it would be a race condition!
  // Alternative would be a critical, but that's less efficient
  #pragma omp atomic
  nTotalSum += localSum;
}

I think it's likely that your OpenMP implementation does the reduction just like that.

2 votes

Each OMP thread has its own copy of nTotalSum. At the end of the OMP section, these are combined back into the original nTotalSum. The output you're seeing comes from one thread running loop iterations (0,1) and the other running (2,3). If you print nTotalSum after the loop, you should see the expected result of 6.
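For reference, here is a self-contained sketch of that check (only the trailing print and the boilerplate around your loop are new):

#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    int nTotalSum = 0;
    #pragma omp parallel for reduction(+:nTotalSum)
    for (int i = 0; i < 4; i++)
    {
        nTotalSum += i;  // updates this thread's private copy
    }
    // The private copies have been combined by this point
    cout << "final nTotalSum: " << nTotalSum << endl;  // prints 6
    return 0;
}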

In your nLocalSum example, move the declaration of nLocalSum to before the #pragma omp line; the for loop must be on the line immediately following the pragma. Note that nLocalSum then also needs a data-sharing clause, or it will be shared across threads. A sketch of that rearrangement follows.
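A minimal sketch of what that might look like, assuming nLocalSum takes over the reduction so each thread still gets a private copy:

int nTotalSum = 0;
int nLocalSum = 0;  // declared before the pragma
#pragma omp parallel for num_threads(nMaxThreads) \
    reduction(+:nLocalSum)
for (int i = 0; i < 4; i++)  // the loop now directly follows the directive
{
    nLocalSum += i;
}
nTotalSum += nLocalSum;  // runs on one thread, after the parallel region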

0 votes

From my Parallel Programming in OpenMP book:

The reduction clause can be trickier to understand; it has both private and shared storage behavior. The reduction attribute is used on objects that are the target of an arithmetic reduction. This can be important in many applications... reduction allows it to be implemented by the compiler efficiently... this is such a common operation that OpenMP has the reduction data scope clause just to handle it... the most common example is the final summation of temporary local variables at the end of the parallel construct.

Correction to your second example:

int i;
int total_sum = 0;  /* do all variable initialization prior to the omp pragma */

#pragma omp parallel for \
            private(i) \
            reduction(+:total_sum)
for (i = 0; i < 4; i++)
{
    total_sum += i;  /* you used nLocalSum here */
}
/* note: there is no "#pragma omp end parallel for" in C/C++;
   the construct ends with the loop (that form exists only in Fortran) */

/* At this point in the code,
   all threads have run your `for` loop, with total_sum private to each thread;
   because we used reduction, OpenMP then '+'s together the per-thread values of `total_sum`.
   Do not do an explicit nTotalSum += nLocalSum after the omp for loop;
   it's not needed, the reduction clause takes care of this. */

In your first example, I'm not sure what num_threads(nMaxThreads) is doing in #pragma omp parallel for num_threads(nMaxThreads) reduction(+:nTotalSum). But I suspect the weird output might be caused by print buffering, since unsynchronized stream inserts from different threads can interleave; a sketch of one way to check is below.
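If the interleaved printing is the issue, you could guard the output with a critical section (a debugging sketch only, since serializing the print forces the threads to take turns):

#pragma omp parallel for num_threads(nMaxThreads) reduction(+:nTotalSum)
for (int i = 0; i < 4; i++)
{
    nTotalSum += i;
    #pragma omp critical
    cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
}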

In any case, the reduction clause is very useful and very efficient if used properly. It would be more obvious in a more complicated, real-world example.

Your posted example is so simple that it doesn't show off the usefulness of the reduction clause. Strictly speaking, for your example, since all the threads are just doing a summation, you could instead make total_sum a shared variable in the parallel section and have all the threads add into it; the answer would still be correct, provided each update is protected with a critical directive. A sketch of that variant follows.
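For completeness, a sketch of that critical-based variant (correct, but usually slower than reduction, because every update serializes the threads):

int total_sum = 0;
#pragma omp parallel for
for (int i = 0; i < 4; i++)
{
    #pragma omp critical
    total_sum += i;  /* shared variable; every update takes the lock */
}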