0 votes

Please consider the following simple code for summing up values in a parallel for loop:

#include <iostream>
#include <omp.h>
using namespace std;

int nMaxThreads = omp_get_max_threads();
int nTotalSum = 0;
#pragma omp parallel for num_threads(nMaxThreads) \
    reduction(+:nTotalSum)
for (int i = 0; i < 4; i++)
{
    nTotalSum += i;
    cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
}

When I run this on a two-core machine, the output I get is

0: nTotalSum is 0
0: nTotalSum is 1
1: nTotalSum is 2
1: nTotalSum is 5

This suggests to me that the critical section, i.e. the update of nTotalSum, is being executed on every loop iteration. This seems like a waste, when all each thread has to do is calculate a 'local' sum of the values it is adding, then update nTotalSum with this local sum once it is done.

Is my interpretation of the output correct, and if so, how can I make it more efficient? Note I tried the following:

#pragma omp parallel for num_threads(nMaxThreads) \
    reduction(+:nTotalSum)
    int nLocalSum = 0;
    for (int i = 0; i < 4; i++)
    {
        nLocalSum += i;
    }
    nTotalSum += nLocalSum;

...but the compiler complained, stating that it expected a for loop to follow the #pragma omp parallel for statement...


3 Answers

2 votes

Your output does not, in fact, indicate a critical section during the loop. With the reduction clause, each thread gets its own zero-initialized copy of nTotalSum: thread 0 works on i = 0,1 and thread 1 on i = 2,3. At the end, OpenMP takes care of adding the local copies back into the original variable.

You should not try to implement it yourself unless you have specific evidence that you can do it more efficiently. See for example this question / answer.

Your manual version would work if you split the parallel / for into two directives:

int nTotalSum = 0;
#pragma omp parallel
{
  // Declare the local variable here!
  // It is then implicitly private and properly initialized
  int localSum = 0;
  #pragma omp for
  for (int i = 0; i < 4; i++) {
    localSum += i;
    cout << omp_get_thread_num() << ": localSum is " << localSum << endl;
  }
  // Do not forget the atomic, or it would be a race condition!
  // Alternative would be a critical, but that's less efficient
  #pragma omp atomic
  nTotalSum += localSum;
}

I think it's likely that your OpenMP implementation does the reduction just like that.

2 votes

Each OMP thread has its own copy of nTotalSum. At the end of the OMP section, these are combined back into the original nTotalSum. The output you're seeing comes from one thread running loop iterations (0,1) and the other running (2,3). If you print nTotalSum after the loop, you should see the expected result of 6.
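For reference, here is a self-contained sketch of that check (only the trailing print and the boilerplate around your loop are new):

#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    int nTotalSum = 0;
    #pragma omp parallel for reduction(+:nTotalSum)
    for (int i = 0; i < 4; i++)
    {
        nTotalSum += i;  // updates this thread's private copy
    }
    // The private copies have been combined by this point
    cout << "final nTotalSum: " << nTotalSum << endl;  // prints 6
    return 0;
}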

In your nLocalSum example, move the declaration of nLocalSum to before the #pragma omp line; the for loop must be on the line immediately following the pragma. Note that nLocalSum then also needs a data-sharing clause, or it will be shared across threads. A sketch of that rearrangement follows.
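A minimal sketch of what that might look like, assuming nLocalSum takes over the reduction so each thread still gets a private copy:

int nTotalSum = 0;
int nLocalSum = 0;  // declared before the pragma
#pragma omp parallel for num_threads(nMaxThreads) \
    reduction(+:nLocalSum)
for (int i = 0; i < 4; i++)  // the loop now directly follows the directive
{
    nLocalSum += i;
}
nTotalSum += nLocalSum;  // runs on one thread, after the parallel region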

0 votes

From my Parallel Programming in OpenMP book:

The reduction clause can be trickier to understand; it has both private and shared storage behavior. The reduction attribute is used on objects that are the target of an arithmetic reduction. This can be important in many applications... reduction allows it to be implemented by the compiler efficiently... this is such a common operation that OpenMP has the reduction data scope clause just to handle it... the most common example is the final summation of temporary local variables at the end of the parallel construct.

Correction to your second example:

int i;
int total_sum = 0;  /* do all variable initialization prior to the omp pragma */

#pragma omp parallel for \
            private(i) \
            reduction(+:total_sum)
for (i = 0; i < 4; i++)
{
    total_sum += i;  /* you used nLocalSum here */
}
/* note: there is no "#pragma omp end parallel for" in C/C++;
   the construct ends with the loop (that form exists only in Fortran) */

/* At this point in the code,
   all threads have run your `for` loop, with total_sum private to each thread;
   because we used reduction, OpenMP then '+'s together the per-thread values of `total_sum`.
   Do not do an explicit nTotalSum += nLocalSum after the omp for loop;
   it's not needed, the reduction clause takes care of this. */

In your first example, I'm not sure what num_threads(nMaxThreads) is doing in #pragma omp parallel for num_threads(nMaxThreads) reduction(+:nTotalSum). But I suspect the weird output might be caused by print buffering, since unsynchronized stream inserts from different threads can interleave; a sketch of one way to check is below.
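If the interleaved printing is the issue, you could guard the output with a critical section (a debugging sketch only, since serializing the print forces the threads to take turns):

#pragma omp parallel for num_threads(nMaxThreads) reduction(+:nTotalSum)
for (int i = 0; i < 4; i++)
{
    nTotalSum += i;
    #pragma omp critical
    cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
}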

In any case, the reduction clause is very useful and very efficient if used properly. It would be more obvious in a more complicated, real-world example.

Your posted example is so simple that it doesn't show off the usefulness of the reduction clause. Strictly speaking, for your example, since all the threads are just doing a summation, you could instead make total_sum a shared variable in the parallel section and have all the threads add into it; the answer would still be correct, provided each update is protected with a critical directive. A sketch of that variant follows.
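For completeness, a sketch of that critical-based variant (correct, but usually slower than reduction, because every update serializes the threads):

int total_sum = 0;
#pragma omp parallel for
for (int i = 0; i < 4; i++)
{
    #pragma omp critical
    total_sum += i;  /* shared variable; every update takes the lock */
}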