Please consider the following simple code for summing up values in a parallel for
loop:
    int nMaxThreads = omp_get_max_threads();
    int nTotalSum = 0;
    #pragma omp parallel for num_threads(nMaxThreads) \
        reduction(+:nTotalSum)
    for (int i = 0; i < 4; i++)
    {
        nTotalSum += i;
        cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
    }
When I run this on a two-core machine, the output I get is:
0: nTotalSum is 0
0: nTotalSum is 1
1: nTotalSum is 2
1: nTotalSum is 5
This suggests to me that the critical section, i.e. the update of nTotalSum, is being executed on every loop iteration. This seems like a waste, when all each thread has to do is calculate a 'local' sum of the values it is adding and then update nTotalSum with this 'local' sum once it has finished.
Is my interpretation of the output correct, and if so, how can I make it more efficient? Note that I tried the following:
    #pragma omp parallel for num_threads(nMaxThreads) \
        reduction(+:nTotalSum)
    int nLocalSum = 0;
    for (int i = 0; i < 4; i++)
    {
        nLocalSum += i;
    }
    nTotalSum += nLocalSum;
...but the compiler complained, stating that it was expecting a for loop following the pragma omp parallel for statement...
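To make the intent concrete, here is a minimal sketch of the 'local sum' pattern I have in mind. It assumes the combined parallel for directive can be split into a parallel region that contains a separate for directive, and it uses an atomic update for the final step. The function name sumWithLocalPartials is just a placeholder of mine, and I have not checked whether the reduction clause already does something equivalent behind the scenes, or whether this is any faster.

```cpp
// Sketch only: sums 0..n-1 using one private partial sum per thread.
// sumWithLocalPartials is a hypothetical name, not part of OpenMP.
// Build with OpenMP enabled (e.g. -fopenmp on GCC/Clang); without it the
// pragmas are ignored and the code simply runs serially.
int sumWithLocalPartials(int n)
{
    int nTotalSum = 0;

    #pragma omp parallel
    {
        // Declared inside the parallel region, so each thread gets its own copy.
        int nLocalSum = 0;

        // Iterations are divided among the threads of the enclosing region.
        #pragma omp for
        for (int i = 0; i < n; i++)
        {
            nLocalSum += i;   // touches only thread-private data
        }

        // Exactly one synchronized update of the shared total per thread.
        #pragma omp atomic
        nTotalSum += nLocalSum;
    }

    return nTotalSum;
}
```

Calling sumWithLocalPartials(4) returns 6 regardless of the number of threads, since the partial sums are only combined once each thread has finished its share of the loop.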