2
votes

I'm currently learning Parallel Programming using C and OpenMP. I wanted to write simple code where two shared values are beeing incremented by multiple threads. Firstly I used reduction directive and it worked as it was meant to. Then I switched to using the critical directive to initiate critical section - it also worked. Out of curiosity I've tried to merge these two solution and check the behaviour. I expected two valid, equal values.

code:

#include <stdio.h>
#include <stdlib.h>
#include "omp.h"

#define ITER 50000

int main( void )
{
    int x, y;
    #pragma omp parallel reduction(+:x,y)
    {
       #pragma omp for
       for (int i = 0; i < ITER; i++ )  
       {
            x++;
            #pragma omp critical
            y++;
       }
    }

    printf("non critical = %d\ncritical = %d\n", x, y);
    return 0;
}

output:

non critical = 50000
critical = 4246432

Of course output is random when it comes to 'critical' (variable y), the other behaves as expected and is always 50000.

The behaviour of x is understandable - reduction makes it private in scope of single thread. After the incrementation values from threads are summed up and passed to the non-local x.

What I don't understand is the behaviour of y. It's private just like x but it's also inside the critical section so it 'has more than one reason' to be inaccessible from other threads. Yet what, I think, happens is the race condition. Did the critical somehow made y public (shared)?

I know that this code makes no sense since it's enough to use only one of reduction / critical. I'd like just to know what's behind such behaviour.

2
And who initializes the variables? Seems like UB.2501
OpenMP does that in reduction directive automatically (0 for + and -, 1 for * and /)user4433856
For private copies, but not for the visible variable. What happens if you initialize it?2501
I was convinced that the reduction implicitly privatises reduced values.user4433856
@nullPointer You are technically correct. But why rely on OpenMP implicit behavior when you can be verbose and fully convince yourself that what you are writing is working exactly as you expect it to?NoseKnowsAll

2 Answers

6
votes

The primary problem with your code is that x and y are not initialized. A second problem is that the variable used in the critical section should be shared instead of a reduction variable, although this should only affect performance, not correctness.

I've corrected your code and modified it to demonstrate how reduce, critical and atomic all produce the same result.

Source

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    int iter = (argc>1) ? atoi(argv[1]) : 50000;
    int r=0, c=0, a=0;

    printf("OpenMP threads = %d\n", omp_get_max_threads() );

    #pragma omp parallel reduction(+:r) shared(c,a)
    {
        #pragma omp for
        for (int i = 0; i < iter; i++ ) {
            r++;
            #pragma omp critical
            c++;
            #pragma omp atomic
            a++;
        }
    }
    printf("reduce   = %d\n"
           "critical = %d\n"
           "atomic   = %d\n", r, c, a);
    return 0;
}

Compile

icc -O3 -Wall -qopenmp -std=c99 redcrit.c

Output

OpenMP threads = 4
reduce   = 50000
critical = 50000
atomic   = 50000
6
votes

Your code simply exhibits undefined behaviour and the presence of critical has nothing to do with you getting wrong results.

Did the critical somehow made y public (shared)?

No, it did not. It only slows down the loop by preventing the concurrent execution of the threads.

What you are missing is that the result of the reduction operation is combined with the initial value of the reduction variable, i.e. with the value the variable had before the parallel region. In your case, both x and y have random initial values and therefore you are getting random results. That the initial value x happens to be 0 in your case and that's why you are getting the correct result for it is simply UB. Initialising both x and y makes your code behave as expected.

The OpenMP specification states:

The reduction clause specifies a reduction-identifier and one or more list items. For each list item, a private copy is created in each implicit task or SIMD lane, and is initialized with the initializer value of the reduction-identifier. After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier.

Here is the execution of your original code with 4 threads:

$ icc -O3 -openmp -std=c99 -o cnc cnc.c
$ OMP_NUM_THREADS=1 ./cnc
non critical = 82765
critical = 50000
$ OMP_NUM_THREADS=4 ./cnc
non critical = 82765
critical = 50000
$ OMP_NUM_THREADS=4 ./cnc
non critical = 50000
critical = 50000
$ OMP_NUM_THREADS=4 ./cnc
non critical = 82765
critical = 50194
$ OMP_NUM_THREADS=4 ./cnc
non critical = 82767
critical = 2112072800

The first run with one thread demonstrates that it is not due to a data race.

With int x=0, y=0;:

$ icc -O3 -openmp -std=c99 -o cnc cnc.c
$ OMP_NUM_THREADS=4 ./cnc
non critical = 50000
critical = 50000
$ OMP_NUM_THREADS=4 ./cnc
non critical = 50000
critical = 50000
$ OMP_NUM_THREADS=4 ./cnc
non critical = 50000
critical = 50000
$ OMP_NUM_THREADS=4 ./cnc
non critical = 50000
critical = 50000