
I have a situation where I need to repeat a specific iteration of a loop multiple times. So, in that iteration, I decrement the loop index by one so that the next increment of the index makes no difference.

This approach, which is the approach I have to implement, works for multi-threaded OpenMP code. However, it does not work with OpenACC (for both the multicore and tesla targets). I get the following error:

Floating point exception (core dumped)

Here is the code for both cases:

#include <stdio.h>
#include <omp.h>
#include <unistd.h>

int main() {

    int x = 52;
    int count = 5;
    int i;

    omp_set_num_threads(6);

    #pragma omp parallel for
    for(i=0;i<100;i++) {
        if(i == x) {
            printf("%d\n", i);
            i--;        /* step the index back so the next i++ repeats this iteration */
            count--;
            if(count == 0)
                x = 10000;
        }
    }


    int gpu_count = 0;
    count = 5;
    x = 52;

    #pragma acc parallel loop independent
    for(i=0;i<1000000;i++) {
        if(i == x) {
            #pragma acc atomic
            gpu_count++;

            i--;        /* step the index back to repeat this iteration */
            count--;
            if(count == 0)
                x = 2000000;
        }
    }
    printf("gpu_count: %d\n", gpu_count);


    return 0;
}

For OpenMP, I get the correct output:

52
52
52
52
52

But for OpenACC, I get the above-mentioned error.

If I comment out the i--; line in the OpenACC loop, the code executes correctly and outputs the number of repeated iterations (which is 1).

Note: I am using PGI 16.5 with a GeForce GTX 970 and CUDA 7.5.

I compile with the PGI compiler as follows:

pgcc -mp -acc -ta=multicore -g f1.c

So, my question is: why do I see this behavior? Can't I change the loop index variable in OpenACC?

I would indeed be quite surprised if that were allowed. You are not supposed to do such things in parallel loops in general, not only in OpenACC (which I don't know well enough to answer). - Vladimir F
If you need to repeat something inside one iteration, do it some other way. Have a while loop inside, for example. - Vladimir F
@VladimirF: thanks for your edits. I intentionally put the OpenACC at the beginning to make it more friendly to all users. About the while approach that you proposed: actually, my intention with the above approach was to avoid an inner while loop. That's why I wanted to do it as shown above. - Millad
Don't put tags at the beginning like that; that's what the tag list is for. meta.stackexchange.com/questions/19190/… - Vladimir F

1 Answer


Your OpenMP version is in error. You're relying on a static schedule where the chunk size is larger than "count". If you increase the number of OMP threads so the chunk size is smaller than "count", or if you change the schedule to interleave the chunks (i.e. "schedule(static,1)"), then you'll get wrong answers (see the sketch below). There are also race conditions on "x" and "count".
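
For example, here is a minimal sketch (my own illustration, not part of the original program) of the question's OpenMP loop with an interleaved schedule. Because modifying the iteration variable of an OpenMP for loop is non-conforming, the i-- trick no longer repeats iteration 52, and the output is wrong:

#include <stdio.h>
#include <omp.h>

int main() {

    int x = 52;
    int count = 5;
    int i;

    omp_set_num_threads(6);

    /* chunk size 1: consecutive iterations go to different threads */
    #pragma omp parallel for schedule(static,1)
    for(i=0;i<100;i++) {
        if(i == x) {
            printf("%d\n", i);  /* typically not printed 5 times */
            i--;                /* non-conforming: i is the loop iteration variable */
            count--;            /* unsynchronized update: race on count */
            if(count == 0)
                x = 10000;      /* unsynchronized update: race on x */
        }
    }

    return 0;
}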

Note that OpenACC scheduling is more like OpenMP "schedule(static,1)", so that vectors can access contiguous blocks of memory across a worker (aka a CUDA warp). So your algorithm won't work here either.
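
Concretely, here is a minimal runnable sketch (my illustration; the 128-wide vector length matches the compiler feedback shown below) of how such a loop is decomposed:

#include <stdio.h>

int main() {
    int i;
    int sum = 0;

    /* illustration only: a schedule similar to OpenMP "static,1" */
    #pragma acc parallel loop gang vector vector_length(128) reduction(+:sum)
    for(i=0;i<1000000;i++) {
        /* iteration i runs roughly as lane (i % 128) of gang (i / 128),
           so consecutive iterations sit in consecutive lanes of a warp */
        sum += 1;
    }
    printf("%d\n", sum);
    return 0;
}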

Also, by using the "independent" clause (which is implied when using "parallel loop"), you are asserting to the compiler that this loop contains no dependencies, or that the user will handle them via the "atomic" directive. However, changing the loop index variable inside the body of the loop creates a loop dependency, since the value of the loop index depends on whether the previous iteration changed its value.
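
One way to honor that assertion (a sketch of my own, not the original poster's code) is to leave the index untouched and repeat the work inside the iteration, using "atomic" for the shared counter:

#include <stdio.h>

int main() {

    int x = 52;
    int count = 5;
    int gpu_count = 0;
    int i, r;

    /* copy(gpu_count) keeps the counter shared; scalars are
       otherwise firstprivate in the parallel region */
    #pragma acc parallel loop copy(gpu_count) private(r)
    for(i=0;i<1000000;i++) {
        if(i == x) {
            /* repeat the work count times instead of rewinding i */
            for(r=0;r<count;r++) {
                #pragma acc atomic update
                gpu_count++;
            }
        }
    }
    printf("gpu_count: %d\n", gpu_count);  /* expected: 5 */

    return 0;
}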

Edit: below is a parallelizable version of your code.

% cat test2.c
#include <stdio.h>
#include <omp.h>
#include <unistd.h>

int main() {

    int x = 52;
    int count = 5;
    int i;
    int mycnt;

    /* interleaved schedule; each thread gets its own private repeat counter */
    #pragma omp parallel for schedule(static,1) private(mycnt)
    for(i=0;i<100;i++) {
        if(i == x) {
          mycnt = count;
          while(mycnt > 0) {
            printf("%d\n", i);
            mycnt--;
          }
        }
    }

#ifdef _OPENACC
    int gpu_count = 0;
    /* the reduction avoids a race on gpu_count */
    #pragma acc parallel loop reduction(+:gpu_count)
    for(i=0;i<1000000;i++) {
        if(i == x) {
          mycnt = count;
          while(mycnt > 0) {
            gpu_count++;
            mycnt--;
          }
        }
    }
    printf("gpu_count: %d\n", gpu_count);
#endif
    return 0;
}

% pgcc -fast -mp -acc test2.c  -Minfo=mp,acc
main:
     13, Parallel region activated
         Parallel loop activated with static cyclic schedule
     24, Barrier
         Parallel region terminated
     25, Accelerator kernel generated
         Generating Tesla code
         25, Generating reduction(+:gpu_count)
         26, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         29, #pragma acc loop seq
     29, Loop carried scalar dependence for gpu_count at line 30
% a.out
52
52
52
52
52
gpu_count: 5