1
votes

I need to synchronize the computation of a 2D array. Each element of the array is a 4x4 block of ints. Each block depends on the block above it: blk[Y][X] should be computed after blk[Y-1][X]. I try to synchronize using a global map. The map is a 2D array of volatile __global ints with corresponding size, so the kernel sets map[Y][X] after blk[Y][X] has been computed.

Each work-item checks the flag above it to know when it can start, like this:

__kernel void kernel_blk_4x4(
    __global __read_only uchar  * __restrict _src,
    __global __write_only uchar * __restrict _dst,
    volatile __global int       * __restrict map)
{
    int gidY = get_global_id(1);
    int gidX = get_global_id(0);

    // --- check the flag before starting
    volatile int kkk = 0;
    volatile __global int * const map_ptr0 = map + gidY*31 + gidX;
    volatile __global int * const map_ptr1 = map_ptr0 + 1;

    volatile int val = *map_ptr0;

    while (val == 0) {
        kkk++;
        val = *map_ptr0;
    }

    // ... computation here ...

    volatile __global int *map_ptr = map + (gidY+1)*31 + gidX;
    *map_ptr = 1;
}


For the first row the map is already filled with 1s, so in theory it should work...
Real life is more interesting...
Actually, I get a deadlock. But if I add a "printf();" somewhere in the code, for example to dump the map array before or after the while(), everything works fine...


Any ideas what I'm doing wrong?

Thanks for your help!

EDIT: Synchronization was achieved :) But new questions have appeared. Answer: I changed the approach. Every thread scans the map and claims one block to process, since only one block is ready at a time. Note that the 8x8 size is only for testing.

__global int *map_ptr = map;
int val = 0;

while (1) {
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            val = atomic_cmpxchg(map_ptr + x, 1, 2);  // try to claim a ready block
            if (val == 1) {
                map_ptr += x;
                break;
            }
        }
        if (val == 1) break;
        map_ptr += stride;
    }
    if (val == 1) break;
    map_ptr = map;  // nothing was ready: rescan from the top
}

// do some work

__global int *map_next = map_ptr+stride;
atomic_inc(map_next);

This kernel works well when the work-group size is 1x1. When I try to increase the size, I get a deadlock.

What might be the reason for it?

1
It simply cannot be done. You will always get a deadlock except in the 1-work-item-per-work-group case. I dunno what the printf() is doing, but maybe it is serializing the code somehow. That is not a useful use case, though. Change your algorithm to a fully parallel one instead. I would recommend launching a batch of kernels, one per row, and operating fully in parallel along the column dimension. BTW, post the full code for better help (what are gidY/gidX, what is map, the arguments, etc.). - DarkZeros
@DarkZeros, thanks for your answer. I posted the whole kernel without the computations themselves (currently I have commented them out on my side too). Unfortunately, I can't do it per row; this is a simplified version of the code, and in fact there are a few more dependencies on the row above. - user3124812
Even though OpenCL has the keyword volatile, that doesn't mean global memory is synced across work-groups. The spec says that different work-groups reading and writing the same global memory area is undefined. So you might see the other group's writes, or you might not, or you might get garbage. - sharpneli
There is no guarantee of the order in which work-groups get processed; they can run out of order, serially, in parallel, or in any combination. Any attempt to synchronize between work-groups is a recipe for problems. You should instead enqueue multiple kernels, which does guarantee execution order. - Dithermaster
@user3124812 I don't understand: you have `_src` and `_dst` as read-only and write-only variables. Then why do you need blk[Y-1][X] to be executed before blk[Y][X]? There is no possible use of blk[Y-1][X] in blk[Y][X], since all the global data is read-only or write-only. If you are doing this just for the sake of ordered execution, you shouldn't; adding extra constraints and syncs to the OpenCL execution will only slow it down terribly. - DarkZeros

1 Answer

0
votes

OpenCL on GPU devices uses SIMD execution (Single Instruction, Multiple Data). This means that all the work-items in a group ALWAYS execute the same code, since the hardware does not allow anything else.

In this SIMD case, it is not possible to have a "break", "for", "while", or any other loop construct that is not encountered the same number of times by all the work-items in a group.

Since your code explicitly requires that ONLY one work-item exits the loop, it will never work. It will enter a deadlock state, at least on any SIMD device (which is every GPU that exists).

The only way to do this is to have only 1 work-item (WI) per work-group (WG), which is highly inefficient, or to use a CPU. Your question cannot be answered any other way: OpenCL cannot do what you need, and there is no trick or piece of code that can solve your problem. You have to change your algorithm to make it truly parallel.

NOTE: Using a 1x1 WG, or printf(), serializes the execution, which hides the problem. But in that case there is no point in using a GPU at all.