I need to synchronize the computation of 2D array. One element of array is data block of 4x4 int. Each block depends on above block like blk[Y][X] should be compute after blk[Y-1][X]. I try to synchronize by using global map. Map is 2D array of volatile __global ints w/ corresponding size so kernel set value of map[Y][X] after blk[Y][X] has been done.
Each kernel check flag above to know it could start like
__kernel
void kernel_blk_4x4(
__global __read_only uchar * __restrict _src,
__global __write_only uchar * __restrict _dst,
volatile __global int * __restrict map
){
int gidY = get_global_id(1);
int gidX = get_global_id(0);
// --- check flags before starting
volatile int kkk = 0;
volatile __global int * const map_ptr0 = map + (gidY)*31 + gidX;
volatile __global int * const map_ptr1 = map_ptr0 + 1;
volatile int val = *map_ptr0;
while(val == 0) {
kkk++;
val = *map_ptr0;
}
computation here...
volatile __global int *map_ptr = map + (gidY+1)*31 + gidX;
*map_ptr = 1;
}
for first row map already field by 1 so in theory it should work...
Real life is more interesting...
Actually, I get deadlock. But if I add "printf();" somewhere in code, for example show map array before or after while() everything is work fine...
Is any ideas what I'm doing wrong?
Thanks for you help!
EDIT: Synchronization was archived :) But another questions have appeared. Answer: I changed approach. Every thread scans the map and take one block to proceed as only one is ready. Note that 8x8 is only to test.
__global int *map_ptr = map;
int val = 0;
while (1) {
for(int y=0; y<8; y++) {
for(int x=0; x<8; x++) {
val = atomic_cmpxchg(map_ptr+x, 1, 2);
if(val == 1) {
map_ptr += x;
break;
}
}
if(val == 1) break;
map_ptr+=stride;
}
if(val == 1) break;
map_ptr = map;
}
// do some work
__global int *map_next = map_ptr+stride;
atomic_inc(map_next);
This kernel works well in case work group size is 1x1. When I try to change the size, I get clinch.
What might be a reason of it?