I am trying to map a set of tasks onto a CUDA GPU. There are n tasks to process (see the pseudocode below).
Allocate a boolean array flag[n] and initialize it to false.
for each work-group in parallel do
    while there are still unfinished tasks do
        Do something;
        for a few j_1, j_2, ..., j_m (j_i < k) do
            Wait until task j_i is finished; [ while (!flag[j_i]) ; ]
            Do something;
        end for
        Do something;
        Mark task k finished; [ flag[k] = true; ]
    end while
end for
For certain reasons, I have to use threads in different thread blocks.
The question is how to implement "Wait until task j_i is finished;" and "Mark task k finished;" in CUDA. My implementation uses a boolean array as the flags: I set flag[k] once task k is done, and read flag[j_i] to check whether task j_i is done.
But it only works on small cases; on a large case, the GPU crashes for an unknown reason. Is there a better way to implement Wait and Mark in CUDA?
This is basically a problem of communication between threads in different blocks in CUDA.
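To make the Wait and Mark steps concrete, here is a minimal sketch of one way they could be written. It is not the original implementation, and it rests on several assumptions: the flags are int rather than bool, the names process_tasks, next_task, result and run_dependency_chain are hypothetical, and the dependency rule (task k depends only on task k-1) is made up for illustration. One possible explanation for the failure on large cases is that a spinning block waits on a task owned by a block that has not been scheduled yet, so the kernel never finishes. The sketch tries to avoid this by letting blocks claim tasks in increasing order through an atomic counter, so every dependency j < k already belongs to a running block.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: task k depends only on task k-1 (illustrative rule).
// flag and result are volatile so that loads are not served from stale caches
// or hoisted out of the spin loop by the compiler.
__global__ void process_tasks(int n, volatile int *flag, int *next_task,
                              volatile int *result)
{
    __shared__ int claimed;
    while (true) {
        // Claim the next task in increasing order (one claim per block).
        if (threadIdx.x == 0) claimed = atomicAdd(next_task, 1);
        __syncthreads();
        int k = claimed;
        __syncthreads();          // everyone has read `claimed` before reuse
        if (k >= n) break;        // uniform across the block

        // Wait: a single thread spins on the flag of the dependency.
        if (threadIdx.x == 0 && k > 0) {
            while (flag[k - 1] == 0) { }
            __threadfence();      // order the flag read before reading data
        }
        __syncthreads();          // the whole block may now use task k-1's data

        // "Do something": here only thread 0 produces the output of task k.
        if (threadIdx.x == 0)
            result[k] = (k > 0 ? result[k - 1] : 0) + 1;
        __syncthreads();

        // Mark: publish the data first, then set the flag.
        if (threadIdx.x == 0) {
            __threadfence();
            flag[k] = 1;
        }
    }
}

// Hypothetical host helper: runs n >= 1 tasks and returns result[n-1],
// which equals n if every task ran after its predecessor.
int run_dependency_chain(int n)
{
    int *flag, *next_task, *result;
    cudaMalloc(&flag, n * sizeof(int));
    cudaMalloc(&result, n * sizeof(int));
    cudaMalloc(&next_task, sizeof(int));
    cudaMemset(flag, 0, n * sizeof(int));
    cudaMemset(result, 0, n * sizeof(int));
    cudaMemset(next_task, 0, sizeof(int));

    process_tasks<<<64, 32>>>(n, flag, next_task, result);

    int last = -1;
    cudaError_t err = cudaMemcpy(&last, result + n - 1, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(flag);
    cudaFree(result);
    cudaFree(next_task);
    return err == cudaSuccess ? last : -1;
}
```

Calling run_dependency_chain(n) from host code should return n when the chain completes. The two __threadfence() calls are the important part of the design: the one before flag[k] = 1 makes the task's output visible before the flag, and the one after the spin loop keeps the reader from using data older than the flag it just observed.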