I'm using global atomics to synchronize between work groups in OpenCL.
So the kernel uses code like

    ... global volatile uint* counter;
    if (get_local_id(0) == 0) {
        while (*counter != expected_value);  // spin until another work group bumps the counter
    }
    barrier(0);

to wait until counter becomes expected_value.
And at another place it does

    if (get_local_id(0) == 0) atomic_inc(counter);
Theoretically the algorithm is such that this should always work, as long as all work groups are running concurrently. But if one work group starts only after another has completely finished, the kernel can deadlock.
On the CPU and on the GPU (NVidia CUDA platform) it always seems to work, even with a large number of work groups (over 8000).
For the algorithm this seems to be the most efficient implementation. (It computes prefix sums over each line of a 2D buffer.)
Does OpenCL and/or NVidia's OpenCL implementation guarantee that this always works?
counter is already volatile (required for atomic_inc), so the fence should not be necessary – tmlen
atomic_load – tmlen