2
votes

I'm using global atomics to synchronize between work groups in OpenCL.

So the kernel uses a code like

... global volatile uint* counter;

if(get_local_id(0) == 0) {
    while(*counter != expected_value);
}
barrier(0);

To wait until counter becomes expected_value.

And at another place it does

if(get_local_id(0) == 0) atomic_inc(counter);

Theoretically the algorithm is such that this should always work, if all work groups are running concurrencly. But if one work group starts only after another has completely finished, then the kernel can deadlock.

On CPU and on GPU (NVidia CUDA platform), it seems to always work, with a large number of work groups (over 8000).

For the algorithm this seems to be the most efficient implementation. (It does a prefix sums over each line in a 2D buffer.)

Does OpenCL and/or NVidia's OpenCL implementation guarantee that this always works?

1
The counter is already volatile (required for atomic_inc), so the fence should not be necessary - tmlen
In OpenCL 1.2 there seems to be no atomic_load - tmlen

1 Answers

3
votes

Does OpenCL and/or NVidia's OpenCL implementation guarantee that this always works?

As far as the OpenCL standard is concerned, this is not guaranteed (similarly for CUDA). Now, in practice, it may very well work due to your specific OpenCL implementation, but bear in mind that it's not guaranteed by the standard, so make sure you understand your implementation's execution model to ensure this is safe, and that such code won't necessarily be portable across other conforming implementations.

Theoretically the algorithm is such that this should always work, if all work groups are running concurrencly

OpenCL states that work groups can run in any order, and not necessarily in parallel nor even concurrently. CUDA has similar wording, although CUDA 9 does support a form of grid-wise synchronization.

OpenCL spec, 3.2.2 Execution Model: Execution of kernel-instances:

A conforming implementation may choose to serialize the work-groups so a correct algorithm cannot assume that work-groups will execute in parallel. There is no safe and portable way to synchronize across the independent execution of work-groups since once in the work-pool, they can execute in any order.