Is it safe for an OpenCL kernel to randomly write to a __global buffer?

Question

I want to run an instrumented OpenCL kernel to get some execution metrics. More specifically, I have added a hidden global buffer which will be initialized from the host code with N zeros. Each of the N values are integers and they represent a different metric, which each kernel instance will increment in a different manner, depending on its execution path.

A simplistic example:

__kernel void test(__global int *a, __global int *hiddenCounter) {
    if (get_global_id(0) == 0) {
        // do stuff and then increment the appropriate counter (random numbers here)
        hiddenCounter[0] += 3;
    }
    else {
        // do stuff...
        hiddenCounter[1] += 5;
    }
}

After the kernel execution is complete, I need the host code to aggregate (a simple element-wise vector addition) all the hiddenCounter buffers and print the appropriate results.

My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer (which will definitely happen in my project). Do I need to enforce some kind of synchronization? Or is this impossible with __global arguments and I need to change it to __private? Will I be able to aggregate __private buffers from the host code afterwards?

pmdj pmdj · Accepted Answer · 2019-11-08T12:20:00

My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer

The answer to this is emphatically yes, your code will be vulnerable to race conditions as currently written.

Do I need to enforce some kind of synchronization?

Yes, you can use global atomics for this purpose. All but the most ancient GPUs will support this. (anything supporting OpenCL 1.2, or cl_khr_global_int32_base_atomics and similar extensions)

Note that this will have a non-trivial performance overhead. Depending on your access patterns and frequency, collecting intermediate results in private or local memory and writing them out to global memory at the end of the kernel may be faster. (In the local case, the whole work group would share just one global atomic call for each updated cell - you'll need to use local atomics or a reduction algorithm to accumulate the values from individual work items across the group though.)

Another option is to use a much larger global memory buffer, with counters for each work item or group. In that case, you will not need atomics to write to them, but you will subsequently need to combine the values on the host. This uses much more memory, obviously, and likely more memory bandwidth too - modern GPUs should cache accesses to your small hiddenCounter buffer. So you'll need to work out/try which is the lesser evil in your case.

Is it safe for an OpenCL kernel to randomly write to a __global buffer?

1 Answers