I want to run an instrumented OpenCL kernel to collect some execution metrics. More specifically, I have added a hidden global buffer, which the host code initializes with N zeros. Each of the N values is an integer representing a different metric, and each kernel instance increments these values differently depending on its execution path.
A simplistic example:
    __kernel void test(__global int *a, __global int *hiddenCounter) {
        if (get_global_id(0) == 0) {
            // do stuff and then increment the appropriate counter (random numbers here)
            hiddenCounter[0] += 3;
        } else {
            // do stuff...
            hiddenCounter[1] += 5;
        }
    }
After the kernel has finished executing, I need the host code to aggregate all the hiddenCounter buffers (a simple element-wise vector addition) and print the results.
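To make the aggregation step concrete, here is a minimal sketch of what I have in mind on the host side. It assumes that every kernel instance ends up with its own slice of N counters and that all slices are read back into one contiguous array. The function name `aggregate_counters` and that memory layout are hypothetical, made up for illustration, and not part of the OpenCL API.

```c
#include <stddef.h>

/* Hypothetical helper: element-wise sum of per-instance counter slices.
 * Assumed layout: per_item[i * n_metrics + m] holds metric m of instance i.
 * totals must point to n_metrics elements; it is overwritten. */
void aggregate_counters(const int *per_item, size_t num_items,
                        size_t n_metrics, long long *totals)
{
    /* Start every metric total at zero. */
    for (size_t m = 0; m < n_metrics; ++m)
        totals[m] = 0;

    /* Add each instance's slice into the totals, metric by metric. */
    for (size_t i = 0; i < num_items; ++i)
        for (size_t m = 0; m < n_metrics; ++m)
            totals[m] += per_item[i * n_metrics + m];
}
```

For example, with three instances and N = 2, the slices {3, 0}, {0, 5} and {0, 5} would be summed to {3, 10}.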
My question is whether there are race conditions when multiple kernel instances write to the same index of the hiddenCounter buffer (which will definitely happen in my project). Do I need to enforce some kind of synchronization? Or is this impossible with a __global argument, so that I need to change it to __private? If so, will I be able to aggregate the __private buffers from the host code afterwards?