I've working with openCL lately. I create a kernel that basically take one global variable shared by all the work-items in a kernel. The kernel can't be simpler, each work-item increment the value of result, which is the global variable. The code is shown.
__kernel void accumulate(__global int* result) {
*result = 0;
atomic_add(result, 1);
}
Every thing goes fine when the total number of work-items are small. On my MAC pro retina, the result is correct when the work-item is around 400.
However, as I increase the global size, such as, 10000. Instead of getting 10000 when getting back the number stored in result, the value is around 900, which means more than one work-item might access the global at the same time.
I wonder what could be the possible solution for this types of problem? Thanks for the help!