I am using OpenCL C++ to implement my project. I want to get the maximum performance out of my GPU(s), whether I have a single device or several. For the purpose of this question, let's assume I have only one device.
Suppose I have an array of length 100.
double arr[100];
Currently I launch the kernel like this:
kernelGA(cl::EnqueueArgs(queue[iter],
                         cl::NDRange(100)),
         d_arr /* , and some other buffers */);
On the kernel side, I have a single global ID:
int idx = get_global_id(0);
I want my kernel to work as follows:
- Each of the 100 work items takes care of one element.
Each work item updates its element of the array according to some rules, e.g.:
if (arr[idx] < 5) { arr[idx] = 10; } // a very simple example
For the most part, this works fine. But at one point I want the work items to exchange values, i.e. to communicate with each other, and that is where it breaks: they don't seem to see each other's updates.
For example:
if(arr[idx] < someNumber) {
arr[idx] = arr[idx + 1];
}
At this point, nothing seems to work. I tried adding a for loop and a barrier:
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
but that doesn't work either. The values of the array elements don't change.
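To be concrete about the result I expect from the neighbour update above, here is a minimal reference sketch in plain C++ (no OpenCL). The function name `shiftFromNeighbour` is hypothetical and only for illustration. It also assumes that the last element, which has no right-hand neighbour, is left unchanged. Every element is read from an unmodified copy of the input, so the outcome does not depend on the order in which elements are processed:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference for the neighbour update (plain C++, no OpenCL).
// Every element is read from the unmodified input, so the result does not
// depend on the order in which the elements are processed.
std::vector<double> shiftFromNeighbour(const std::vector<double>& in,
                                       double someNumber) {
    std::vector<double> out(in);  // start from a copy of the input
    // Assumption: the last element has no right-hand neighbour,
    // so it is left unchanged.
    for (std::size_t idx = 0; idx + 1 < in.size(); ++idx) {
        if (in[idx] < someNumber) {
            out[idx] = in[idx + 1];  // read the old neighbour value
        }
    }
    return out;
}
```

For example, with the input {1, 2, 9, 3} and someNumber = 5, I would expect {2, 9, 9, 3}.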
I have the following questions:
1. Why doesn't it work? Is my implementation wrong? Each work item seems to update its own array element correctly, but as soon as they need to communicate with each other, it fails. Why?
2. Is my use of barriers, with only one work item doing the update, wrong? Is there a better way to let one work item take care of this part while the others wait for it to finish?