Working with work groups and their sizes in opencl on a single array

Question

I am using OpenCL C++ for the implementation of my project. I want to get the maximum speed/performance out of my GPU/s (depending on whether I have multiple GPUs or a single one). But for the purpose of this question, lets assume I have only one device.

Suppose I have an array of length 100.

double arr[100];

Now what currently I am doing is that I am calling the kernel through the following method.

kernelGA(cl::EnqueueArgs(queue[iter],
                         cl::NDRange(100)),
                         d_arr, // and some other buffers.
         )

Now at the kernel side. I have one global id. that is:

int idx = get_global_id(0);

The way I want my kernel is to work is the following:

Each of the 100 work groups will take care of one element each.
There are some rules with using which each work group is updating the element of the array. eg:
```
if (arr[idx] < 5) {
    arr[idx] = 10; // a very simple example.
}
```

For most of the parts, it is okay. But then there is one point where I want to interchange and where I want the threads/work items to communicate with each other. At that point, they don't seem to work and they don't seem to communicate.

eg:

if(arr[idx] < someNumber) {
    arr[idx] = arr[idx + 1];
}

At this point, nothing seems to work. I tried to implement a for loop and to create a barrier

barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

but it also doesn't work. It doesn't change the values of the array elements.

I have the following questions:

1. Why doesn't it work? Is my implementation wrong? The threads seem to update their own indexed array element correctly. But when it comes to communication between them, they don't work. Why?

2. Is my implementation of the barriers and letting only one work item wrong? Is there a better way to let one item take care of this part while the other items are waiting for this one to finish?

DarkZeros DarkZeros · Accepted Answer · 2017-03-04T19:47:33

The code you wrote is serial:

if(arr[idx] < someNumber) {
   arr[idx] = arr[idx + 1];
}

The worker N will take the result of worker N-1, N-1 the results of N-2, and so on. So it means worker N needs to wait for all the others to complete. Which means the code is not parallel and will never be. You are far better computing that with a CPU than a GPU.

OpenCL design model, allows you to run multiple work items in parallel but the synchronization model only allows to synchronize inside the work group. If you need global sync, is a clear sign that your algorithm is not for OpenCL.

Now, if I assume you just want the value of the last element. And what you really want is to perform a "sum" of all the array. Then, this is a reduction problem, and it is possible to perform it in log(N) time by parallelization in this fashion:

1st step,   array[x] = array[x] + array[N/2+x] (x from 0 to N/2)
2nd step,   array[x] = array[x] + array[N/4+x] (x from 0 to N/4)
...
log(N) passes

Each step will be a separate kernel, and therefore ensures all work items have finished before starting the next batch.

Another faster option is to perform reduction inside the work group, so if the work group size is 256, you can sum groups of 256 together in each pass. Which is faster than just reducing by 2 in each pass.

Working with work groups and their sizes in opencl on a single array

2 Answers