1
votes

I am using OpenCL C++ for the implementation of my project. I want to get the maximum speed/performance out of my GPU/s (depending on whether I have multiple GPUs or a single one). But for the purpose of this question, lets assume I have only one device.

Suppose I have an array of length 100.

double arr[100];

Now what currently I am doing is that I am calling the kernel through the following method.

kernelGA(cl::EnqueueArgs(queue[iter],
                         cl::NDRange(100)),
                         d_arr, // and some other buffers.
         )

Now at the kernel side. I have one global id. that is:

int idx = get_global_id(0);

The way I want my kernel is to work is the following:

  • Each of the 100 work groups will take care of one element each.
  • There are some rules with using which each work group is updating the element of the array. eg:

    if (arr[idx] < 5) {
        arr[idx] = 10; // a very simple example.
    }
    

For most of the parts, it is okay. But then there is one point where I want to interchange and where I want the threads/work items to communicate with each other. At that point, they don't seem to work and they don't seem to communicate.

eg:

if(arr[idx] < someNumber) {
    arr[idx] = arr[idx + 1];
}

At this point, nothing seems to work. I tried to implement a for loop and to create a barrier

barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

but it also doesn't work. It doesn't change the values of the array elements.

I have the following questions:

1. Why doesn't it work? Is my implementation wrong? The threads seem to update their own indexed array element correctly. But when it comes to communication between them, they don't work. Why?

2. Is my implementation of the barriers and letting only one work item wrong? Is there a better way to let one item take care of this part while the other items are waiting for this one to finish?

2

2 Answers

2
votes

The code you wrote is serial:

if(arr[idx] < someNumber) {
   arr[idx] = arr[idx + 1];
}

The worker N will take the result of worker N-1, N-1 the results of N-2, and so on. So it means worker N needs to wait for all the others to complete. Which means the code is not parallel and will never be. You are far better computing that with a CPU than a GPU.

OpenCL design model, allows you to run multiple work items in parallel but the synchronization model only allows to synchronize inside the work group. If you need global sync, is a clear sign that your algorithm is not for OpenCL.


Now, if I assume you just want the value of the last element. And what you really want is to perform a "sum" of all the array. Then, this is a reduction problem, and it is possible to perform it in log(N) time by parallelization in this fashion:

1st step,   array[x] = array[x] + array[N/2+x] (x from 0 to N/2)
2nd step,   array[x] = array[x] + array[N/4+x] (x from 0 to N/4)
...
log(N) passes

Each step will be a separate kernel, and therefore ensures all work items have finished before starting the next batch.

Another faster option is to perform reduction inside the work group, so if the work group size is 256, you can sum groups of 256 together in each pass. Which is faster than just reducing by 2 in each pass.

1
votes

I suspect that your problem represents a problem that has limited ability to be made parallel, and thus is a poor fit for any kind of GPGPU solution.

Consider the following array of elements:

1 5 2 6 5 3 6 7 2 8 1 8 3 4 2

Now suppose we perform the following transformation on this data:

//A straightforward, serially-executed iteration on the entire array.
for(int i = 0; i < arr.size() - 1; i++) {
    if(arr[i] < 5) arr[i] = arr[i + 1];
}

The result will then be

5 5 6 6 5 6 6 7 8 8 8 8 4 2 2

But what happens if the for loop executes in reverse?

for(int i = arr.size() - 2; i >= 0; i--) {
    if(arr[i] < 5) arr[i] = arr[i + 1];
}

The result will then be

5 5 6 6 5 6 6 7 8 8 8 8 2 2 2

Note how the third-to-last number is different depending on the order of execution. My example input doesn't change much, but if your code has lots of numbers below the threshold chosen, it could completely change the entire array! Because GPGPU APIs don't make guarantees about the order of execution of individual work items—which means your order of execution could be like the first for-loop I wrote, or it could be like the second for-loop I wrote, or it could be a completely randomly shuffled order—you've written non-deterministic code, and the only way to make it deterministic is to use so many barriers that you're guaranteeing sequential ordering, at which point, there's literally no reason to be using a GPGPU API in the first place.

You could write something like the following instead, which would be deterministic:

if(arr[i] < 5) output[i] = arr[i + 1];
else output[i] = arr[i];

But that might require a reconsideration of your design constraints. I don't know, as I don't know what your program is ultimately doing.

Either way though, you need to spend some time reconsidering what you're actually trying to do.