2
votes

I'm writing an OpenCL program, but my global work size is not a multiple of my local work size. OpenCL requires the global work size to be divisible by the local work size, so one solution I read about is to add a few extra work-items that do nothing, rounding the global work size up to a multiple of the chosen local work size.

For example, say the local work size is 4 and the global work size is 62 (there are 62 elements that the kernel needs to operate on).

The idea would be to add 2 more work-items that simply idle, making the global work size 64. Since 64 is divisible by 4, all is well.

How exactly do I implement idle work-items like this? If I simply increase the global work size to 64, the two extra executions of my kernel change the result of the computation, and the program ultimately produces a wrong result.

2
You would have to add bounds checks inside the kernel so that only those threads whose global index falls inside the valid data range perform computation. - sgarizvi

2 Answers

2
votes

Rounding the global work size up to a multiple of the local work size is a standard approach. In that case, the kernel needs a bounds check so that only work-items whose global index falls inside the valid data range perform computation. This can be done by passing the actual data size as a kernel parameter and comparing it with the work-item's global index. An example kernel looks like this:

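__kernel void example_kernel(__global int* input, __global int* output, int dataSize)
{
    // Global index of this work-item in dimension 0.
    int index = get_global_id(0);

    // Padding work-items (index >= dataSize) skip the body and do nothing.
    if (index < dataSize)
    {
        /*
        rest of the kernel...
        */
    }
}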
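
On the host side, the rounded-up global work size can be computed with a small helper. This is only a sketch: `round_up_global` is a hypothetical name, not part of the OpenCL API, and it assumes a one-dimensional range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL function): returns the smallest
   multiple of local_size that is greater than or equal to global_size. */
size_t round_up_global(size_t global_size, size_t local_size)
{
    assert(local_size > 0);  /* a local work size of 0 is meaningless here */
    size_t remainder = global_size % local_size;
    if (remainder == 0)
        return global_size;  /* already a multiple, no padding needed */
    return global_size + (local_size - remainder);
}
```

With the numbers from the question, `round_up_global(62, 4)` gives 64. The padded value would be passed as the global work size to `clEnqueueNDRangeKernel`, while the real element count (62) is passed to the kernel as `dataSize`.
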
1
votes

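From OpenCL 2.0 onward, the global work size no longer has to be a multiple of the local work size. Note that this relies on non-uniform work-group support, which is an optional feature in OpenCL 3.0.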

It is better to pass NULL as the local work size, which lets the implementation choose one, unless specifying it yourself gives a real performance benefit.

Alternatively, you can round the global work size (gws) down to a multiple of the local work size (lws) and do the extra processing in an edge work-item:

gws = (old_gws/lws) * lws;
leftover = old_gws - gws;

In the kernel (with leftover passed in as an argument):

if (get_global_id(0) == get_global_size(0) - 1)
{
    // do the computation for the remaining elements (leftover)
}
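
The round-down scheme can be checked on the CPU before moving it into a kernel. The sketch below is plain C, not OpenCL: the loop over `gid` stands in for the NDRange, and `process_element`, `work_item` and `run_rounded_down` are hypothetical names, with a doubling operation assumed as the per-element work. It also covers a case the snippet above does not mention: when the element count is smaller than lws, gws rounds down to 0 and no work-item would run at all.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-element work, for illustration only: double the input. */
void process_element(const int *input, int *output, size_t i)
{
    output[i] = 2 * input[i];
}

/* CPU stand-in for one work-item. gid and gsize play the roles of
   get_global_id(0) and get_global_size(0). */
void work_item(const int *input, int *output,
               size_t gid, size_t gsize, size_t leftover)
{
    process_element(input, output, gid);

    /* The last work-item also handles the tail elements. */
    if (gid == gsize - 1) {
        for (size_t k = 0; k < leftover; ++k)
            process_element(input, output, gsize + k);
    }
}

/* Simulates a launch over n elements with local work size lws. */
void run_rounded_down(const int *input, int *output, size_t n, size_t lws)
{
    assert(lws > 0);
    size_t gws = (n / lws) * lws;  /* rounded down to a multiple of lws */
    size_t leftover = n - gws;     /* always less than lws */

    if (gws == 0) {
        /* n < lws: nothing would be launched, so the tail is handled here. */
        for (size_t i = 0; i < n; ++i)
            process_element(input, output, i);
        return;
    }

    for (size_t gid = 0; gid < gws; ++gid)
        work_item(input, output, gid, gws, leftover);
}
```

One trade-off of this design is that the last work-item does up to lws - 1 extra elements serially, so it may finish later than the others in its work-group.
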