I'm writing an OpenCL program, but my global work size is not a multiple of my local work size. In OpenCL, the global work size must be divisible by the local work size. One solution I read about is to round the global work size up to the next multiple of the chosen local work size by adding a few extra work-items that do nothing.
For example, say the local work size is 4 and the global work size is 62 (there are 62 elements that the kernel needs to operate on).
The idea here would be to add 2 more work-items that simply idle, in order to make global work size 64. Thus, as 64 is divisible by 4, all is well.
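The rounding step itself can be sketched on the host side as plain integer arithmetic. This is only a minimal sketch: `round_up_global_size` is a hypothetical helper name, not part of the OpenCL API, and it assumes the local work size is greater than zero.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the OpenCL API): rounds global_size up
   to the nearest multiple of local_size. Assumes local_size > 0. */
static size_t round_up_global_size(size_t global_size, size_t local_size)
{
    size_t remainder = global_size % local_size;
    if (remainder == 0) {
        /* Already a multiple of local_size, so no padding is needed. */
        return global_size;
    }
    /* Add just enough work-items to reach the next multiple. */
    return global_size + (local_size - remainder);
}
```

With the numbers above, `round_up_global_size(62, 4)` gives 64. That padded value is what would be passed as the global work size to `clEnqueueNDRangeKernel`, which leaves two surplus work-items with global IDs 62 and 63.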
Any ideas on how exactly to implement idle work-items like this? If I simply increase the global work size to 64, I get two extra executions of my kernel, which change the result of the computation and ultimately produce an incorrect result.