I know OpenCL development is fairly idle these days - especially in NVidia's implementation, which ships with CUDA. That said, I think I've found a significant bug in NVidia's driver and I'd like to see if anybody else can reproduce it. On Linux, with platform version "OpenCL 1.2 CUDA 10.1.0" and the C++ bindings, I've been having all kinds of issues with NDRange ordering, and I finally have a simple kernel that reproduces the issue definitively:
kernel void test()
{
    // get_global_id()/get_local_id() return size_t, so cast to int
    // to match the %d conversion specifier.
    printf("G0:%d G1:%d G2:%d L0:%d L1:%d L2:%d\n",
           (int)get_global_id(0),
           (int)get_global_id(1),
           (int)get_global_id(2),
           (int)get_local_id(0),
           (int)get_local_id(1),
           (int)get_local_id(2));
}
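For reference, the OpenCL 1.2 spec's execution model ties these IDs together in every dimension d:

    global_id(d) = group_id(d) * local_size(d) + local_id(d) + global_offset(d)

so with a local size of (1,1,1), get_local_id() must be zero in every dimension for every work item.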
If I enqueue this kernel with 3 dimensions: global (4,3,2) and local (1,1,1):
queue.enqueueNDRangeKernel(kernel, cl::NullRange,
cl::NDRange(4, 3, 2),
cl::NDRange(1, 1, 1),
NULL, events);
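For anyone who wants to try this, here's a minimal standalone repro sketch. It assumes the legacy C++ bindings header <CL/cl.hpp>, grabs the first GPU of the first platform, and omits error checking for brevity:

#include <CL/cl.hpp>
#include <string>
#include <vector>

static const std::string src = R"(
kernel void test()
{
    printf("G0:%d G1:%d G2:%d L0:%d L1:%d L2:%d\n",
           (int)get_global_id(0), (int)get_global_id(1), (int)get_global_id(2),
           (int)get_local_id(0), (int)get_local_id(1), (int)get_local_id(2));
}
)";

int main()
{
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);

    cl::Context context(devices);
    cl::CommandQueue queue(context, devices[0]);

    cl::Program program(context, src);
    program.build(devices);
    cl::Kernel kernel(program, "test");

    queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                               cl::NDRange(4, 3, 2),
                               cl::NDRange(1, 1, 1));
    queue.finish();  // wait so all printf output gets flushed
    return 0;
}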
on AMD and Intel it correctly prints all 24 work items (the output arrives in random order; sorted here for clarity):
G0:0 G1:0 G2:0 L0:0 L1:0 L2:0
G0:0 G1:0 G2:1 L0:0 L1:0 L2:0
G0:0 G1:1 G2:0 L0:0 L1:0 L2:0
G0:0 G1:1 G2:1 L0:0 L1:0 L2:0
G0:0 G1:2 G2:0 L0:0 L1:0 L2:0
G0:0 G1:2 G2:1 L0:0 L1:0 L2:0
G0:1 G1:0 G2:0 L0:0 L1:0 L2:0
G0:1 G1:0 G2:1 L0:0 L1:0 L2:0
G0:1 G1:1 G2:0 L0:0 L1:0 L2:0
G0:1 G1:1 G2:1 L0:0 L1:0 L2:0
G0:1 G1:2 G2:0 L0:0 L1:0 L2:0
G0:1 G1:2 G2:1 L0:0 L1:0 L2:0
G0:2 G1:0 G2:0 L0:0 L1:0 L2:0
G0:2 G1:0 G2:1 L0:0 L1:0 L2:0
G0:2 G1:1 G2:0 L0:0 L1:0 L2:0
G0:2 G1:1 G2:1 L0:0 L1:0 L2:0
G0:2 G1:2 G2:0 L0:0 L1:0 L2:0
G0:2 G1:2 G2:1 L0:0 L1:0 L2:0
G0:3 G1:0 G2:0 L0:0 L1:0 L2:0
G0:3 G1:0 G2:1 L0:0 L1:0 L2:0
G0:3 G1:1 G2:0 L0:0 L1:0 L2:0
G0:3 G1:1 G2:1 L0:0 L1:0 L2:0
G0:3 G1:2 G2:0 L0:0 L1:0 L2:0
G0:3 G1:2 G2:1 L0:0 L1:0 L2:0
This follows the spec. But if I enqueue the exact same kernel with the same dimensions on NVidia, I get the following output:
G0:0 G1:0 G2:0 L0:0 L1:0 L2:0
G0:0 G1:0 G2:0 L0:0 L1:1 L2:0
G0:0 G1:0 G2:1 L0:0 L1:0 L2:0
G0:0 G1:0 G2:1 L0:0 L1:1 L2:0
G0:0 G1:0 G2:2 L0:0 L1:0 L2:0
G0:0 G1:0 G2:2 L0:0 L1:1 L2:0
G0:1 G1:0 G2:0 L0:0 L1:0 L2:0
G0:1 G1:0 G2:0 L0:0 L1:1 L2:0
G0:1 G1:0 G2:1 L0:0 L1:0 L2:0
G0:1 G1:0 G2:1 L0:0 L1:1 L2:0
G0:1 G1:0 G2:2 L0:0 L1:0 L2:0
G0:1 G1:0 G2:2 L0:0 L1:1 L2:0
G0:2 G1:0 G2:0 L0:0 L1:0 L2:0
G0:2 G1:0 G2:0 L0:0 L1:1 L2:0
G0:2 G1:0 G2:1 L0:0 L1:0 L2:0
G0:2 G1:0 G2:1 L0:0 L1:1 L2:0
G0:2 G1:0 G2:2 L0:0 L1:0 L2:0
G0:2 G1:0 G2:2 L0:0 L1:1 L2:0
G0:3 G1:0 G2:0 L0:0 L1:0 L2:0
G0:3 G1:0 G2:0 L0:0 L1:1 L2:0
G0:3 G1:0 G2:1 L0:0 L1:0 L2:0
G0:3 G1:0 G2:1 L0:0 L1:1 L2:0
G0:3 G1:0 G2:2 L0:0 L1:0 L2:0
G0:3 G1:0 G2:2 L0:0 L1:1 L2:0
It looks like NVidia is interleaving the global/local dimensions somehow, which doesn't match the spec, and it doesn't seem to involve the C++ bindings either. With a local size of (1,1,1) the local ID should never be anything but zero, yet get_local_id(1) reaches 1. Meanwhile get_global_id(1) is always zero even though that dimension has size 3, and get_global_id(2) goes up to 2 even though that dimension only has size 2.
I know NVidia doesn't care much for OpenCL, but this seems like a fairly major issue. Has anyone else encountered something like this? This isn't a synchronization issue with printf: I first noticed it in actual data use cases and built this kernel only to demonstrate it.
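In the meantime, a possible workaround sketch (my own, not vendor-endorsed): enqueue a flat 1D range and decompose the linear index in the kernel yourself, so the driver's dimension mapping never comes into play. The dim0/dim1/dim2 arguments here stand in for whatever your real grid is:

kernel void test_flat(const int dim0, const int dim1, const int dim2)
{
    // One work item per element of the dim0 x dim1 x dim2 grid,
    // enqueued as a 1D range of dim0 * dim1 * dim2 items.
    int lin = (int)get_global_id(0);
    int g0 = lin % dim0;              // fastest-varying coordinate
    int g1 = (lin / dim0) % dim1;
    int g2 = lin / (dim0 * dim1);     // slowest-varying coordinate
    printf("G0:%d G1:%d G2:%d\n", g0, g1, g2);
}

Enqueued with cl::NDRange(4 * 3 * 2) and the dimensions passed via kernel.setArg(), it costs a couple of integer ops per work item but takes the driver's dimension handling out of the picture.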