
I know OpenCL is fairly idle these days - especially NVIDIA's implementation, which sits on top of CUDA. That said, I think I've found a significant bug in NVIDIA's driver and I'd like to see if anybody else notices the same. On Linux, with platform version "OpenCL 1.2 CUDA 10.1.0" and the C++ bindings, I've been having all kinds of issues with NDRange order, and I finally have a simple kernel that can definitively reproduce the issue:

void kernel test()
{
    printf("G0:%d   G1:%d   G2:%d   L0:%d   L1:%d   L2:%d\n",
        get_global_id(0),
        get_global_id(1),
        get_global_id(2),
        get_local_id(0),
        get_local_id(1),
        get_local_id(2));
}

If I enqueue this kernel with 3 dimensions: global (4,3,2) and local (1,1,1):

queue.enqueueNDRangeKernel(kernel, cl::NullRange, 
                cl::NDRange(4, 3, 2), 
                cl::NDRange(1, 1, 1), 
                NULL, events);

on AMD/Intel it correctly produces the following output (the ordering is random; it has been sorted here for clarity):

G0:0   G1:0   G2:0   L0:0   L1:0   L2:0
G0:0   G1:0   G2:1   L0:0   L1:0   L2:0
G0:0   G1:1   G2:0   L0:0   L1:0   L2:0
G0:0   G1:1   G2:1   L0:0   L1:0   L2:0
G0:0   G1:2   G2:0   L0:0   L1:0   L2:0
G0:0   G1:2   G2:1   L0:0   L1:0   L2:0
G0:1   G1:0   G2:0   L0:0   L1:0   L2:0
G0:1   G1:0   G2:1   L0:0   L1:0   L2:0
G0:1   G1:1   G2:0   L0:0   L1:0   L2:0
G0:1   G1:1   G2:1   L0:0   L1:0   L2:0
G0:1   G1:2   G2:0   L0:0   L1:0   L2:0
G0:1   G1:2   G2:1   L0:0   L1:0   L2:0
G0:2   G1:0   G2:0   L0:0   L1:0   L2:0
G0:2   G1:0   G2:1   L0:0   L1:0   L2:0
G0:2   G1:1   G2:0   L0:0   L1:0   L2:0
G0:2   G1:1   G2:1   L0:0   L1:0   L2:0
G0:2   G1:2   G2:0   L0:0   L1:0   L2:0
G0:2   G1:2   G2:1   L0:0   L1:0   L2:0
G0:3   G1:0   G2:0   L0:0   L1:0   L2:0
G0:3   G1:0   G2:1   L0:0   L1:0   L2:0
G0:3   G1:1   G2:0   L0:0   L1:0   L2:0
G0:3   G1:1   G2:1   L0:0   L1:0   L2:0
G0:3   G1:2   G2:0   L0:0   L1:0   L2:0
G0:3   G1:2   G2:1   L0:0   L1:0   L2:0

This follows the spec. But if I enqueue the exact same kernel with the same dimensions on NVIDIA, I get the following output:

G0:0   G1:0   G2:0   L0:0   L1:0   L2:0
G0:0   G1:0   G2:0   L0:0   L1:1   L2:0
G0:0   G1:0   G2:1   L0:0   L1:0   L2:0
G0:0   G1:0   G2:1   L0:0   L1:1   L2:0
G0:0   G1:0   G2:2   L0:0   L1:0   L2:0
G0:0   G1:0   G2:2   L0:0   L1:1   L2:0
G0:1   G1:0   G2:0   L0:0   L1:0   L2:0
G0:1   G1:0   G2:0   L0:0   L1:1   L2:0
G0:1   G1:0   G2:1   L0:0   L1:0   L2:0
G0:1   G1:0   G2:1   L0:0   L1:1   L2:0
G0:1   G1:0   G2:2   L0:0   L1:0   L2:0
G0:1   G1:0   G2:2   L0:0   L1:1   L2:0
G0:2   G1:0   G2:0   L0:0   L1:0   L2:0
G0:2   G1:0   G2:0   L0:0   L1:1   L2:0
G0:2   G1:0   G2:1   L0:0   L1:0   L2:0
G0:2   G1:0   G2:1   L0:0   L1:1   L2:0
G0:2   G1:0   G2:2   L0:0   L1:0   L2:0
G0:2   G1:0   G2:2   L0:0   L1:1   L2:0
G0:3   G1:0   G2:0   L0:0   L1:0   L2:0
G0:3   G1:0   G2:0   L0:0   L1:1   L2:0
G0:3   G1:0   G2:1   L0:0   L1:0   L2:0
G0:3   G1:0   G2:1   L0:0   L1:1   L2:0
G0:3   G1:0   G2:2   L0:0   L1:0   L2:0
G0:3   G1:0   G2:2   L0:0   L1:1   L2:0

It seems like NVIDIA's interpretation of the global/local dimensions is interleaved, which doesn't match the spec. This doesn't seem to involve the C++ bindings either. With a local size of (1,1,1), the local IDs should never be anything but zero, yet L1 sometimes comes out as 1, and get_global_id(1) is always zero.

I know NVIDIA doesn't care much for OpenCL, but this seems like a fairly major issue. Has anyone else encountered something like this? This isn't a sync issue with printf - I've noticed it in actual data use cases and built this kernel only to demonstrate it.

Correction: this IS a synchronization issue with printf. – BoeroBoy

1 Answer


Although it's hard to verify this in detail, I'll post it as an answer, because from my observations, it seems to explain the issue:

tl;dr: The cause is almost certainly the lack of synchronization in printf.


First of all, I observed the same behavior as you: on AMD, the output seems to be right; on NVIDIA, it seems to be irritatingly wrong. So I was curious and extended the kernel to also print the get_local_size:

void kernel test()
{
    printf("G0:%d   G1:%d   G2:%d   L0:%d   L1:%d   L2:%d  S0:%d  S1:%d  S2:%d\n", 
        get_global_id(0),
        get_global_id(1),
        get_global_id(2),
        get_local_id(0),
        get_local_id(1),
        get_local_id(2),
        get_local_size(0),
        get_local_size(1),
        get_local_size(2));
}

Now, the get_local_id certainly must be smaller than the corresponding get_local_size, otherwise most kernels would just crash. On AMD, the output was nice and clean:

platform AMD Accelerated Parallel Processing
device Spectre
G0:0   G1:0   G2:0   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:1   G1:0   G2:0   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:2   G1:0   G2:0   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:3   G1:0   G2:0   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:0   G1:1   G2:0   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:1   G1:1   G2:0   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:2   G1:1   G2:0   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:3   G1:1   G2:0   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:0   G1:2   G2:0   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:1   G1:2   G2:0   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:2   G1:2   G2:0   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:3   G1:2   G2:0   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:0   G1:0   G2:1   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:1   G1:0   G2:1   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:2   G1:0   G2:1   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:3   G1:0   G2:1   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:0   G1:1   G2:1   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:1   G1:1   G2:1   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:2   G1:1   G2:1   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:3   G1:1   G2:1   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:0   G1:2   G2:1   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:1   G1:2   G2:1   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:2   G1:2   G2:1   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1
G0:3   G1:2   G2:1   L0:0   L1:0   L2:0  S0:1  S1:1  S2:1

On NVIDIA, the output was

platform NVIDIA CUDA
device GeForce GTX 970
G0:3   G1:0   G2:2   L0:0   L1:0   L2:0  S0:0  S1:0  S2:0
G0:3   G1:0   G2:1   L0:0   L1:0   L2:0  S0:0  S1:0  S2:0
G0:3   G1:0   G2:0   L0:0   L1:0   L2:0  S0:0  S1:0  S2:0
G0:0   G1:0   G2:2   L0:0   L1:1   L2:0  S0:0  S1:0  S2:0
G0:0   G1:0   G2:1   L0:0   L1:1   L2:0  S0:0  S1:0  S2:0
G0:2   G1:0   G2:0   L0:0   L1:0   L2:0  S0:0  S1:0  S2:0
G0:2   G1:0   G2:1   L0:0   L1:0   L2:0  S0:0  S1:0  S2:0
G0:2   G1:0   G2:2   L0:0   L1:0   L2:0  S0:0  S1:0  S2:0
G0:1   G1:0   G2:1   L0:0   L1:0   L2:0  S0:0  S1:0  S2:0
G0:3   G1:0   G2:0   L0:0   L1:1   L2:0  S0:0  S1:0  S2:0
G0:1   G1:0   G2:0   L0:0   L1:0   L2:0  S0:0  S1:0  S2:0
G0:3   G1:0   G2:1   L0:0   L1:1   L2:0  S0:0  S1:0  S2:0
G0:0   G1:0   G2:2   L0:0   L1:0   L2:0  S0:0  S1:0  S2:0
G0:1   G1:0   G2:2   L0:0   L1:0   L2:0  S0:0  S1:0  S2:0
G0:3   G1:0   G2:2   L0:0   L1:1   L2:0  S0:0  S1:0  S2:0
G0:0   G1:0   G2:1   L0:0   L1:0   L2:0  S0:0  S1:0  S2:0
G0:2   G1:0   G2:1   L0:0   L1:1   L2:0  S0:0  S1:0  S2:0
G0:0   G1:0   G2:0   L0:0   L1:1   L2:0  S0:0  S1:0  S2:0
G0:2   G1:0   G2:0   L0:0   L1:1   L2:0  S0:0  S1:0  S2:0
G0:0   G1:0   G2:0   L0:0   L1:0   L2:0  S0:0  S1:0  S2:0
G0:2   G1:0   G2:2   L0:0   L1:1   L2:0  S0:0  S1:0  S2:0
G0:1   G1:0   G2:2   L0:0   L1:1   L2:0  S0:0  S1:0  S2:0
G0:1   G1:0   G2:1   L0:0   L1:1   L2:0  S0:0  S1:0  S2:0
G0:1   G1:0   G2:0   L0:0   L1:1   L2:0  S0:0  S1:0  S2:0

Now, that cannot be right: The local work size is always 0!

After some further tests (e.g. with 2D kernels and different numbers), the output generally did not seem to make any sense at all. So I tried this kernel:

void kernel test()
{
    printf("G0:%d\n", get_global_id(0));
    printf("G1:%d\n", get_global_id(1));
    printf("G2:%d\n", get_global_id(2));
    printf("L0:%d\n", get_local_id(0));
    printf("L1:%d\n", get_local_id(1));
    printf("L2:%d\n", get_local_id(2));
    printf("S0:%d\n", get_local_size(0));
    printf("S1:%d\n", get_local_size(1));
    printf("S2:%d\n", get_local_size(2));
}

Then, on NVIDIA, the output is:

platform NVIDIA CUDA
device GeForce GTX 970
G0:1
G0:1
G0:1
G0:2
G0:2
G0:2
G0:2
G0:2
G0:3
G0:2
G0:3
G0:3
G0:0
G0:3
G0:3
G0:0
G0:0
G0:3
G0:0
G0:0
G0:0
G0:1
G0:1
G0:1
G1:2
G1:2
G1:0
G1:0
G1:1
G1:2
G1:2
G1:1
G1:1
G1:1
G1:0
G1:0
G1:2
G1:1
G1:0
G1:0
G1:2
G1:1
G1:1
G1:0
G1:2
G1:2
G1:0
G1:1
G2:0
G2:0
G2:1
G2:1
G2:0
G2:0
G2:1
G2:0
G2:0
G2:0
G2:0
G2:0
G2:1
G2:1
G2:0
G2:1
G2:1
G2:1
G2:1
G2:0
G2:1
G2:0
G2:1
G2:1
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L0:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L1:0
L2:0
L1:0
L1:0
L1:0
L1:0
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
S0:1
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
L2:0
S0:1
S0:1
S0:1
S0:1
S0:1
S0:1
S0:1
S0:1
S0:1
S0:1
S0:1
S0:1
S0:1
S1:1
S0:1
S0:1
S0:1
S0:1
S0:1
S1:1
S0:1
S0:1
S0:1
S0:1
S0:1
S1:1
S1:1
S1:1
S1:1
S1:1
S1:1
S1:1
S1:1
S1:1
S1:1
S2:1
S1:1
S1:1
S1:1
S2:1
S1:1
S1:1
S1:1
S1:1
S1:1
S2:1
S1:1
S1:1
S2:1
S2:1
S1:1
S1:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1
S2:1

The key point is: each individual value is correct! The problem seems to be that putting everything into a single printf call messes up some internal buffer.

This is a pity, of course. It basically makes it impossible to use printf for the only purpose that it could sensibly be used for inside a kernel, namely for debugging...


An aside: the specification remains a bit hard to interpret on this point - at least when it comes to deciding whether the observed behavior is "right" or "wrong". From the Khronos documentation of printf:

In the case that printf is executed from multiple work-items concurrently, there is no guarantee of ordering with respect to written data. For example, it is valid for the output of a work-item with a global id (0,0,1) to appear intermixed with the output of a work-item with a global id (0,0,4) and so on.

The NVIDIA documentation of the CUDA printf implementation also contains some disclaimers and talks about buffers that may be overwritten, but mapping this (on the technical level of a specification) to the observed OpenCL behavior is difficult...