Enqueuing to device side queue in a loop

Question

In my code I have kernelA and kernelB. kernelB depends on kernelA's results. I iterate over these kernels thousands of times, and each iteration depends on the results of the previous iteration.

The host-side enqueue code snippet looks like this:

for(int x = 0; x < iterations; ++x)
{
    queue.enqueueNDRangeKernel(kernelA, cl::NullRange, cl::NDRange(3*256, 1), cl::NDRange(256, 1));
    queue.enqueueNDRangeKernel(kernelB, cl::NullRange, cl::NDRange(256, 1), cl::NDRange(256, 1));
}
queue.finish();

The above code is working perfectly fine.

Now I want to port the above code to use device-side enqueue, and I'm facing issues on an AMD GPU. The kernel code:

__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void kernelA(...){}

__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void kernelB(...){}

__attribute__((reqd_work_group_size(1, 1, 1)))
__kernel void kernelLauncher(...)
{
    queue_t default_queue = get_default_queue();
    clk_event_t ev1, ev2;

    for (int x = 0; x < iterations; ++x)
    {
        void(^fnKernelA)(void) = ^{ kernelA(
        ... // kernel params come here
            ); };

        if (x == 0)
        {
            enqueue_kernel(default_queue,
                CLK_ENQUEUE_FLAGS_NO_WAIT,
                ndrange_1D(3 * 256, 256),
                0, NULL, &ev1,
                fnKernelA);
        }
        else
        {
            enqueue_kernel(default_queue,
                CLK_ENQUEUE_FLAGS_NO_WAIT,
                ndrange_1D(3 * 256, 256),
                1, &ev2, &ev1, // ev2 sets dependency on kernelB here
                fnKernelA);
        }

        void(^fnKernelB)(void) = ^{ kernelB(
        ... // kernel params come here
            ); };

        enqueue_kernel(default_queue,
            CLK_ENQUEUE_FLAGS_NO_WAIT,
            ndrange_1D(256, 256),
            1, &ev1, &ev2,  // ev1 sets dependency on kernelA here
            fnKernelB);
    }
}

The host code:

queue.enqueueNDRangeKernel(kernelLauncher, cl::NullRange, cl::NDRange(1, 1), cl::NDRange(1, 1));

The issue is that the results returned from the kernel are wrong when it runs on an AMD GPU. Sometimes the kernel also hangs, which suggests that something is wrong with kernel synchronization. The same code works fine on an Intel CPU; I'm not sure whether that is luck or whether there is something wrong with the synchronization points in the kernel.

Update: enqueue_kernel is failing on the 1025th enqueue command with error -1. I tried to get a more detailed error (added -g during the build), but to no avail. I increased the device queue size to the maximum, but that didn't change anything (it still fails on the 1025th enqueue command). Removing the contents of kernelA and kernelB didn't change anything either. Any thoughts?

In your first code snippet (enqueuing from the host), the work-group size for kernelA is set to 3*256. In the second snippet (enqueuing from the device), the work-group size of kernelA is set to 256. Are these supposed to be the same? — jprice
@jprice kernelA and kernelB are exactly the same in both versions. kernelA is launched as 3 work-groups of 256 work-items in both cases. Note that the kernel-side enqueue is set to ndrange_1D(3 * 256, 256). — doqtor
OK, but the code for your host enqueue at the top of your question launches one work-group of 3*256 work-items for kernelA, instead of 3 work-groups of 256 work-items. — jprice
@jprice, yep, you are right, that was just a typo I made when preparing the snippet, sorry about that. — doqtor
Fair enough. Can't see anything else wrong with the code. Does it produce correct results if you just run a single iteration? Have you tried checking whether the enqueue_kernel functions are always returning CLK_SUCCESS? — jprice

kurtzmarc · Accepted Answer · 2020-03-05T13:52:59

Answering an old question to hopefully save someone time in the future. If you query CL_DEVICE_MAX_ON_DEVICE_EVENTS on your device, it will return 1024. That is the maximum number of events you can queue "on device", which is why it fails on the 1025th enqueue. If you run your OpenCL code on a different GPU (like Intel), you may be lucky enough to get a real error code back, which will be CLK_DEVICE_QUEUE_FULL or -161. AMD ignores the -g option and never seems to return anything but -1 on a failed on-device enqueue.
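
To make the arithmetic concrete, here is a minimal sketch in plain C++ (no OpenCL calls) that models the event accounting described above. It assumes each enqueue_kernel call that returns an event takes one slot from the on-device event pool, and that no slot is freed while the launcher loop is running, which is what the question's loop would do since it never calls release_event. The struct and function names, and the iteration count of 1000, are made up for illustration.

```cpp
#include <cassert>

// Hypothetical model, not an OpenCL API: it only counts on-device event slots,
// one slot per enqueue that returns an event.
struct FirstFailure {
    int enqueue_number;  // 1-based index of the first enqueue that fails, 0 if none fails
    int iteration;       // 0-based loop iteration in which it fails, -1 if none fails
};

// Walks the launcher loop and reports the first enqueue that finds the event
// pool exhausted. Assumption: events are never released during the loop.
FirstFailure first_failed_enqueue(int iterations,
                                  int enqueues_per_iteration,
                                  int max_events) {
    assert(iterations >= 0 && enqueues_per_iteration >= 0 && max_events >= 0);

    int events_in_use = 0;
    int enqueue_number = 0;
    for (int x = 0; x < iterations; ++x) {
        for (int k = 0; k < enqueues_per_iteration; ++k) {
            ++enqueue_number;
            if (events_in_use == max_events) {
                // No free slot left: this is the enqueue that would fail.
                return {enqueue_number, x};
            }
            ++events_in_use;  // slot taken and, in this model, never returned
        }
    }
    return {0, -1};  // every enqueue found a free slot
}
```

With the hypothetical values first_failed_enqueue(1000, 2, 1024), the model reports enqueue number 1025, in loop iteration 512, which matches the failure in the question's update. OpenCL C 2.0 provides release_event for events returned by enqueue_kernel; whether calling it in the loop is enough to stay under the limit on a particular device is not something this thread confirms.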
