In my code I have kernelA and kernelB. kernelB depends on kernelA results. I am iterating over this kernels tousand of times and each iteration depends on the results from the previous iteration.
The host side enqueue code snipped is like this:
for(int x = 0; x < iterations; ++x)
{
queue.enqueueNDRangeKernel(kernelA, cl::NullRange, cl::NDRange(3*256, 1), cl::NDRange(256, 1));
queue.enqueueNDRangeKernel(kernelB, cl::NullRange, cl::NDRange(256, 1), cl::NDRange(256, 1));
}
queue.finish();
The above code is working perfectly fine.
Now I want to port the above code to use device side enqueue and I'm facing issues on AMD GPU. The kernel code:
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void kernelA(...){}
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void kernelB(...){}
__attribute__((reqd_work_group_size(1, 1, 1)))
__kernel void kernelLauncher(...)
{
queue_t default_queue = get_default_queue();
clk_event_t ev1, ev2;
for (int x = 0; x < iterations; ++x)
{
void(^fnKernelA)(void) = ^{ kernelA(
... // kernel params come here
); };
if (x == 0)
{
enqueue_kernel(default_queue,
CLK_ENQUEUE_FLAGS_NO_WAIT,
ndrange_1D(3 * 256, 256),
0, NULL, &ev1,
fnKernelA);
}
else
{
enqueue_kernel(default_queue,
CLK_ENQUEUE_FLAGS_NO_WAIT,
ndrange_1D(3 * 256, 256),
1, &ev2, &ev1, // ev2 sets dependency on kernelB here
fnKernelA);
}
void(^fnKernelB)(void) = ^{ kernelB(
... // kernel params come here
); };
enqueue_kernel(default_queue,
CLK_ENQUEUE_FLAGS_NO_WAIT,
ndrange_1D(256, 256),
1, &ev1, &ev2, // ev1 sets dependency on kernelA here
fnKernelB);
}
}
The host code:
queue.enqueueNDRangeKernel(kernelLauncher, cl::NullRange, cl::NDRange(1, 1), cl::NDRange(1, 1));
The issue is that the results returned from the kernel when run on AMD GPU are wrong. Sometimes kernel also hangs which may indicate that there is probably something wrong with kernel synchronization. The same code works fine on Intel CPU, not sure if that is a luck or there is something wrong with synchronization points in the kernel.
Update: enqueue_kernel is failing on 1025th enqueue command with error -1. I tried to get more detailed error (added -g during build) but to no avail. I increased the device queue size to maximum but that didn't change anything (still failing on 1025th enqueue command). Removing content of kernelA and kernelB didn't change anything either. Any thoughts?
3*256. In the second snipped (enqueuing from the device), the work-group size of kernelA is set to256? Are these supposed to be the same? - jpricekernelAandkernelBfor both version is exactly the same.kernelAis launched as 3 workgroups of 256 work items in both cases. Note that kernel side enqueue is set tondrange_1D(3 * 256, 256). - doqtor3*256work-items forkernelA, instead of3work-groups of256work-items. - jpriceenqueue_kernelfunctions are always returningCLK_SUCCESS? - jprice