
I am currently trying to implement an OpenCL kernel. The kernel is supposed to output the number of previously calculated elements divided by the total number of elements, remapped to a value from 0 to 255.

The kernel runs in a single work group with 256 work items where LX is the local ID:

#define LX get_local_id(0)

kernel void reduceStatistic(global int *inout, int nr_workgroups, int nr_pixels)
{
    int i = 1;
    for (; i < nr_workgroups; i++)
    {
        inout[LX] += inout[LX + i * 256];
    }

    inout[LX] = (int)floor(((float)inout[LX] / (float)nr_pixels) * 256.0f);
}

The loop before the remapping operation is cleanup after a previous calculation on the same buffer.
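To check the kernel's output on the host, the cleanup loop can be mirrored on the CPU. This is only a minimal sketch in plain C, assuming the buffer holds nr_workgroups * 256 ints as allocated in the host code; the function name sum_partials_reference and the GROUP_SIZE constant are made up for illustration.

```c
#define GROUP_SIZE 256  /* assumed work-group size, matching the kernel launch */

/* CPU mirror of the kernel's cleanup loop: for every local ID lx, add the
 * values of blocks 1..nr_workgroups-1 onto the first block of 256 ints. */
void sum_partials_reference(int *inout, int nr_workgroups)
{
    for (int lx = 0; lx < GROUP_SIZE; lx++) {
        for (int i = 1; i < nr_workgroups; i++) {
            inout[lx] += inout[lx + i * GROUP_SIZE];
        }
    }
}
```

Running this on a copy of the buffer taken before the kernel launch gives values to compare against the first 256 ints the kernel produces, before remapping.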

After the cleanup, the first element of inout is 17176 and nr_pixels is 160000, so the calculation above should yield 27. The code, however, returns 6.
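The expected value can be reproduced on the CPU with the same arithmetic as the kernel's last line. This is a sketch; remap_reference is a hypothetical helper name, and it assumes non-negative counts so that the integer cast behaves like floor().

```c
/* Mirrors the kernel's remapping line on the CPU. The cast truncates toward
 * zero, which equals floor() for the non-negative counts assumed here. */
int remap_reference(int count, int nr_pixels)
{
    return (int)(((float)count / (float)nr_pixels) * 256.0f);
}
```

For the numbers above, remap_reference(17176, 160000) evaluates 0.10735 * 256 = 27.48 and returns 27. With the factor 256.0f, a count equal to nr_pixels maps to 256 rather than 255.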

The relevant host-side code is as follows:

// nr_workgroups is of type int
cl_mem outputBuffer = clCreateBuffer(mgr->context, CL_MEM_READ_WRITE, nr_workgroups * 256 * sizeof(cl_int), NULL, NULL);

// another kernel writes into outputBuffer

// set kernel arguments
clSetKernelArg(mgr->reduceStatisticKernel, 0, sizeof(outputBuffer), &outputBuffer);
clSetKernelArg(mgr->reduceStatisticKernel, 1, sizeof(cl_int), &nr_workgroups);
clSetKernelArg(mgr->reduceStatisticKernel, 2, sizeof(cl_int), &imgSeqSize);

size_t global_work_size_statistics[1] = { 256 };
size_t local_work_size_statistics[1] = { 256 };

// run the kernel
clEnqueueNDRangeKernel(mgr->commandQueue, mgr->reduceStatisticKernel, 1, NULL, global_work_size_statistics, local_work_size_statistics, 0, NULL, NULL);

// read result
cl_int *reducedResult = new cl_int[256];
clEnqueueReadBuffer(mgr->commandQueue, outputBuffer, CL_TRUE, 0, 256 * sizeof(cl_int), reducedResult, 0, NULL, NULL);

Help much appreciated! (:

Are you sure you really mean i <= nr_workgroups? Shouldn't that be i < nr_workgroups? And shouldn't the stride in inout[LX + i * 265]; be 256, not 265? - pmdj
the 265 there is wrong, the <= should be right. thanks! :D this does not change the issues with the result though, it outputs 0 now :/ - vakyas
Whether <= is right depends on how big inout is, which you haven't specified in the question. I assumed it might be nr_workgroups * 256 elements, in which case < would be correct. - pmdj
Could you check whether you accidentally created an out-of-order queue? Just making sure that the kernel that fills the buffer actually completes first. Also, if you run 2 WGs, you'll be overwriting memory. You may have meant to use get_global_id() - silverclaw
commandQueue = clCreateCommandQueue(context, devices[deviceNo], CL_QUEUE_PROFILING_ENABLE, &status); - vakyas

1 Answer


We established in the comments that the global buffer index calculation is wrong:

    inout[LX] += inout[LX + i * 265];
                      ----------^^^
                      Should be 256

Going out of range on a buffer leads to undefined behaviour, so this is always one of the prime culprits to look for.
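The range problem can be checked with a little arithmetic on the host. The following is a sketch, assuming 256 work items and a buffer of nr_workgroups * 256 ints as in the question's host code; max_index_read and reads_in_range are hypothetical helper names.

```c
/* Largest buffer index the cleanup loop reads, for local IDs 0..255 and
 * i running from 1 to nr_workgroups - 1. */
long max_index_read(int nr_workgroups, int stride)
{
    return 255L + (long)(nr_workgroups - 1) * stride;
}

/* Returns 1 if every read stays inside a buffer of nr_workgroups * 256 ints. */
int reads_in_range(int nr_workgroups, int stride)
{
    return max_index_read(nr_workgroups, stride) < (long)nr_workgroups * 256;
}
```

With four work groups the buffer has 1024 elements. A stride of 256 tops out at index 1023, the last valid element, while a stride of 265 reaches index 1050.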