OpenCL: Reduction examples, and retaining memory objects / converting cuda code to openCL

Question

I've been going through a few examples, reducing an array of elements to one element, without success. Someone posted this on an NVIDIA forum. I have changed from floating point variables to integers.

__kernel void sum(__global const short *A,__global unsigned long  *C,uint size, __local unsigned long *L) {
            unsigned long sum=0;
            for(int i=get_local_id(0);i<size;i+=get_local_size(0))
                    sum+=A[i];
            L[get_local_id(0)]=sum;

            for(uint c=get_local_size(0)/2;c>0;c/=2)
            {
                    barrier(CLK_LOCAL_MEM_FENCE);
                    if(c>get_local_id(0))
                            L[get_local_id(0)]+=L[get_local_id(0)+c];

            }
            if(get_local_id(0)==0)
                    C[0]=L[0];
            barrier(CLK_LOCAL_MEM_FENCE);
}

Does this look right? The third argument "size", is that supposed to be the local work size, or global work size?

I set up my arguments like this,

clSetKernelArg(ocReduce, 0, sizeof(cl_mem), (void*) &DevA);
clSetKernelArg(ocReduce, 1, sizeof(cl_mem), (void*) &DevC); 
clSetKernelArg(ocReduce, 2, sizeof(uint),   (void*) &size);  
clSetKernelArg(ocReduce, 3, LocalWorkSize * sizeof(unsigned long), NULL);

The first argument which is the input, I am trying to retain from the output of the kernel launched before it.

clRetainMemObject(DevA);
clEnqueueNDRangeKernel(hCmdQueue[Plat-1][Dev-1], ocKernel, 1, NULL, &GlobalWorkSize, &LocalWorkSize, 0, NULL, NULL);
//the device memory object DevA now has the data to be reduced

clEnqueueNDRangeKernel(hCmdQueue[Plat-1][Dev-1], ocReduce, 1, NULL, &GlobalWorkSize, &LocalWorkSize, 0, NULL, NULL);
clEnqueueReadBuffer(hCmdQueue[Plat-1][Dev-1],DevRE, CL_TRUE, 0, sizeof(unsigned long)*512,(void*) RE , 0, NULL, NULL);

Today I plan to try and convert the following cuda reduction example into openCL.

__global__ voidreduce1(int*g_idata, int*g_odata){
extern __shared__ intsdata[];

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();


for(unsigned int s=blockDim.x/2; s>0; s>>=1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}

// write result for this block to global mem
if(tid == 0) g_odata[blockIdx.x] = sdata[0];
}

There is a more optimized, (completely unrolled+multiple elements per thread).

http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf

Is this possible using openCL?

Grizzly gave me this advice the other day,

"...use a reduction kernel which operates on n element and reduces them to something like n / 16 (or any other number). Then you iteratively call that kernel until you are down to one element, which is your result"

I want to try this as well, but I don't exactly know where to start, and I want to first just get something to work.

Grizzly Grizzly · Accepted Answer · 2012-01-14T20:27:49

The first reduction code you gave should work as long as only one workgroup is working on the reduction (so get_global_size(0) == get_local_size(0)). In that case the size argument of the kernel would be the number of elements in A (which has no real correlation to either the global or the local worksize). While that is a workable solution, it seems inheriantly wasteful to let most of the gpu idle while doing the reduction, which is precisely why I proposed iteratively calling a reduction kernel. This would be made possible with only slight modifications to the code:

__kernel void sum(__global const short *A, __global unsigned long  *C, uint size, __local unsigned long *L) {
        unsigned long sum=0;
        for(int i=get_global_id(0); i < size; i += get_global_size(0))
                sum += A[i];
        L[get_local_id(0)]=sum;

        for(uint c=get_local_size(0)/2;c>0;c/=2)
        {
                barrier(CLK_LOCAL_MEM_FENCE);
                if(c>get_local_id(0))
                        L[get_local_id(0)]+=L[get_local_id(0)+c];

        }
        if(get_local_id(0)==0)
                C[get_group_id(0)]=L[0];
        barrier(CLK_LOCAL_MEM_FENCE);
}

Calling this with a GlobalWorkSize smaller then size (e.g. 4) will reduce the input in A by a factor of 4*LocalWorkSize, which can be iterated (by using the output buffer as input for the next call to sum with a different output buffer. Well actually that isn't quite true, since the second (and all following) iteration needs A to be of type global const unsigned long*, so you will actually need to kernels, but you get the idea.

Concerning the cuda reduction sample: Why would you bother converting it, it works basically exactly like the opencl version I posted above does, except reducing only by a hardcoded size per iteration (2*LocalWorkSize insted of size/GlobalWorkSize*LocalWorkSize).

Personally I use practically the same approach for the reduction, although I have split the kernel in two parts and only use the path using local memory for the last iteration:

__kernel void reduction_step(__global const unsigned long* A, __global unsigned long  * C, uint size) {
        unsigned long sum=0;
        for(int i=start; i < size; i += stride)
                sum += A[i];
        C[get_global_id(0)]= sum;
}

For the final step the full version which does reduction inside the work group was used. Of course you would need a second version of reduction step taking global const short* and this code is an untested adaption of your code (I can't post my own version, regretably). The advantage of this approach is the much lesser complexity of the kernel doing most of the work, and less amount of wasted work due to divergent branches. Which made it a bit faster then the other variant. However I have no results for either the newest compilerversion nor the newest hardware so that point might or might not be correct anymore (though I suspect it might since due to the reduced amount of divergent branches).

Now for the paper you linked in: It is certainly possible to use the optimizations suggested in that paper in opencl, except for the use of templates, which are not supported by opencl, so the blocksizes would have to be hardcoded. Of course the opencl version already does multiple adds per kernel and, if you follow the approach I mentioned above, would not really benefit from unrolling the reduction through local memory, since that is only done in the last step, which shouldn't take a significant part of the whole calculation time for a big enough imput. Furthermore I find the lack of synchronization in the unrolled implementation a bit troublesome. That only works because all threads going in that part belong to the same warp. This however isn't necessary true when executing on any hardware other then current nvidia cards (future nvidia cards, amd cards and cpus (although I think it should work for current amd cards and current cpu implementations, but I wouldn't necessarily count on it)), so I would stay away from that unless I needed the absolute last bit of speed for the reduction (and then still provide a generic version and switch to that if I don't recognize the hardware or something like that).

OpenCL: Reduction examples, and retaining memory objects / converting cuda code to openCL

2 Answers