16
votes

I'm running the OpenCL kernel below with a two-dimensional global work size of 1000000 x 100 and a local work size of 1 x 100.

__kernel void myKernel(
        const int length, 
        const int height, 
        and a bunch of other parameters) {

    //declare some local arrays to be shared by all 100 work item in this group
    __local float LP [length];
    __local float LT [height];
    __local int bitErrors = 0;
    __local bool failed = false;

    //here come my actual computations which utilize the space in LP and LT
}

This however refuses to compile, since the parameters length and height are not known at compile time. But it is not clear to my at all how to do this correctly. Should I use pointers with memalloc? How to handle this in a way that the memory is only allocated once for the entire workgroup and not once per work item?

All that I need is 2 arrays of floats, 1 int and 1 boolean that are shared among the entire workgroup (so all 100 work items). But I fail to find any method that does this correctly...

3

3 Answers

27
votes

It's relatively simple, you can pass the local arrays as arguments to your kernel:

kernel void myKernel(const int length, const int height, local float* LP, 
                     local float* LT, a bunch of other parameters) 

You then set the kernelargument with a value of NULL and a size equal to the size you want to allocate for the argument (in byte). Therefore it should be:

clSetKernelArg(kernel, 2, length * sizeof(cl_float), NULL);
clSetKernelArg(kernel, 2, height* sizeof(cl_float), NULL);

local memory is always shared by the workgroup (as opposed to private), so I think the bool and int should be fine, but if not you can always pass those as arguments too.

Not really related to your problem (and not necessarily relevant, since I do not know what hardware you plan to run this on), but at least gpus don't particulary like workingsizes which are not a multiple of a particular power of two (I think it was 32 for nvidia, 64 for amd), meaning that will probably create workgroups with 128 items, of which the last 28 are basically wasted. So if you are running opencl on gpu it might help performance if you directly use workgroups of size 128 (and change the global work size appropriately)

As a side note: I never understood why everyone uses the underscore variant for kernel, local and global, seems much uglier to me.

2
votes

You could also declare your arrays like this:

__local float LP[LENGTH];

And pass the LENGTH as a define in your kernel compile.

int lp_size = 128; // this is an example; could be dynamically calculated
char compileArgs[64];
sprintf(compileArgs, "-DLENGTH=%d", lp_size);
clBuildProgram(program, 0, NULL, compileArgs, NULL, NULL);
1
votes

You do not have to allocate all your local memory outside the kernel, especially when it is a simple variable instead of a array.

The reason that your code cannot compile is that OpenCL does not support local memory initialization. This is specified in the document(https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html). It is also not feasible in CUDA(Is there a way of setting default value for shared memory array?)


ps:The answer from Grizzly is good enough and it would be better if I can post it as a comment, but I am restricted by the reputation policy. Sorry.