Dynamic allocation in shared memory in OpenCL on Nvidia

Question

I'm following the example here to create a variable-length local memory array. The kernel signature is something like this:

__kernel void foo(__global float4* ex_buffer,
                 int ex_int,
                 __local void *local_var)

Then I call clSetKernelArg for the local memory kernel argument as follows:

clSetKernelArg(*kern, 2, sizeof(char) * MaxSharedMem, NULL)

where MaxSharedMem is obtained by querying CL_DEVICE_LOCAL_MEM_SIZE. Inside the kernel, I split the allocated local memory into several arrays and other data structures and use them as needed. All of this works fine on AMD (GPU and CPU) and Intel devices. On Nvidia, however, I get the error CL_INVALID_COMMAND_QUEUE when I enqueue this kernel and then call clFinish on the queue.

Here is a minimal kernel that reproduces the error (local work size is 32):

__kernel
void s_Kernel(const unsigned int N, __local void *shared_mem_block)
{
    const ushort thread_id = get_local_id(0);
    __local double *foo = shared_mem_block;
    __local ushort *bar = (__local ushort *) &(foo[1000]);

    foo[thread_id] = 0.;
    bar[thread_id] = 0;
}

The kernel runs fine if I allocate the same arrays and data structures in local memory statically. Could somebody provide an explanation for this behavior, and/or workarounds?

What is the actual value of MaxSharedMem that you are using/getting? What happens if you reduce the MaxSharedMem variable to some smaller value, say 1000? — Robert Crovella
The host is C++. MaxSharedMem is 48000 for the card I'm testing on, which is a Quadro K1000M. I tried setting it to a smaller value, but this had no effect. The only thing that allowed me to avoid the CL_INVALID_COMMAND_QUEUE error was statically allocating all my data structures in local memory, which is hardly workable for my kernel. — tavr
Perhaps you should show a short, complete example that is not working. Here is my sample, modified slightly from the vector add sample here, which seems to run correctly on an NVIDIA K40c. — Robert Crovella

tavr · Accepted Answer · 2017-03-19T02:29:03

For those interested, I finally received an explanation from Nvidia. When the block of local (shared) memory is passed in through a void pointer, its actual alignment does not match the 8-byte alignment expected for a pointer to double. The GPU raises an exception on the misaligned access, which is presumably what the host then reports as CL_INVALID_COMMAND_QUEUE.

As one of the comments pointed out, a way to circumvent the problem is to declare the kernel parameter as a pointer to a type that the compiler will align to at least 8 bytes (double, ulong, etc.).
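As a rough sketch of that workaround, here is the kernel from my question with only the parameter type changed from `__local void *` to `__local double *`. This is OpenCL C and I'm assuming nothing else about the kernel changes. The host-side clSetKernelArg call stays the same (a byte count and a NULL pointer), and the size passed must cover at least 1000 doubles plus 32 ushorts, i.e. 8064 bytes, for a local work size of 32.

```opencl
__kernel
void s_Kernel(const unsigned int N, __local double *shared_mem_block)
{
    const ushort thread_id = get_local_id(0);

    // The parameter is already a pointer to double, so no implicit
    // void* -> double* conversion is needed here.
    __local double *foo = shared_mem_block;

    // 1000 doubles occupy 8000 bytes, so bar starts at a multiple of 8,
    // which also satisfies the 2-byte alignment of ushort.
    __local ushort *bar = (__local ushort *) &(foo[1000]);

    foo[thread_id] = 0.;
    bar[thread_id] = 0;
}
```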

Ideally, the compiler would take responsibility for any device-specific alignment issues, but the small kernel in my question contains an implicit pointer cast (from `__local void *` to `__local double *`), and I think that is what confuses it.

Once the memory is 8-byte aligned, a cast to a pointer type with a smaller alignment requirement (e.g. ushort) works without issues. So, if you're chaining the memory allocation like I'm doing, and the pointers are to different types, make sure the kernel signature uses a pointer to the largest type.
