Templated CUDA kernel with dynamic shared memory

Question

I want to call different instantiations of a templated CUDA kernel with dynamically allocated shared memory in one program. My first naive approach was to write:

template<typename T>
__global__ void kernel(T* ptr)
{
  extern __shared__ T smem[];
  // calculations here ...                                                                                                                                          
}

template<typename T>
void call_kernel( T* ptr, const int n )
{
  dim3 dimBlock(n), dimGrid;
  kernel<<<dimGrid, dimBlock, n*sizeof(T)>>>(ptr);
}

int main(int argc, char *argv[])
{
  const int n = 32;
  float *float_ptr;
  double *double_ptr;
  cudaMalloc( (void**)&float_ptr, n*sizeof(float) );
  cudaMalloc( (void**)&double_ptr, n*sizeof(double) );

  call_kernel( float_ptr, n );
  call_kernel( double_ptr, n ); // problem, 2nd instantiation

  cudaFree( (void*)float_ptr );
  cudaFree( (void*)double_ptr );
  return 0;
}

However, this code cannot be compiled. nvcc gives me the following error message:

main.cu(4): error: declaration is incompatible with previous "smem"
(4): here
          detected during:
            instantiation of "void kernel(T *) [with T=double]"
(12): here
            instantiation of "void call_kernel(T *, int) [with T=double]"
(24): here

I understand that I am running into a name conflict because the shared memory is declared as extern. Nevertheless there is no way around that if I want to define its size during runtime, as far as I know.

So, my question is: Is there any elegant way to obtain the desired behavior? With elegant I mean without code duplication etc.

Possibly an oversight in the CUDA compiler, as this is otherwise allowed in C++ (without the __shared__ qualifier). — user703016

Robert Crovella Robert Crovella · Accepted Answer · 2014-12-19T17:12:01

Dynamically allocated shared memory is really just a size (in bytes) and a pointer being set up for the kernel. So something like this should work:

replace this:

extern __shared__ T smem[];

with this:

extern __shared__ __align__(sizeof(T)) unsigned char my_smem[];
T *smem = reinterpret_cast<T *>(my_smem);

You can see other examples of re-casting of dynamically allocated shared memory pointers in the programming guide which can serve other needs.

EDIT: updated my answer to reflect the comment by @njuffa.

Templated CUDA kernel with dynamic shared memory

2 Answers