9
votes

I want to call different instantiations of a templated CUDA kernel with dynamically allocated shared memory in one program. My first naive approach was to write:

template<typename T>
__global__ void kernel(T* ptr)
{
  extern __shared__ T smem[];
  // calculations here ...                                                                                                                                          
}

template<typename T>
void call_kernel( T* ptr, const int n )
{
  dim3 dimBlock(n), dimGrid;
  kernel<<<dimGrid, dimBlock, n*sizeof(T)>>>(ptr);
}

int main(int argc, char *argv[])
{
  const int n = 32;
  float *float_ptr;
  double *double_ptr;
  cudaMalloc( (void**)&float_ptr, n*sizeof(float) );
  cudaMalloc( (void**)&double_ptr, n*sizeof(double) );

  call_kernel( float_ptr, n );
  call_kernel( double_ptr, n ); // problem, 2nd instantiation

  cudaFree( (void*)float_ptr );
  cudaFree( (void*)double_ptr );
  return 0;
}

However, this code cannot be compiled. nvcc gives me the following error message:

main.cu(4): error: declaration is incompatible with previous "smem"
(4): here
          detected during:
            instantiation of "void kernel(T *) [with T=double]"
(12): here
            instantiation of "void call_kernel(T *, int) [with T=double]"
(24): here

I understand that I am running into a name conflict because the shared memory is declared as extern. Nevertheless there is no way around that if I want to define its size during runtime, as far as I know.

So, my question is: Is there any elegant way to obtain the desired behavior? With elegant I mean without code duplication etc.

2
Possibly an oversight in the CUDA compiler, as this is otherwise allowed in C++ (without the __shared__ qualifier).user703016

2 Answers

16
votes

Dynamically allocated shared memory is really just a size (in bytes) and a pointer being set up for the kernel. So something like this should work:

replace this:

extern __shared__ T smem[];

with this:

extern __shared__ __align__(sizeof(T)) unsigned char my_smem[];
T *smem = reinterpret_cast<T *>(my_smem);

You can see other examples of re-casting of dynamically allocated shared memory pointers in the programming guide which can serve other needs.

EDIT: updated my answer to reflect the comment by @njuffa.

5
votes

(A variation on @RobertCrovella's answer)

NVCC is not willing to accept two extern __shared__ arrays of the same name but different types - even if they're never in each other's scope. We'll need to satisfy NVCC by having our template instances all use the same type for the shared memory under the hood, while letting the kernel code using them see the type it likes.

So we replace this instruction:

extern __shared__ T smem[];

with this one:

auto smem = shared_memory_proxy<T>();

where:

template <typename T>
__device__ T* shared_memory_proxy()
{
    // do we need an __align__() here? I don't think so...
    extern __shared__ unsigned char memory[];
    return reinterpret_cast<T*>(memory);
}

is in some device-side code include file.

Advantages:

  • One-liner at the site of use.
  • Simpler syntax to remember.
  • Separation of concerns - whoever reads the kernel doesn't have to think about why s/he's seeing extern, or alignment specifiers, or a reinterpret cast etc.

edit: This is implemented as part of my CUDA kernel author's tools header-only library: shared_memory.cuh (where it's named shared_memory::dynamic::proxy() ).