
I have a large number of threads running, each performing a small matrix multiplication. All the small matrices have been loaded into global memory. I would like to improve performance by having each thread load its small matrices into shared memory and then compute the product. The problem is that I do not know the sizes of the matrices at compile time, so I cannot declare them as in __shared__ double mat1[XSIZE][YSIZE]. On a PC, I would use dynamic allocation, but I do not know whether that is possible for shared memory. If calling malloc in a kernel allocates only global memory (assuming such a call is even possible), that does not help either.

Is there a way to declare arrays at runtime inside a kernel? Is there any other way to solve this problem?


1 Answer


You can declare a dynamically sized shared memory allocation in CUDA like this:

__global__ void kernel()
{
    // Dynamic shared memory must be declared as an unsized array,
    // not as a pointer
    extern __shared__ double mat1[];
}

Then launch your kernel like this, passing the size of the allocation in bytes as the third argument of the execution configuration:

kernel<<<grid,block,XSIZE*YSIZE*sizeof(double)>>>();

This is discussed in more detail in the CUDA Programming Guide.
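For the use case in the question, a minimal sketch of the kernel side could look like the following. It assumes each thread multiplies its own pair of XSIZE x YSIZE matrices stored contiguously in global memory, and carves two per-thread slices out of the single dynamic allocation; the names matmul_kernel, a, b, and c are hypothetical, not from the question.

__global__ void matmul_kernel(const double *a, const double *b,
                              double *c, int xsize, int ysize)
{
    // One dynamic allocation, sized at launch time; each thread
    // carves its own two input matrices out of it
    extern __shared__ double smem[];
    int n = xsize * ysize;
    double *mat1 = smem + threadIdx.x * 2 * n;
    double *mat2 = mat1 + n;

    // Copy this thread's matrices from global to shared memory
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < n; ++i) {
        mat1[i] = a[tid * n + i];
        mat2[i] = b[tid * n + i];
    }

    // ... compute the product of mat1 and mat2 into c ...
}

The launch then scales the third argument by the block size and the two operands per thread:

int n = XSIZE * YSIZE;
matmul_kernel<<<grid, block, block.x * 2 * n * sizeof(double)>>>(a, b, c, XSIZE, YSIZE);

Keep in mind that shared memory is only a few tens of kilobytes per block, so a per-thread staging scheme like this forces the block size to stay fairly small.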