I have a large number of threads running, each doing a small matrix multiplication. All the small matrices have already been loaded into global memory. I want to improve performance by having each thread load its small matrices into shared memory and then compute the product. The problem is that I do not know the sizes of the matrices at compile time, so I cannot declare variables such as __shared__ double mat1[XSIZE][YSIZE]. On a PC, I would use dynamic allocation, but I do not know whether that is possible in shared memory. If calling malloc in a kernel allocates only in global memory (assuming such a call is even possible), that does not help either.
Is there a way to declare shared-memory arrays inside a kernel when their size is known only at runtime? If not, is there any other way to solve this problem?
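
To make the question concrete, here is a minimal sketch of the kind of thing I am after. It assumes that CUDA's unsized `extern __shared__` declaration, combined with the third kernel-launch parameter (a shared-memory byte count), is the right mechanism. The kernel name `smallMatMul`, the row-major layout, the per-thread slicing of the buffer, and the sizes in `main` are hypothetical choices for illustration, not a tuned implementation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: thread p multiplies its own (n x k) matrix by its own
// (k x m) matrix. The shared buffer has no compile-time size; the launch
// configuration supplies the byte count.
__global__ void smallMatMul(const double *A, const double *B, double *C,
                            int n, int k, int m, int numProducts)
{
    extern __shared__ double smem[];   // sized at launch time

    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numProducts) return;      // safe: no __syncthreads() below

    // Carve this thread's private slice out of the block-wide buffer.
    int perThread = n * k + k * m;
    double *a = smem + threadIdx.x * perThread;
    double *b = a + n * k;

    // Copy this thread's operands from global to shared memory.
    for (int i = 0; i < n * k; ++i) a[i] = A[p * n * k + i];
    for (int i = 0; i < k * m; ++i) b[i] = B[p * k * m + i];

    // Plain triple loop on the shared copies.
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < m; ++c) {
            double sum = 0.0;
            for (int t = 0; t < k; ++t) sum += a[r * k + t] * b[t * m + c];
            C[p * n * m + r * m + c] = sum;
        }
}

int main()
{
    // Sizes known only at run time (hard-coded here just for the demo).
    int n = 3, k = 2, m = 4, numProducts = 256, threads = 64;
    size_t szA = (size_t)numProducts * n * k;
    size_t szB = (size_t)numProducts * k * m;
    size_t szC = (size_t)numProducts * n * m;

    double *A, *B, *C;
    cudaMallocManaged(&A, szA * sizeof(double));
    cudaMallocManaged(&B, szB * sizeof(double));
    cudaMallocManaged(&C, szC * sizeof(double));
    for (size_t i = 0; i < szA; ++i) A[i] = 1.0;
    for (size_t i = 0; i < szB; ++i) B[i] = 2.0;

    // Bytes of shared memory per block: one slice per thread.
    size_t shmBytes = (size_t)threads * (n * k + k * m) * sizeof(double);
    int blocks = (numProducts + threads - 1) / threads;

    smallMatMul<<<blocks, threads, shmBytes>>>(A, B, C, n, k, m, numProducts);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // With A all ones and B all twos, every entry of C should equal 2 * k.
    int bad = 0;
    for (size_t i = 0; i < szC; ++i)
        if (C[i] != 2.0 * k) ++bad;
    printf("mismatches: %d\n", bad);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return bad != 0;
}
```

In this sketch the byte count passed at launch must cover every thread in the block, so it grows with both the matrix sizes and the block size, and it has to stay within the device's per-block shared-memory limit. Whether this per-thread slicing is actually faster than reading from global memory directly is something I have not measured.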