0 votes

I have several questions regarding CUDA shared memory.

First, as mentioned in this post, shared memory may be declared in two different ways:

Either as dynamically allocated shared memory, where the size is passed at kernel launch, like the following:

   // Launch the kernel
   dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);

which is then used inside the kernel as mentioned:

   extern __shared__ int s[];

Or as static shared memory, which is declared inside the kernel like the following:

   __shared__ int s[64];

Both are used for different reasons; however, which one is better and why?
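To make the comparison concrete, here is a minimal self-contained sketch of both variants, modeled on the reverse example from that post (my own simplification, so treat the details as illustrative):

    // Static: the array size must be known at compile time.
    __global__ void staticReverse(int *d, int n)
    {
       __shared__ int s[64];
       int t = threadIdx.x;
       s[t] = d[t];
       __syncthreads();
       d[t] = s[n - t - 1];
    }

    // Dynamic: the size comes from the third launch parameter.
    __global__ void dynamicReverse(int *d, int n)
    {
       extern __shared__ int s[];
       int t = threadIdx.x;
       s[t] = d[t];
       __syncthreads();
       d[t] = s[n - t - 1];
    }

    // staticReverse<<<1, 64>>>(d_d, 64);
    // dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);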

Second, I'm running a multi-block kernel with 256 threads per block. I'm using static shared memory in both the __global__ kernel and a __device__ function it calls; both of them use shared memory. An example is given:

 __device__ float reduce(float value, unsigned int tid);  // forward declaration

 __global__ void startKernel(float* p_d_array)
 {
    __shared__ double matA[3*3];

    float a1 = 0;
    float a2 = 0;
    float a3 = 0;
    float b = p_d_array[threadIdx.x];

    a1 += reduce(b, threadIdx.x);
    a2 += reduce(b, threadIdx.x);
    a3 += reduce(b, threadIdx.x);

    // continue...
 }

 __device__ float reduce(float value, unsigned int tid)
 {
    __shared__ float data[256];  // parameter renamed so it doesn't shadow this array
    // do reduce ...
    return data[tid];
 }

I'd like to know how the shared memory is allocated in such a case. I presume each block receives its own shared memory.

What happens when block #0 enters the reduce function?

Is the shared memory allocated in advance of the function call?

I call the reduce device function three times. In such a case, theoretically in block #0, threads [0,127] may still be executing the first reduce call (delayed due to heavy work), while threads [128,255] may already be operating on the second reduce call. In this case, I'd like to know if both reduce calls are using the same shared memory?

Even though they come from two different call sites?

On the other hand, is it possible that a single block allocates 3*256*sizeof(float) of shared memory, one array per call? That seems superfluous in CUDA terms, but I'd still like to know how CUDA operates in such a case.
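If it helps, here is a small probe I sketched to test this myself (all names are mine, hypothetical): if the __shared__ array inside a device function is a single per-block allocation, a value written during the first call should still be visible during the second call.

    __device__ float probe(float v, unsigned int tid, bool write)
    {
       __shared__ float sdata[256];
       if (write)
          sdata[tid] = v;       // first call writes
       __syncthreads();
       return sdata[tid];       // second call only reads
    }

    __global__ void probeKernel(float* out)
    {
       unsigned int tid = threadIdx.x;
       probe((float)tid, tid, true);         // write pass
       out[tid] = probe(0.0f, tid, false);   // read pass: out[tid] == tid
                                             // would mean one shared allocation
    }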

Third, is it possible to gain higher performance from shared memory through compiler optimization, by using

   const float* p_shared;

or the __restrict__ keyword after the data assignment section?
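For clarity, this is the sort of declaration I mean (a hypothetical sketch, not my actual kernel):

    __global__ void loadKernel(const float* __restrict__ p_d_array,
                               float* __restrict__ p_out)
    {
       __shared__ float s[256];
       s[threadIdx.x] = p_d_array[threadIdx.x];  // data assignment section
       __syncthreads();

       const float* p_shared = s;   // read-only view after the assignment
       p_out[threadIdx.x] = 2.0f * p_shared[threadIdx.x];
    }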

You use statically declared shared memory when the size of the needed shared memory is known at compile time. You use dynamically allocated shared memory when the size of the needed shared memory is only known at runtime. - Vitality

1 Answer

2 votes

AFAIR, there is little difference whether you request shared memory "dynamically" or "statically" - in either case it's just a kernel launch parameter, be it set by your code or by code generated by the compiler.

Regarding the second question: the compiler will sum the shared memory requirements of the kernel function and the functions called by the kernel.
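You can check the summed figure yourself, either by compiling with --ptxas-options=-v or by querying the function attributes at runtime. A minimal sketch, assuming your startKernel is visible in the same translation unit:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
       cudaFuncAttributes attr;
       cudaFuncGetAttributes(&attr, startKernel);
       // sharedSizeBytes reports the static shared memory per block:
       // here matA (9 * sizeof(double)) plus the 256-float array in
       // reduce, counted once rather than once per call.
       printf("static shared memory per block: %zu bytes\n",
              attr.sharedSizeBytes);
       return 0;
    }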