2 votes

Is there any relation between the size of shared memory and the maximum number of threads per block? In my case I use the maximum of 512 threads per block; my program makes use of all the threads and uses a considerable amount of shared memory.

Each thread has to do a particular task repeatedly. For example, my kernel might look like this:

    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int bId = (blockIdx.x * gridDim.y * gridDim.z) + (blockIdx.y * gridDim.z) + blockIdx.z;

    curandState seedValue;                                // per-thread RNG state
    for (int j = 0; j <= N; j++) {
        int tId       = threadIdx.x + (j * threadsPerBlock);
        int uniqueTid = bId * blockDim.x + tId;

        curand_init(uniqueTid, 0, 0, &seedValue);         // seed with a unique id
        float randomP = curand_uniform(&seedValue);       // uniform float in (0, 1]

        if (randomP <= input_value) {
            /* Some task */
        } else {
            /* Some other task */
        }
    }

But my threads are not going into the next iteration (say j = 2). Am I missing something obvious here?

2
How does your program use the "considerable amount of shared memory"? It's not clear how the code relates to your question. - einpoklum

2 Answers

2 votes

You have to distinguish between shared memory and global memory. The former is always allocated per block; the latter is the off-chip memory that is available on the GPU as a whole.

So, generally speaking, there is a kind of relation when it comes to threads: if you launch more threads per block, the maximum amount of shared memory per block stays the same, so each thread effectively has less of it available.
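As an illustration, here is a minimal sketch (the kernel name and array size are made up): shared memory is declared inside the kernel and exists once per block, while global memory is what you allocate on the device with cudaMalloc and pass in by pointer.

    __global__ void example_kernel(float *global_buf)   // global_buf resides in off-chip global memory
    {
        // One array of 512 floats per *block*, shared by all of that block's threads.
        // Its size is the same whether the block is launched with 128 or 512 threads.
        __shared__ float per_block_scratch[512];

        int tid = threadIdx.x;
        per_block_scratch[tid] = global_buf[blockIdx.x * blockDim.x + tid];
        __syncthreads();

        /* ... work on per_block_scratch ... */
    }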

Also refer to e.g. Using Shared Memory in CUDA C/C++.

2 votes

There is no immediate relationship between the maximum number of threads per block and the size of the shared memory (not 'device memory' - they're not the same thing).

However, there is an indirect relationship, in that with different Compute Capabilities, both these numbers change:

    Compute Capability             1.x      2.x - 3.x
    Max threads per block          512      1024
    Max shared memory per block    16 KB    48 KB

As one of them has increased with newer CUDA devices, so has the other.
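Rather than hard-coding the figures from the table above, you can query both limits at runtime with cudaGetDeviceProperties. A minimal host-side sketch (device 0, error checking omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0

        printf("Max threads per block   : %d\n", prop.maxThreadsPerBlock);
        printf("Shared memory per block : %zu bytes\n", prop.sharedMemPerBlock);
        return 0;
    }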

Finally, there is a block-level resource which is affected (used up) by launching more threads: the register file. There is a single register file which all of a block's threads share, and the constraint is

ThreadsPerBlock x RegistersPerThread <= RegisterFileSize

It is not trivial to determine how many registers your kernel code is using; but as a rule of thumb, if you use "a lot" of local variables, function call parameters, etc., you might hit the above limit and not be able to schedule as many threads.
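If you want to see where you stand, one option is a rough host-side sketch like the one below (my_kernel is just a placeholder for your own kernel, and error checking is omitted): it asks the runtime how many registers the compiler assigned per thread and compares that with the per-block register budget. Passing -Xptxas -v to nvcc should also make the compiler print register usage at build time.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel() { /* ... your kernel ... */ }   // placeholder

    int main()
    {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, my_kernel);   // registers assigned per thread

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);         // prop.regsPerBlock = per-block register budget

        printf("Registers per thread : %d\n", attr.numRegs);
        printf("Registers per block  : %d\n", prop.regsPerBlock);
        if (attr.numRegs > 0)
            printf("Register-limited threads per block: %d\n", prop.regsPerBlock / attr.numRegs);
        return 0;
    }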