2 votes

Is there any relation between the size of shared memory and the maximum number of threads per block? In my case I use the maximum of 512 threads per block; my program makes use of all the threads and uses a considerable amount of shared memory.

Each thread has to do a particular task repeatedly. For example, my kernel might look like this:

    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int bId = (blockIdx.x * gridDim.y * gridDim.z) + (blockIdx.y * gridDim.z) + blockIdx.z;

    curandState seedValue;                                // per-thread RNG state
    for (int j = 0; j <= N; j++) {
        int tId       = threadIdx.x + (j * threadsPerBlock);
        int uniqueTid = bId * blockDim.x + tId;

        curand_init(uniqueTid, 0, 0, &seedValue);         // seed with a unique id
        float randomP = curand_uniform(&seedValue);       // uniform float in (0, 1]

        if (randomP <= input_value) {
            /* Some task */
        } else {
            /* Some other task */
        }
    }

But my threads are not going into the next iteration (say j = 2). Am I missing something obvious here?

2
How does your program use the "considerable amount of shared memory"? It's not clear how the code relates to your question. - einpoklum

2 Answers

2 votes

You have to distinguish between shared memory and global memory. The former is always allocated per block; the latter is the off-chip memory that is available on the GPU as a whole.

So, generally speaking, there is a kind of relation when it comes to threads: if you launch more threads per block, the maximum amount of shared memory per block stays the same, so each thread effectively has less of it available.
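As an illustration, here is a minimal sketch (the kernel name and array size are made up): shared memory is declared inside the kernel and exists once per block, while global memory is what you allocate on the device with cudaMalloc and pass in by pointer.

    __global__ void example_kernel(float *global_buf)   // global_buf resides in off-chip global memory
    {
        // One array of 512 floats per *block*, shared by all of that block's threads.
        // Its size is the same whether the block is launched with 128 or 512 threads.
        __shared__ float per_block_scratch[512];

        int tid = threadIdx.x;
        per_block_scratch[tid] = global_buf[blockIdx.x * blockDim.x + tid];
        __syncthreads();

        /* ... work on per_block_scratch ... */
    }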

Also refer to e.g. Using Shared Memory in CUDA C/C++.

2 votes

There is no immediate relationship between the maximum number of threads per block and the size of the shared memory (not 'device memory' - they're not the same thing).

However, there is an indirect relationship, in that with different Compute Capabilities, both these numbers change:

    Compute Capability             1.x      2.x - 3.x
    Max threads per block          512      1024
    Max shared memory per block    16 KB    48 KB

As one of them has increased with newer CUDA devices, so has the other.
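Rather than hard-coding the figures from the table above, you can query both limits at runtime with cudaGetDeviceProperties. A minimal host-side sketch (device 0, error checking omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0

        printf("Max threads per block   : %d\n", prop.maxThreadsPerBlock);
        printf("Shared memory per block : %zu bytes\n", prop.sharedMemPerBlock);
        return 0;
    }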

Finally, there is a block-level resource which is affected (used up) by launching more threads: the register file. There is a single register file which all of a block's threads share, and the constraint is

ThreadsPerBlock x RegistersPerThread <= RegisterFileSize

It is not trivial to determine how many registers your kernel code is using; but as a rule of thumb, if you use "a lot" of local variables, function call parameters, etc., you might hit the above limit and not be able to schedule as many threads.
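If you want to see where you stand, one option is a rough host-side sketch like the one below (my_kernel is just a placeholder for your own kernel, and error checking is omitted): it asks the runtime how many registers the compiler assigned per thread and compares that with the per-block register budget. Passing -Xptxas -v to nvcc should also make the compiler print register usage at build time.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel() { /* ... your kernel ... */ }   // placeholder

    int main()
    {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, my_kernel);   // registers assigned per thread

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);         // prop.regsPerBlock = per-block register budget

        printf("Registers per thread : %d\n", attr.numRegs);
        printf("Registers per block  : %d\n", prop.regsPerBlock);
        if (attr.numRegs > 0)
            printf("Register-limited threads per block: %d\n", prop.regsPerBlock / attr.numRegs);
        return 0;
    }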