1 vote

Consider the GPU kernel function below, to be executed on a K2000 GPU card (compute capability 3.0):

#define TILE_DIM 64
__global__ void PerformSomeOperations(float* g_A, float* g_B)
{
    __shared__ float BlockData[TILE_DIM][TILE_DIM];
    // Some Operation to be performed
}

How can I determine the maximum number of blocks and threads that can execute in parallel on a single multiprocessor? Also, if I have N blocks, does this mean that the shared memory available to each block will be divided by N?

The shared-memory limit on occupancy for your example is found by dividing the total shared memory available per SM by the shared memory used per block; that gives the maximum number of thread blocks that can be simultaneously resident on an SM. The total shared memory available per SM is 48 KB, assuming you have the cache config set that way (prefer shared), and a small number of shared memory bytes will be used up in kernel launch overhead. The number of threads is just threads per block times blocks. – Robert Crovella
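
To make that arithmetic concrete for the kernel above, here is a back-of-the-envelope sketch (assuming the 48 KB prefer-shared configuration on compute capability 3.0; the 256-thread block size is just an illustrative choice, not from the question):

#include <cstdio>

#define TILE_DIM 64

int main()
{
    // Static shared memory declared by the kernel, per block:
    size_t sharedPerBlock = TILE_DIM * TILE_DIM * sizeof(float);   // 64 * 64 * 4 = 16384 bytes (16 KB)

    // Shared memory available per SM, assuming cudaFuncCachePreferShared:
    size_t sharedPerSM = 48 * 1024;                                // 48 KB

    // Shared-memory limit on resident blocks per SM (ignoring the small launch overhead):
    int blocksPerSM = (int)(sharedPerSM / sharedPerBlock);         // = 3

    int threadsPerBlock = 256;                                     // hypothetical launch configuration
    printf("Resident blocks per SM (shared-memory limit): %d\n", blocksPerSM);
    printf("Resident threads per SM from that limit:      %d\n", blocksPerSM * threadsPerBlock);
    return 0;
}

The actual number of resident blocks is the minimum over all the per-SM limits (shared memory, registers, and the 2048-thread and 16-block caps on compute capability 3.0), so shared memory is only one of the constraints.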

1 Answer

1 vote

You can run the deviceQuery example from the CUDA samples to determine the maximum number of blocks and the other device limits. Within each block you can have at most 1024 threads.
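
If you would rather query those limits from code than run deviceQuery, a minimal sketch using the standard cudaGetDeviceProperties call looks roughly like this (device 0 is assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Shared memory per block: %zu bytes\n", (size_t)prop.sharedMemPerBlock);
    printf("Warp size:               %d\n", prop.warpSize);
    return 0;
}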

How many blocks can execute on an SM (streaming multiprocessor)? Each SM can have up to 16 active blocks on Kepler and 8 active blocks on Fermi.

You also need to think in terms of warps. One warp = 32 threads. On Fermi the maximum number of active warps per SM is 48, and on Kepler it is 64. These are ideal numbers; the actual number of warps executing on an SM depends on your launch configuration and the resources your kernel uses.

Usually you will calculate occupancy = active warps / maximum active warps.
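
If your toolkit is new enough (CUDA 6.5 or later), the runtime can do this calculation for you via cudaOccupancyMaxActiveBlocksPerMultiprocessor. The sketch below uses the question's kernel with an illustrative block size of 256; the body just touches the shared array so the compiler keeps the 16 KB allocation, and is not meant to be the questioner's real operation:

#include <cstdio>
#include <cuda_runtime.h>

#define TILE_DIM 64

__global__ void PerformSomeOperations(float* g_A, float* g_B)
{
    __shared__ float BlockData[TILE_DIM][TILE_DIM];
    // Dummy use of shared memory so the static allocation is not optimized away.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    BlockData[threadIdx.x / TILE_DIM][threadIdx.x % TILE_DIM] = g_A[i];
    __syncthreads();
    g_B[i] = BlockData[threadIdx.x / TILE_DIM][threadIdx.x % TILE_DIM];
}

int main()
{
    int blockSize = 256;                 // illustrative launch configuration
    int activeBlocksPerSM = 0;

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocksPerSM,
                                                  PerformSomeOperations,
                                                  blockSize,
                                                  0);   // no dynamic shared memory

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int activeWarps = activeBlocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("Active blocks per SM: %d\n", activeBlocksPerSM);
    printf("Occupancy:            %.1f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}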

If you have N blocks resident on an SM, then that SM's total shared memory is divided among the N blocks. If you want a large number of blocks, you may want to check the occupancy calculator spreadsheet to see how much shared memory you can use per block without affecting performance.

But,

__shared__ float BlockData[TILE_DIM][TILE_DIM];

is allocated per block, so each block gets its own full copy of the array (TILE_DIM * TILE_DIM * 4 bytes = 16 KB here).
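
As a small illustrative sketch of that point (not from the original answer; the kernel and output names are made up), every block can fill its own copy of the array with its own block index and read it back undisturbed:

#include <cstdio>
#include <cuda_runtime.h>

#define TILE_DIM 64

// Each block fills its own copy of BlockData with its block index.
// If shared memory were split between blocks, the results below could not
// all come out correct at the same time.
__global__ void FillPerBlockShared(int* g_out)
{
    __shared__ float BlockData[TILE_DIM][TILE_DIM];
    for (int col = 0; col < TILE_DIM; ++col)          // one thread per row
        BlockData[threadIdx.x][col] = (float)blockIdx.x;
    __syncthreads();
    if (threadIdx.x == 0)                             // report one element per block
        g_out[blockIdx.x] = (int)BlockData[TILE_DIM - 1][TILE_DIM - 1];
}

int main()
{
    const int numBlocks = 4;
    int h_out[numBlocks];
    int* d_out = NULL;
    cudaMalloc(&d_out, numBlocks * sizeof(int));

    FillPerBlockShared<<<numBlocks, TILE_DIM>>>(d_out);   // 64 threads per block
    cudaMemcpy(h_out, d_out, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);

    for (int b = 0; b < numBlocks; ++b)
        printf("block %d saw %d in its shared array\n", b, h_out[b]);   // expected: block b saw b

    cudaFree(d_out);
    return 0;
}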