This question arises from the difference between the theoretical and achieved occupancy observed for a kernel. I'm aware of "Different occupancy between calculator and nvprof" and also of "A question about the details about the distribution from blocks to SMs in CUDA".
Let's consider a GPU with compute capability 6.1 and 15 SMs (a GTX 1070: Pascal architecture, chipset GP104), and a small problem size of 2304 elements.
If we configure the kernel with 512 threads per block, so that each thread processes one element, we need 5 blocks to cover all the data (ceil(2304 / 512) = 5). The kernel is small enough that neither registers nor shared memory limit the number of resident blocks.
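For concreteness, here is a minimal sketch of that launch configuration; the kernel name, buffers, and per-element work are illustrative, not taken from the actual project:

```
#include <cuda_runtime.h>

// Illustrative kernel: one thread per element. Because 2304 = 4*512 + 256,
// the bounds check leaves only 256 of the last block's 512 threads active.
__global__ void process(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // placeholder for the real per-element work
}

int main()
{
    const int n = 2304, threads = 512;
    const int blocks = (n + threads - 1) / threads;   // ceil(2304/512) = 5
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    process<<<blocks, threads>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```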
The theoretical occupancy is therefore 1, because four concurrent blocks of 512 threads can be resident on one SM (2048 threads), giving 2048 / 32 = 64 active warps, the maximum for this compute capability.
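As a sanity check, the theoretical figure can also be queried at runtime through the occupancy API. The sketch below would go inside main() of the previous snippet and reuses the same hypothetical process() kernel:

```
#include <cstdio>

// Ask the runtime how many 512-thread blocks of process() fit on one SM,
// then convert that into a theoretical occupancy figure.
int maxBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, process, 512, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

// On cc 6.1: 4 blocks * 512 threads = 2048 = prop.maxThreadsPerMultiProcessor,
// so this prints 1.00.
printf("theoretical occupancy: %.2f\n",
       (float)(maxBlocksPerSM * 512) / prop.maxThreadsPerMultiProcessor);
```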
However, the achieved occupancy reported by the NVIDIA profiler is ~0.215, which is probably related to the way blocks are mapped onto the SMs. So, how are blocks scheduled onto the SMs in CUDA when there are fewer blocks than available SMs?
Option 1: schedule 4 blocks of 512 threads on one SM and the remaining block of 512 on another SM. In that case the occupancy would be (1 + 0.125) / 2 ≈ 0.56. I assume the last block has only 256 of its 512 threads active, enough to reach the last 256 elements of the array, so at warp granularity only 8 of the 64 possible warps are active on the second SM.
Option 2: schedule each block of 512 threads on a different SM. Since we have 15 SMs, why saturate only one of them? In that case each SM has 512 / 32 = 16 active warps, except the one holding the last block, which has only 256 active threads (8 warps). That gives 0.25 achieved occupancy on four SMs and 0.125 on the fifth, leading to (0.25 + 0.25 + 0.25 + 0.25 + 0.125) / 5 = 0.225.
Option 2 is closer to the occupancy reported by the Visual Profiler, and in our opinion it is what happens behind the scenes. In any case, the question is worth asking: how are blocks scheduled onto the SMs in CUDA when there are fewer blocks than available SMs? Is this behavior documented?
-- Please note this is not homework. It is a real scenario in a project that uses several third-party libraries, where some steps of a pipeline composed of multiple kernels process only a small number of elements.
(From a comment by Robert Crovella: the block-to-SM mapping can be inspected experimentally, using a clock64()-type delay to force blocks to persist for a while.)
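Following that suggestion, here is a sketch of such an experiment (kernel and variable names are mine): every block records the SM it runs on by reading the %smid special register, and a clock64() busy-wait keeps all five blocks resident at the same time so the mapping can be observed:

```
#include <cstdio>

__device__ unsigned smid()
{
    unsigned id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));   // read the SM id special register
    return id;
}

__global__ void where_am_i(unsigned *sm_of_block, long long cycles)
{
    if (threadIdx.x == 0)
        sm_of_block[blockIdx.x] = smid();

    long long start = clock64();
    while (clock64() - start < cycles)       // busy-wait so the block persists
        ;
}

int main()
{
    unsigned *d_sm, h_sm[5];
    cudaMalloc(&d_sm, 5 * sizeof(unsigned));
    where_am_i<<<5, 512>>>(d_sm, 1000000LL);  // ~1M cycles of delay per block
    cudaMemcpy(h_sm, d_sm, sizeof(h_sm), cudaMemcpyDeviceToHost);
    for (int b = 0; b < 5; ++b)
        printf("block %d ran on SM %u\n", b, h_sm[b]);
    cudaFree(d_sm);
    return 0;
}
```

If option 2 is what happens, a 15-SM part with only five resident blocks should print five distinct SM ids; the exact assignment, however, may vary between runs and driver versions.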