- Let's say your GPU has 8 SMs. If you launch a CUDA kernel with enough blocks (say 200), will all 8 SMs be used for the execution?
Now consider only a single SM. Let's assume there are 8 active blocks with 256 threads/block (8 warps/block), and the maximum number of active warps per SM is 64.
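The arithmetic behind those numbers can be sketched as follows (a minimal check, assuming a warp size of 32 and the 64-warp-per-SM limit stated above):

```python
WARP_SIZE = 32
MAX_ACTIVE_WARPS_PER_SM = 64  # the per-SM limit assumed in this question

def active_warps(blocks_per_sm: int, threads_per_block: int) -> int:
    """Warps resident on one SM for a given configuration."""
    warps_per_block = threads_per_block // WARP_SIZE
    return blocks_per_sm * warps_per_block

warps = active_warps(blocks_per_sm=8, threads_per_block=256)
print(warps)                              # 8 blocks x 8 warps/block = 64
print(warps <= MAX_ACTIVE_WARPS_PER_SM)   # exactly at the 64-warp cap
```

So 8 active blocks of 256 threads saturate the assumed 64-warp limit exactly.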
Will the 8 active blocks be processed in parallel once the kernel starts?
I know that warps are scheduled by the warp schedulers in each SM, which means warps execute concurrently rather than strictly in parallel.
Here is my real problem: I am experiencing a latency issue with a particular kernel. Here are the limiting factors, and I just want to know the optimal adjustments for this case. If the active blocks are not executing at least concurrently, there is no point in increasing the active block count, since a minimum number of active blocks that still reaches 64 active warps would perform better (ignore the register limitation, as I can adjust it accordingly).
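To make the trade-off concrete, here is a sketch (hypothetical configurations, again assuming warp size 32 and a 64-warp cap) of several launch shapes that all reach the same number of resident warps per SM:

```python
WARP_SIZE = 32
MAX_ACTIVE_WARPS = 64  # assumed per-SM warp limit

# (blocks per SM, threads per block) -- fewer, larger blocks
# versus more, smaller blocks at the same total warp count
configs = [(8, 256), (4, 512), (2, 1024)]

for blocks, threads in configs:
    warps = blocks * (threads // WARP_SIZE)
    print(f"{blocks} blocks x {threads} threads -> {warps} active warps")
# Every configuration hits the same 64-warp cap, so warp occupancy
# alone cannot distinguish them; the question is whether the blocks
# themselves actually run concurrently.
```

This is why the question matters: if occupancy is identical across these shapes, the block count only helps if the blocks genuinely execute concurrently.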