There has been much discussion about how to choose #blocks & blockSize, but I'm still missing something. Many of my concerns are addressed by this question: How do CUDA blocks/warps/threads map onto CUDA cores? (To simplify the discussion, assume there is enough per-thread and per-block memory; memory limits are not an issue here.)
kernelA<<<nBlocks, nThreads>>>(varA, constB, nThreadsTotal);
1) To keep the SM as busy as possible, I should set nThreads to a multiple of warpSize. True?
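For concreteness, this is the kind of rounding I have in mind, querying warpSize at runtime rather than hard-coding 32 (the helper name is mine, just a sketch):

#include <cuda_runtime.h>

// Round a requested thread count up to the nearest multiple of the
// device's warp size (32 on all current hardware).
int roundUpToWarp(int nThreads)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // device 0
    int w = prop.warpSize;
    return ((nThreads + w - 1) / w) * w;  // e.g. 100 -> 128
}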
2) An SM can only execute one kernel at a time. That is, all the HW cores of that SM are executing only kernelA. (Not some HW cores running kernelA while others run kernelB.) So if I have only one thread to run, I'm "wasting" the other HW cores. True?
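To make (2) concrete, this is the two-kernel situation I mean; kernelB and the streams are hypothetical, just to put two kernels in flight at once:

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Two independent kernels submitted to different streams. The question is
// whether warps of kernelA and kernelB could coexist on one SM, or whether
// each SM runs only one kernel until it finishes.
kernelA<<<nBlocks, nThreads, 0, s1>>>(varA, constB, nThreadsTotal);
kernelB<<<nBlocks, nThreads, 0, s2>>>(varA, constB, nThreadsTotal);

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);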
3) If the warp scheduler issues work in units of warpSize (32 threads), and each SM has 32 HW cores, then the SM would be fully utilized. What happens when the SM has 48 HW cores? How can I keep all 48 cores fully utilized when the scheduler issues work in chunks of 32? (If the previous paragraph is true, wouldn't it be better if the scheduler issued work in units of the HW core count?)
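For reference, this is how I'm inspecting the hardware. As far as I can tell, cudaDeviceProp reports the warp size and the SM count but not the cores per SM; the 48 figure comes from the spec sheet:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("warp size: %d\n", prop.warpSize);            // the scheduler's issue unit
    printf("SM count:  %d\n", prop.multiProcessorCount); // number of SMs
    // Cores per SM (the 48 above) is not a cudaDeviceProp field; it comes
    // from the architecture spec (48/SM is the compute capability 2.1 figure).
    return 0;
}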
4) It looks like the warp scheduler queues up two tasks at a time, so that when the currently executing kernel stalls or blocks, the second kernel is swapped in. (It is not clear to me, but I'll guess the queue here is more than two kernels deep.) Is this correct?
5) If my HW has an upper limit of 512 threads per block (nThreadsMax), that doesn't mean a kernel with 512 threads will run fastest in a single block. (Again, memory is not an issue.) There is a good chance I'll get better performance if I spread the 512-thread kernel across many blocks, not just one. A block is executed on one or many SMs. True?
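Concretely, by "spread across many blocks" I mean replacing the single 512-thread block with something like this (128 threads per block is just a placeholder choice):

int nThreads = 128;  // e.g. 4 warps per block -- a placeholder, not a recommendation
int nBlocks  = (nThreadsTotal + nThreads - 1) / nThreads;  // ceiling division
kernelA<<<nBlocks, nThreads>>>(varA, constB, nThreadsTotal);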
5a) I'm thinking the smaller the better, but does it matter how small I make nBlocks? The question is: how do I choose a value of nBlocks that is decent? (Not necessarily optimal.) Is there a mathematical approach to choosing nBlocks, or is it simply trial and error?
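The closest thing to a mathematical approach I've found is the occupancy helper in the runtime API (available since CUDA 6.5). Is something like this the intended way to pick a decent, if not optimal, configuration? (A sketch, reusing kernelA from above:)

int minGridSize = 0, blockSize = 0;
// Ask the runtime which block size maximizes occupancy for kernelA
// (no dynamic shared memory, no cap on the block size).
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernelA, 0, 0);
int nBlocks = (nThreadsTotal + blockSize - 1) / blockSize;
kernelA<<<nBlocks, blockSize>>>(varA, constB, nThreadsTotal);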