
From my understanding of NVIDIA's CUDA architecture, threads execute in groups of 32 called 'warps'. Multiple warps are resident at a time, and instructions are issued from any of the eligible warps (depending on some internal scheduling algorithm).

Now, if I have, say, 16 KB of shared memory on the device, and each thread uses 400 bytes of shared memory, then one warp will need 400 * 32 = 12.8 KB. Does this mean that the GPU cannot actually schedule more than 1 warp at a time, irrespective of how many threads I launch within a given block?

In short, resources are only allocated to warps/blocks once they become active. If the kernel's resource requirements allow at least one block to be active, then you are good to go. – Zk1001

1 Answer


From a resource standpoint (registers, shared memory, etc.) the important unit is the threadblock, not the warp.

In order to schedule a threadblock for execution, there must be enough free resources on the SM to cover the needs of the entire threadblock. All threadblocks in a grid will have exactly the same resource requirements.

If the SM has no currently executing threadblocks (such as at the point of kernel launch), then the SM must have at least enough resources to cover the needs of a single threadblock. If that is not the case, the kernel launch will fail. This could happen, for example, if the number of registers per thread, times the number of threads per block, exceeded the number of registers in the SM.

After the SM has a single threadblock scheduled, additional threadblocks can be scheduled depending on the available resources. So to extend the register example: if each threadblock required 30K registers (regs/thread * threads/block), and the SM had a maximum of 64K registers, then at most two threadblocks could be scheduled (i.e. their warps could possibly be brought into execution by the SM).

In this way, any warp that could possibly be brought into execution already has enough resources allocated for it. This is a principal part of the scheduling mechanism that allows the SM to switch execution from one warp to another with zero delay (fast context switching).