0 votes

From what I understand about Kepler GPUs, and CUDA in general, when a single SMX unit works on a block, it launches warps, which are groups of 32 threads. Now here are my questions:

1) If the SMX unit can work on 64 warps, that means there is a limit of 32 x 64 = 2048 threads per SMX unit. But Kepler GPUs have 4 warp schedulers, so does this mean that only 4 warps can be worked on simultaneously within a GPU kernel? And if so, does this mean I should really be looking for blocks that have multiples of 128 threads (assuming no divergence in threads) as opposed to the recommended 32? This is, of course, ignoring any divergence, or even cases where something like a global memory access can cause a warp to stall and have the scheduler switch to another.

2) If the above is correct, is the best possible outcome for a single SMX unit to work on 128 threads simultaneously? And for something like a GTX Titan that has 14 SMX units, a total of 128 x 14 = 1792 threads? I see numbers online that say otherwise: that a Titan can run 14 x 64 (max warps per SMX) x 32 (threads per warp) = 28,672 concurrently. How can that be, if SMX units launch warps and only have 4 warp schedulers? They cannot launch all 2048 threads per SMX at once. Maybe I'm confusing the maximum number of threads the GPU can run concurrently with what you are allowed to queue?
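For reference, here is a minimal sketch of how I query these per-device limits at runtime (assuming device 0; the fields are from the standard cudaDeviceProp struct):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // Maximum threads that can be *resident* across the whole GPU.
    int maxResident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("SMX count:            %d\n", prop.multiProcessorCount);
    printf("Max threads per SMX:  %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max resident threads: %d\n", maxResident);  // 14 * 2048 = 28672 on a Titan
    return 0;
}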

I'd appreciate any answers or clarification on this.

1
where is your programming-related effort? – Enamul Hassan

1 Answer

3 votes

so does this mean that only 4 warps can be worked on simultaneously within a GPU kernel?

Instructions from up to 4 warps can be scheduled in any given clock cycle on a Kepler SMX. However, due to pipelining in the execution units, in any given clock cycle instructions may be in various stages of pipeline execution from any, and up to all, warps currently resident on the SMX.

And if so, does this mean I should really be looking for blocks that have multiples of 128 threads (assuming no divergence in threads) as opposed to the recommended 32?

I'm not sure how you jumped from the previous point to this one. Since the instruction mix presumably varies from warp to warp (different warps are presumably at different points in the instruction stream), and the instruction mix varies from one place to another in the instruction stream, I don't see any logical connection between 4 warps being schedulable in a given clock cycle and any need to have groups of 4 warps. A given warp may be at a point where its instructions are highly schedulable (perhaps at a sequence of SP FMA, requiring SP cores, which are plentiful), and another 3 warps may be at another point in the instruction stream where their instructions are "harder to schedule" (perhaps requiring SFUs, of which there are fewer). Therefore arbitrarily grouping warps into sets of 4 doesn't make much sense. Note that we don't need divergence for warps to get out of sync with each other: the natural behavior of the scheduler, coupled with the varying availability of execution resources, can cause warps that started out together to end up at different points in the instruction stream.
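If you don't want to guess at a block size, the CUDA occupancy API (available since CUDA 6.5) can suggest one for a given kernel. A minimal sketch, assuming a placeholder kernel named myKernel:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2.0f;  // placeholder work
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for the block size that maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    printf("suggested block size: %d\n", blockSize);  // typically a multiple of the 32-thread warp size
    return 0;
}

Note that this chooses a block size based on occupancy (resident warps), not on the number of warp schedulers.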

For your second question, I think your fundamental knowledge gap is in understanding how a GPU hides latency. Suppose a GPU has a set of 3 instructions to issue across a warp:

LD R0, a[idx]     // load a[idx] from global memory into R0 (issues, does not stall)
LD R1, b[idx]     // load b[idx] from global memory into R1 (issues, does not stall)
MPY R2, R0, R1    // cannot dispatch until R0 and R1 arrive from memory

The first instruction is an LD from global memory, and it can be issued and does not stall the warp. The second instruction likewise can be issued. The warp will stall at the 3rd instruction, however, due to latency from global memory. Until R0 and R1 are properly populated, the multiply instruction cannot be dispatched; latency from main memory prevents it. The GPU deals with this problem by (hopefully) having a ready supply of "other work" it can turn to, namely other warps in an unstalled state (i.e. warps that have an instruction that can be issued). The best way to facilitate this latency-hiding process is to have many warps available to the SMX. There isn't any granularity to this (such as needing 4 warps). Generally speaking, the more threads/warps/blocks there are in your grid, the better chance the GPU will have of hiding latency.
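For concreteness, that instruction sequence is roughly what a compiler might emit for a one-line multiply kernel like the sketch below (the names a, b, c and the launch configuration are illustrative):

#include <cuda_runtime.h>

__global__ void multiply(const float *a, const float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        c[idx] = a[idx] * b[idx];  // two global loads, then a multiply that must wait for them
}

// A large grid gives each SMX many resident warps, so when one warp
// stalls on its loads, the schedulers can issue from another:
// multiply<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);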

So it is true that the GPU cannot "launch" (i.e. issue instructions from) 2048 threads in a single clock cycle. But when a warp stalls, it is put into a waiting queue until the stall condition is lifted, and until then it is helpful to have other warps "ready to go" for the next clock cycle(s).
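If it helps to see the resident-vs-issued distinction in numbers, the occupancy API can report how many blocks of a given kernel fit on one SMX at a time. A sketch, reusing the hypothetical multiply kernel from above:

#include <cstdio>
#include <cuda_runtime.h>
// assumes the multiply kernel from the previous sketch is in scope

int main() {
    int numBlocks = 0;
    const int blockSize = 256;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, multiply, blockSize, 0);
    // numBlocks * blockSize threads can be resident (queued) per SMX,
    // up to 2048 on Kepler, even though instructions are issued from
    // at most 4 warps per clock.
    printf("resident threads per SMX: %d\n", numBlocks * blockSize);
    return 0;
}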

GPU latency hiding is a commonly misunderstood topic. There are many available resources to learn about it if you search for them.