What is the instruction issue time latency of the warp schedulers in CUDA?

Question

I am under the impression that the (single) warp scheduler in compute capability 1.x GPUs issues one instruction per warp every 4 cycles, and since the latency of the arithmetic pipeline is 24 cycles, it can be completely hidden by having 6 active warps at any one time.
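The figure of 6 warps comes from dividing the pipeline latency by the issue interval. Below is a minimal sketch of that arithmetic. The function name is hypothetical, and the model assumes a fixed pipeline latency and that every warp always has an independent instruction ready, which real kernels do not guarantee.

```python
def warps_to_hide_latency(pipeline_latency_cycles: int, issue_interval_cycles: int) -> int:
    """Smallest number of ready warps that keeps the scheduler busy.

    Assumption (simplified model): the scheduler spends
    `issue_interval_cycles` issuing one instruction for one warp, and
    that warp's result is available `pipeline_latency_cycles` later.
    """
    # Ceiling division without floating point.
    return -(-pipeline_latency_cycles // issue_interval_cycles)


# Numbers from the question: 24-cycle arithmetic latency, 4-cycle issue interval.
print(warps_to_hide_latency(24, 4))  # 6
```

Under the same model, halving the issue interval doubles the number of warps needed to cover the same latency.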

For compute capability 2.1 GPUs, the Programming Guide mentions that "At every instruction issue time, each scheduler issues two independent instructions", while the post "How does the CUDA warp scheduler issue 2 instructions at a time for a warp?" suggests that each scheduler can issue one instruction per warp per cycle.

So what is the exact latency of the warp scheduler? How often, in cycles, is an instruction issued per warp? Can different instructions (MIMD) be issued simultaneously to different active, ready warps?

This doesn't really have anything to do with C, I'd remove the tag. — Veltas

Robert Crovella · Accepted Answer · 2013-08-27T22:51:57

Yes, there is one warp scheduler in a cc 1.x SM, and for integer and single-precision floating-point operations it will issue an instruction over 4 clock cycles to service the entire warp.

There are two warp schedulers in a cc 2.x SM. Excerpting from the programming guide, we see that the behavior of these 2 schedulers is slightly different between cc 2.0 and cc 2.1:

At every instruction issue time, each scheduler issues:

• One instruction for devices of compute capability 2.0,

• Two independent instructions for devices of compute capability 2.1,

for some warp that is ready to execute, if any. The first scheduler is in charge of the warps with an odd ID and the second scheduler is in charge of the warps with an even ID. Note that when a scheduler issues a double-precision floating-point instruction, the other scheduler cannot issue any instruction. A warp scheduler can issue an instruction to only half of the CUDA cores. To execute an instruction for all threads of a warp, a warp scheduler must therefore issue the instruction over two clock cycles for an integer or floating-point arithmetic instruction.
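The 4-cycle and 2-cycle figures both follow from dividing the 32 threads of a warp by the number of cores a scheduler can feed per clock. Here is a minimal sketch of that calculation. The function name is hypothetical, and the lane counts are assumptions based on the published core counts: 8 cores per SM for cc 1.x, and half of the 32 cores of a cc 2.0 SM, i.e. 16.

```python
WARP_SIZE = 32  # threads per warp on all CUDA devices discussed here


def cycles_to_issue_warp(lanes_per_clock: int, warp_size: int = WARP_SIZE) -> int:
    """Clock cycles needed to issue one instruction for every thread of a warp.

    `lanes_per_clock` is how many CUDA cores one scheduler can drive per
    clock (an assumed input; see the lead-in for where the values come from).
    """
    # Ceiling division: a partially filled final cycle still costs a cycle.
    return -(-warp_size // lanes_per_clock)


# cc 1.x: one scheduler driving 8 cores -> 4 cycles per warp instruction.
print(cycles_to_issue_warp(8))   # 4
# cc 2.0: each scheduler drives half of the 32 cores -> 2 cycles.
print(cycles_to_issue_warp(16))  # 2
```

This only models integer and single-precision floating-point arithmetic instructions; the excerpt notes that double-precision instructions block the other scheduler, which this sketch does not capture.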
