Do modern nVIDIA GPUs perform sub-warp scheduling of work?

Question

In recent nVIDIA GPU microarchitectures, a single streaming multiprocessor (SM) seems to be broken up into 4 sub-units, each drawn with horizontal or vertical 'bars' of 8 'squares' that correspond to different functional units: integer ops, 32-bit flops, 64-bit flops, and load/store. A single warp scheduler seems to be associated with each such "quarter-SM".

Now, in the CUDA programming model, the threads of each warp (= 32 threads) execute in instruction lockstep. However, when actually executing work, in a situation where, say, only the second half or the last quarter of the threads in a warp are active, can these sub-warps be scheduled to 2 or 3 quarter-SMs, with the remaining quarter-SM doing some other work?

Currently, in CUDA, all instructions, regardless of active-masking or predication, even on Volta, are scheduled warp-wide. — Robert Crovella
@RobertCrovella: So is the 'partition' in the diagram merely describing a "geometrical" arrangement of the chips? Or is the whole-warp scheduling a kind of a choice which theoretically could have been made differently? — einpoklum
None of this is a remotely new idea. The original G80 had 8 "cores" per SM, and two SMs per TPC sharing a texture and memory interface with a warp size of 32 and warp level scheduling. So there were various combinations of "quarter warp" and "half warp" transactions to retire a single instruction. But always with warp wide instruction scheduling/issue — talonmies

einpoklum · Accepted Answer · 2018-01-05T16:08:26

No, they don't.

Based on Robert Crovella's comment, sub-warp scheduling does not happen: instructions are always scheduled for full warps (at least as far as anyone using the chip is concerned). Internally, sub-warp scheduling may or may not be possible.
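To make the scenario from the question concrete, here is a minimal sketch of a kernel in which only the upper half of a warp takes a branch. The kernel name `upper_half_only` and the `masks` buffer are made up for illustration. It does not demonstrate how the hardware schedules anything; it only shows the programmer-visible side, where the inactive lanes are masked off while the instruction is still, per the comment above, issued for the whole warp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: lanes in the upper half of each warp record the mask
// of lanes executing alongside them; lower-half lanes skip the branch.
__global__ void upper_half_only(unsigned *masks)
{
    unsigned lane = threadIdx.x % warpSize;
    masks[threadIdx.x] = 0u;
    if (lane >= warpSize / 2) {
        // __activemask() returns the lanes of the warp active at this point.
        masks[threadIdx.x] = __activemask();
    }
}

int main()
{
    const int n = 32;  // a single warp
    unsigned h_masks[n];
    unsigned *d_masks = nullptr;

    if (cudaMalloc(&d_masks, n * sizeof(unsigned)) != cudaSuccess) return 1;
    upper_half_only<<<1, n>>>(d_masks);
    // cudaMemcpy synchronizes with the kernel and surfaces launch errors.
    cudaError_t err = cudaMemcpy(h_masks, d_masks, n * sizeof(unsigned),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_masks);
    if (err != cudaSuccess) return 1;

    for (int lane = 0; lane < n; ++lane)
        printf("lane %2d: active mask 0x%08x\n", lane, h_masks[lane]);
    return 0;
}
```

On a typical build, lanes 16 to 31 would be expected to report a mask covering only the upper half of the warp, though the compiler may turn the branch into predication, in which case the reported mask can differ. Either way, nothing visible at this level suggests that the idle half of the warp frees up execution resources for other work.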
