What guarantees does CUDA give for CC 3.x:
- Are all threads of one warp always synchronized?
- Are all threads of one half-warp (but not the whole warp) always synchronized?
I.e., when execution diverges across the branches of a conditional (if, switch, ...), and the threads of the first half-warp take one branch while the threads of the second half-warp take the other, do both branches execute simultaneously, at the same moment, given that both half-warps belong to the same warp?
Or will the threads of the second half-warp be inactive (disabled) and wait for the first half-warp to complete the first branch, and then, for the second branch, the roles are swapped: the first half-warp is disabled and waits for the second half-warp to complete the second branch, even if the divergence falls exactly on a half-warp boundary (exactly 16 threads)?
if (threadIdx.x < 16) { branch_1(); }
else                  { branch_2(); }
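Below is the kind of minimal test I have in mind (the kernel name divergence_timing and the use of clock64() are my own illustration; the compiler may reorder or predicate things, so treat it only as a sketch of the question, not a definitive measurement):

    #include <cstdio>

    // Each thread records the SM clock at the start of its branch. If the two
    // half-warps are serialized, the recorded values should differ clearly
    // between threads 0-15 and threads 16-31.
    __global__ void divergence_timing(long long *t)
    {
        long long start;
        if (threadIdx.x < 16) { start = clock64(); /* branch_1(); */ }
        else                  { start = clock64(); /* branch_2(); */ }
        t[threadIdx.x] = start;
    }

    int main()
    {
        long long *d_t, h_t[32];
        cudaMalloc(&d_t, 32 * sizeof(long long));
        divergence_timing<<<1, 32>>>(d_t);   // exactly one warp
        cudaMemcpy(h_t, d_t, sizeof(h_t), cudaMemcpyDeviceToHost);
        for (int i = 0; i < 32; ++i)
            printf("thread %2d: clock %lld\n", i, h_t[i]);
        cudaFree(d_t);
        return 0;
    }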
As stated here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-3-0
"Then, at every instruction issue time, each scheduler issues two independent instructions for one of its assigned warps that is ready to execute, if any."
Does this mean that the two independent instructions can come from different branches (1 and 2), one for each half-warp, or does it only mean that they are two consecutive independent instructions from a single branch, issued for the whole warp?
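To make the second interpretation concrete, this is what I understand by "two independent instructions located consecutively in a single branch" (the kernel name independent_ops is just my own example):

    // Straight-line code with no divergence: the two multiplications below do
    // not depend on each other, so (as I read the quoted sentence) a scheduler
    // could issue them together for the same warp in a single cycle.
    __global__ void independent_ops(float *out, const float *a, const float *b)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = a[i] * 2.0f;   // independent instruction 1
        float y = b[i] * 3.0f;   // independent instruction 2
        out[i] = x + y;          // depends on both, so it cannot pair with them
    }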