Depending on the dimensions of your block the first condition threadIdx.x < 64
(note the .x
) may not cause any divergence at all. For example, if you have a block with dimensions (128,1,1)
then the first two warps (32-threads groups which execute in lock-step) will enter into the if
block while the last two will bypass it. Since the whole warp goes one way or the other there is no divergence.
A conditional like threadIdx.x == 1
will cause divergence, but it will have very modest cost. Indeed, in many cases CUDA will be able to implement the conditional expression with a single instruction. For instance, operations like min
, max
, and abs
will generally be implemented with a single instruction and cause absolutely no divergence. You can read about such instructions in the PTX Manual.
In general you should not be overly concerned about modest amounts of control-flow divergence like the above. Where divergence will bite you in in situations like
if (threadIdx.x % 4 == 0)
// do expensive operation
else if (threadIdx.x % 4 == 1)
// do expensive operation
else if (threadIdx.x % 4 == 2)
// do expensive operation
else
// do expensive operation
where an "expensive operation" would be one that required 10s or 100s of instructions. In this case the divergence caused by the if
statements would reduce efficiency by 75%.
Keep in mind that thread divergence is a much lesser concern than (1) high-level algorithm choices and (2) memory locality/coalescing. Very few CUDA programmers should ever be concerned with the sort of divergence in your examples.