4
votes

say I have 3 share memory array: a b c. I am not sure if following thread arrangement will cause control divergence or not,

if (threadIdx < 64)
{
    if (threadIdx == 1)
        for (int i = 0; i < N; i++)
            c += a[threadIdx]*a[threadIdx];
    else
        for (int i = 0; i < N; i++)
            c += a[threadIdx]*b[threadIdx];
}

if it does, how bad is it gonna affect performance? is there any efficient way to handle the problem? thanks

2

2 Answers

11
votes

Depending on the dimensions of your block the first condition threadIdx.x < 64 (note the .x) may not cause any divergence at all. For example, if you have a block with dimensions (128,1,1) then the first two warps (32-threads groups which execute in lock-step) will enter into the if block while the last two will bypass it. Since the whole warp goes one way or the other there is no divergence.

A conditional like threadIdx.x == 1 will cause divergence, but it will have very modest cost. Indeed, in many cases CUDA will be able to implement the conditional expression with a single instruction. For instance, operations like min, max, and abs will generally be implemented with a single instruction and cause absolutely no divergence. You can read about such instructions in the PTX Manual.

In general you should not be overly concerned about modest amounts of control-flow divergence like the above. Where divergence will bite you in in situations like

if (threadIdx.x % 4 == 0)
  // do expensive operation
else if (threadIdx.x % 4 == 1)
  // do expensive operation
else if (threadIdx.x % 4 == 2)
  // do expensive operation
else
  // do expensive operation

where an "expensive operation" would be one that required 10s or 100s of instructions. In this case the divergence caused by the if statements would reduce efficiency by 75%.

Keep in mind that thread divergence is a much lesser concern than (1) high-level algorithm choices and (2) memory locality/coalescing. Very few CUDA programmers should ever be concerned with the sort of divergence in your examples.

8
votes

If there is more than one thread per block, I would expect divergence in one warp of each block (whichever block holds thread 1).

But, the difference between your two loops is only in which memory to access, not in instructions. So, I would do this instead...

if (threadIdx.x < 64)
{
    //this conditional might diverge
    if (threadIdx.x == 1)
        ptr = a;
    else
        ptr = b;

    //but obviously this part will not
    for (int i = 0; i < N; i++)
        c += a[threadIdx]*ptr[threadIdx];
}