I am new to GPGPU and CUDA. From my reading, on current-generation CUDA GPUs, threads are bundled into warps of 32 threads, and all threads in a warp execute the same instruction stream. If the threads within a warp take divergent branches, the warp serially executes every branch path taken, so each thread effectively pays the time cost of all paths combined. However, it seems that different warps running concurrently on the GPU can take divergent branches without this cost, since different warps are executed by separate computational resources. So my question is: how many concurrent warps can execute this way, where divergence between them causes no penalty? In other words, which number should I look for in the spec sheet: the number of "shader processors" or the number of "streaming multiprocessors"?
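To make sure I have the intra-warp case right, here is a minimal sketch of the kind of branch I mean (the kernel name and arithmetic are just illustrative, and I assume the block size is a multiple of 32):

```
// Threads 0-15 of each warp take one path, threads 16-31 the other,
// so (as I understand it) the warp executes both paths serially.
__global__ void divergent_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x & 31) < 16)      // lane within the warp
        data[i] = data[i] * 2.0f;     // half the warp goes here
    else
        data[i] = data[i] + 1.0f;     // the other half goes here
}
```

My understanding is that two *different* warps could each take a different one of these paths without either paying for both.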
Also, the same question for AMD Radeon, where the relevant terms might be "unified shaders" and "compute units".
Finally, suppose I have a workload that is highly divergent across threads, so that I essentially want just one thread per warp, using the GPU as an ordinary multi-core CPU. Is that possible, and how should I lay out the threads and thread blocks to make it happen? Can I avoid allocating memory etc. for the 31 redundant threads in each warp? I realize this may not be an ideal workload for GPGPU, but it would let me run an activity in the background without blocking the host CPU.
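For concreteness, this is the kind of layout I imagine (`do_work` is just a placeholder for my divergent per-item work; `num_items`, `one_per_warp`, and the launch shape are hypothetical):

```
// Stand-in for the highly divergent per-item work.
__device__ int do_work(int item)
{
    return item * item;  // placeholder
}

// One 32-thread block (i.e. one warp) per work item;
// only lane 0 of each warp does anything useful.
__global__ void one_per_warp(int *results, int num_items)
{
    if (threadIdx.x != 0) return;  // lanes 1-31 idle immediately
    int item = blockIdx.x;         // one work item per block/warp
    if (item < num_items)
        results[item] = do_work(item);
}

// Launch: one_per_warp<<<num_items, 32>>>(d_results, num_items);
```

Is this the right way to do it, or is there a better block/warp layout for this case?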