I am new to GPGPU and CUDA. From my reading, on current-generation CUDA GPUs, threads are bundled into warps of 32 threads, and all threads in a warp execute the same instruction stream. If the threads within a warp take divergent branches, the warp serially executes every branch path taken, so each thread effectively pays the time cost of all paths combined. However, it seems that different warps running concurrently on the GPU can take divergent branches without this cost, since different warps are executed by separate computational resources. So my question is: how many concurrent warps can execute this way, where divergence between them causes no penalty? In other words, which number should I look for in the spec sheet: the number of "shader processors" or the number of "streaming multiprocessors"?
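To make sure I have the intra-warp case right, here is a minimal sketch of the kind of branch I mean (the kernel name and arithmetic are just illustrative, and I assume the block size is a multiple of 32):

```
// Threads 0-15 of each warp take one path, threads 16-31 the other,
// so (as I understand it) the warp executes both paths serially.
__global__ void divergent_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x & 31) < 16)      // lane within the warp
        data[i] = data[i] * 2.0f;     // half the warp goes here
    else
        data[i] = data[i] + 1.0f;     // the other half goes here
}
```

My understanding is that two *different* warps could each take a different one of these paths without either paying for both.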
Also, the same question for AMD Radeon, where the relevant terms might be "unified shaders" and "compute units".
Finally, suppose I have a workload that is highly divergent across threads, so that I essentially want just one thread per warp, using the GPU as an ordinary multi-core CPU. Is that possible, and how should I lay out the threads and thread blocks to make it happen? Can I avoid allocating memory etc. for the 31 redundant threads in each warp? I realize this may not be an ideal workload for GPGPU, but it would let me run an activity in the background without blocking the host CPU.
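For concreteness, this is the kind of layout I imagine (`do_work` is just a placeholder for my divergent per-item work; `num_items`, `one_per_warp`, and the launch shape are hypothetical):

```
// Stand-in for the highly divergent per-item work.
__device__ int do_work(int item)
{
    return item * item;  // placeholder
}

// One 32-thread block (i.e. one warp) per work item;
// only lane 0 of each warp does anything useful.
__global__ void one_per_warp(int *results, int num_items)
{
    if (threadIdx.x != 0) return;  // lanes 1-31 idle immediately
    int item = blockIdx.x;         // one work item per block/warp
    if (item < num_items)
        results[item] = do_work(item);
}

// Launch: one_per_warp<<<num_items, 32>>>(d_results, num_items);
```

Is this the right way to do it, or is there a better block/warp layout for this case?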