So I have a compute shader kernel with the following logic:

[numthreads(64,1,1)]
void CVProjectOX(uint3 t : SV_DispatchThreadID)
{
    // Skip surplus threads in the last group (thread index past the data).
    if (t.x >= TotalN)
        return;

    uint compt = DbMap[t.x];

    ....

I understand that branching (if/else) is not ideal in compute shaders. If so, what is the best way to limit thread work when the total number of threads needed is not an exact multiple of the kernel's numthreads?

For instance, in my example the thread group has 64 threads. Let's say I need 961 threads in total (it could be anything, really). If I dispatch 960, one db slot won't be processed; if I dispatch 1024, 63 threads will do unnecessary work, or possibly work that points to a non-existent db slot. (The number of db slots will vary.)

Is if(t.x >= TotalN) return; fine and the right approach here? Or should I clamp with min, tx = min(t.x, TotalN - 1), and keep writing to the final db slot? Or use modulo, tx = t.x % TotalN, and rewrite the first db slots?

Are there other solutions?

1 Answer

Limiting the number of threads this way is fine, yes. But be aware that an early return like this doesn't save as much work as you might expect:

The hardware executes threads in SIMD-like groups (called waves in DirectX, wavefronts on AMD and warps on NVIDIA). Depending on the hardware, a wavefront typically has 8 to 32 lanes (Intel iGPUs), 32 lanes (NVIDIA and AMD RDNA GPUs) or 64 lanes (AMD GCN GPUs). Due to the nature of SIMD, all threads in a wavefront always execute exactly the same instructions; you can only "mask out" some of them, meaning their writes are ignored and any out-of-bounds reads they would perform are harmless.

This means that in the worst case (a wavefront size of 64), when you need 961 threads and therefore dispatch 1024, the 63 surplus threads still step through the code; they just behave as if they didn't exist. If the wavefront size is smaller, the hardware can at least skip the wavefronts whose threads all take the early return, so in those cases the early return does save some work.
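
To put numbers on this, here is a minimal CPU-side sketch in C++ (not shader code) that counts, for a given dispatch, how many wavefronts do only real work, how many are partially masked, and how many lie completely past the bound and can exit early. The names `WaveStats` and `classifyWaves` are made up for illustration, and the sketch assumes threads are packed into wavefronts in order of thread index; real hardware scheduling is more involved.

```cpp
#include <cstdint>

// Hypothetical helper type: how the wavefronts of one dispatch split up.
struct WaveStats {
    uint32_t full;     // every lane is below totalN and does real work
    uint32_t partial;  // some lanes are masked out, but the wave still runs
    uint32_t idle;     // every lane hits the early return
};

// Classify each wavefront of a 1D dispatch of `dispatched` threads.
// Assumes lanes are assigned to waves in order of thread index and that
// `dispatched` is a multiple of `waveSize`.
WaveStats classifyWaves(uint32_t totalN, uint32_t dispatched, uint32_t waveSize) {
    WaveStats s{0, 0, 0};
    for (uint32_t first = 0; first < dispatched; first += waveSize) {
        uint32_t last = first + waveSize - 1;  // last thread index in this wave
        if (last < totalN)        ++s.full;
        else if (first >= totalN) ++s.idle;
        else                      ++s.partial;
    }
    return s;
}
```

With 961 needed threads and 1024 dispatched, a wavefront size of 64 gives 15 full, 1 partial and 0 idle wavefronts, so nothing is skipped. A size of 32 gives 30 full, 1 partial and 1 idle, and a size of 8 gives 120 full, 1 partial and 7 idle.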

So ideally you would never need a number of threads that is not a multiple of your group size (which, in turn, is hopefully a multiple of the hardware's wavefront size). If that's not possible, limiting the number of threads this way is the next best option, especially because all threads that reach the early return are adjacent to each other, which maximizes the chance that a whole wavefront can exit early.
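
On the host side, the usual way to reach the next multiple of the group size is a rounded-up integer division. A small C++ sketch, where `groupCountFor` and `surplusThreads` are hypothetical helper names and a group size of 64 matches the numthreads(64,1,1) in the question:

```cpp
#include <cstdint>

// Number of thread groups needed to cover totalN items when each group
// runs groupSize threads (rounded-up integer division).
// Assumes groupSize > 0 and that totalN + groupSize - 1 does not overflow.
constexpr uint32_t groupCountFor(uint32_t totalN, uint32_t groupSize) {
    return (totalN + groupSize - 1) / groupSize;
}

// Threads that are launched but lie past the end of the data; these are
// the ones the early-return guard in the kernel has to reject.
constexpr uint32_t surplusThreads(uint32_t totalN, uint32_t groupSize) {
    return groupCountFor(totalN, groupSize) * groupSize - totalN;
}
```

For 961 items this yields 16 groups (1024 threads) with 63 surplus threads, while 960 items yields 15 groups and no surplus. The group count is the value you would pass as the X argument of your dispatch call.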