A fragment shader uses two atomic counters. It may or may not increment the first and may or may not increment the second (but never both). Before modifying the counters, however, it always reads their current values, and if a counter is later modified, those previously read values are used for some custom logic. All this happens in a loop that can most likely be unrolled.
Envision a flow roughly like this:
- in some small unrollable loop, say FOR 0-20 (compile-time resolvable const)...
  - get counter values for AC1 and AC2
  - check some value:
    - if x: set texel in uimage1D_A at index AC1, increment AC1
    - else: set texel in uimage1D_B at index (imgwidth-AC2-1), increment AC2
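
To make the flow concrete, here is a rough GLSL sketch of what I have in mind. It is only an illustration, not my actual shader: the names `imgA`, `imgB`, `imgSrc`, `threshold` and `imgWidth`, the binding points, the `r32ui` format, the `<` comparison, the texel index `i` and the value being stored are all placeholders I made up for this example.

```glsl
#version 420 core

// Two atomic counters sharing one buffer binding (binding/offsets are assumptions).
layout(binding = 0, offset = 0) uniform atomic_uint AC1;
layout(binding = 0, offset = 4) uniform atomic_uint AC2;

// Output images written from the front (A) and from the back (B).
layout(r32ui, binding = 0) writeonly uniform uimage1D imgA;
layout(r32ui, binding = 1) writeonly uniform uimage1D imgB;

// Read-only image whose texel is compared against a uniform scalar.
layout(r32ui, binding = 2) readonly restrict uniform uimage1D imgSrc;

uniform uint threshold; // hypothetical scalar operand of the comparison
uniform int imgWidth;   // width of imgB, passed in by the application

out vec4 fragColor;

void main() {
    // Small loop with a compile-time constant trip count.
    for (int i = 0; i < 20; ++i) {
        // Read both counters first, as described in the list above.
        uint a = atomicCounter(AC1);
        uint b = atomicCounter(AC2);

        // One operand comes from imageLoad(), the other is a uniform uint.
        uint v = imageLoad(imgSrc, i).x;

        if (v < threshold) { // the actual comparison is an assumption
            imageStore(imgA, int(a), uvec4(v));
            atomicCounterIncrement(AC1);
        } else {
            imageStore(imgB, imgWidth - int(b) - 1, uvec4(v));
            atomicCounterIncrement(AC2);
        }
    }
    fragColor = vec4(0.0);
}
```

Note that the sketch deliberately keeps the read (`atomicCounter`) and the increment (`atomicCounterIncrement`) as two separate steps, because that separation is exactly what the question is about.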
Question: when the shader queries the current counter value, does it always get the "most current" value? Do I lose the massive parallelism of fragment shaders here (speaking in terms of current-generation and future GPUs and drivers only)?
As for the branching (if x): I compare a texel in another (`readonly restrict uniform`) `uimage1D` to a (`uniform`) `uint`. So one operand is definitely a uniform scalar, but the other is an `imageLoad().x`, although the image is uniform. Is this sort of branching still "fully parallelized"? You can see that each branch consists of exactly two almost identical instructions. Assuming a "perfectly optimizing" GLSL compiler, is this kind of branching likely to introduce a stall?