GLSL atomic counters (and branching) in fragment shaders

Question

A fragment shader uses two atomic counters. In each iteration it increments at most one of them, never both. Before modifying either counter, it always reads their current values, and if a counter is later modified, the previously read value is used for some custom logic. All of this happens in a (most likely unrollable) loop.

Envision a flow roughly like this:

in some small unrollable loop, say FOR 0-20 (compile-time resolvable const)...
    get counter values for AC1 and AC2
    check some value:
        if x: set texel in uimage1D_A at index AC1, increment AC1
        else: set texel in uimage1D_B at index (imgwidth-AC2-1), increment AC2

Question: when the shader queries the current counter value, does it always get the "most current" value? Do I lose the massive parallelism of fragment shaders here (speaking in terms of current-generation and future GPUs and drivers only)?

As for the branching (if x): I compare a texel in another (readonly restrict uniform) uimage1D to a (uniform) uint. So one operand is definitely a uniform scalar, but the other is an imageLoad().x, although the image itself is uniform. Is this sort of branching still "fully parallelized"? As you can see, each branch consists of exactly two almost identical instructions. Assuming a "perfectly optimizing" GLSL compiler, is this kind of branching likely to introduce a stall?

Nicol Bolas · Accepted Answer · 2012-03-17T08:44:29

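Atomic counters are atomic, but atomicity covers only each individual operation, not a sequence of them. A read followed by a separate increment is two operations, and other shader invocations can modify the counter in between.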

So, if you want to ensure that every shader invocation gets a unique value from a counter, then every invocation must access that counter only with atomicCounterIncrement (or atomicCounterDecrement, but they must all use the same one).

The correct way to do what you're suggesting is:

check some value:
    if x:
        1. atomicCounterIncrement(AC1), storing the value returned.
        2. Use the stored value as the texel at which to set something into uimage1D_A.
    else:
        1. atomicCounterIncrement(AC2), storing the value returned.
        2. Use the stored value to compute the texel (imgwidth - val - 1) at which to set something into uimage1D_B.
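
In GLSL, the steps above might look roughly like the sketch below. It is not the asker's actual shader: the binding points, the r32ui format, the names imageA, imageB, keys, threshold, payload and imgWidth, and the `<` comparison are all assumptions made up for illustration. It also assumes both images are large enough for every slot that gets handed out.

```glsl
#version 420 core

// Hypothetical declarations; the question does not show the real ones.
layout(binding = 0, offset = 0) uniform atomic_uint AC1;
layout(binding = 0, offset = 4) uniform atomic_uint AC2;

layout(r32ui, binding = 0) writeonly uniform uimage1D imageA;
layout(r32ui, binding = 1) writeonly uniform uimage1D imageB;
layout(r32ui, binding = 2) readonly restrict uniform uimage1D keys;

uniform uint threshold; // assumed uniform scalar compared against each key
uniform uint payload;   // assumed value to write into the claimed texel
uniform int  imgWidth;  // assumed width of imageB, supplied by the application

void main()
{
    // Compile-time constant trip count, so the compiler is free to unroll.
    for (int i = 0; i < 20; ++i)
    {
        uint key = imageLoad(keys, i).x;

        if (key < threshold) // assumed comparison; stands in for "if x"
        {
            // One atomic operation both claims the slot and advances the counter.
            // The return value is the counter's value before the increment.
            uint slot = atomicCounterIncrement(AC1);
            imageStore(imageA, int(slot), uvec4(payload, 0u, 0u, 0u));
        }
        else
        {
            uint slot = atomicCounterIncrement(AC2);
            // imageB is filled from the far end toward index 0.
            imageStore(imageB, imgWidth - int(slot) - 1, uvec4(payload, 0u, 0u, 0u));
        }
    }
}
```

Because the slot comes from the return value of the increment itself, no two invocations can receive the same index from the same counter. Which invocation gets which index is still unspecified, so the order of entries within each image should not be relied on.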

Your "fetch and later increment" strategy is a race condition waiting to happen. It doesn't matter if it's "fully parallelized" because it's broken. You need it to work before wondering if it's going to be fast.

I would strongly advise getting familiar with atomics and threading on CPUs before trying to tackle GPU stuff. This is a common mistake made by novices when working with atomics. You need to be a threading expert (or at least intermediate-level) if you want to successfully use GLSL atomics and image load/store.
