1
votes

A fragment shader uses two atomic counters. It may increment the first, or the second, but never both (and possibly neither). Before modifying a counter, however, its current value is always read, and, if the counter is then later modified, that previously read value is used for some custom logic. All of this happens in a (most likely unrollable) loop.

Envision a flow roughly like this (a GLSL sketch follows the list):

  • in some small unrollable loop, say FOR 0-20 (compile-time resolvable const)...
  • get counter values for AC1 and AC2
  • check some value:
  • if x: set texel in uimage1D_A at index AC1, increment AC1
  • else: set texel in uimage1D_B at index (imgwidth-AC2-1), increment AC2
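
In rough GLSL (a hedged sketch only; the names AC1, AC2, imgA, imgB, refImage, refValue and imgwidth are placeholders, and using the loop index as the texel coordinate for refImage is just an assumption for illustration):

    #version 420
    layout(binding = 0, offset = 0) uniform atomic_uint AC1;
    layout(binding = 0, offset = 4) uniform atomic_uint AC2;
    layout(r32ui) restrict uniform uimage1D imgA;              // "uimage1D_A"
    layout(r32ui) restrict uniform uimage1D imgB;              // "uimage1D_B"
    layout(r32ui) readonly restrict uniform uimage1D refImage;
    uniform uint refValue;
    uniform int imgwidth;

    void main() {
        for (int i = 0; i < 20; ++i) {             // compile-time constant bound, unrollable
            uint c1 = atomicCounter(AC1);          // plain read, no modification yet
            uint c2 = atomicCounter(AC2);          // other invocations may read the same values
            if (imageLoad(refImage, i).x == refValue) {
                imageStore(imgA, int(c1), uvec4(1u));
                atomicCounterIncrement(AC1);       // separate, later atomic operation
            } else {
                imageStore(imgB, imgwidth - int(c2) - 1, uvec4(1u));
                atomicCounterIncrement(AC2);
            }
        }
    }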

Question: when the shader queries the current counter value, does it always get the "most current" value? Do I lose the massive parallelism of fragment shaders here (speaking in terms of current-generation and future GPUs and drivers only)?

As for the branching (if x): I compare a texel in another (readonly restrict uniform) uimage1D to a (uniform) uint. So one operand is definitely a uniform scalar, but the other is an imageLoad().x, even though the image itself is uniform. Is this sort of branching still "fully parallelized"? As you can see, both branches each consist of exactly two, almost identical instructions. Assuming a "perfectly optimizing" GLSL compiler, is this kind of branching likely to introduce a stall?


2 Answers

5
votes

Atomic counters are atomic. But each atomic operation is atomic only for that operation.

So, if you want to ensure that every shader gets a unique value from a counter, then every shader must access that counter only with atomicCounterIncrement (or Decrement, but they must all use the same one).

The correct way to do what you're suggesting is (see the sketch after this list):

  1. check some value:
  2. if x:
    1. atomicCounterIncrement(AC1), storing the value returned.
    2. Use the stored value as the texel at which to set something into uimage1D_A.
  3. else:
    1. atomicCounterIncrement(AC2), storing the value returned.
    2. Use the stored value to compute the texel (imgwidth - val - 1) at which to set something into uimage1D_B.
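
A minimal GLSL sketch of those steps, reusing the placeholder names from the sketch in the question (AC1, AC2, imgA, imgB, refImage, refValue, imgwidth); this is illustrative only:

    for (int i = 0; i < 20; ++i) {
        if (imageLoad(refImage, i).x == refValue) {
            // One atomic operation: returns the pre-increment value and bumps AC1.
            uint slot = atomicCounterIncrement(AC1);
            imageStore(imgA, int(slot), uvec4(1u));                // slot is unique to this invocation
        } else {
            uint slot = atomicCounterIncrement(AC2);
            imageStore(imgB, imgwidth - int(slot) - 1, uvec4(1u)); // no other invocation gets this index
        }
    }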

Your "fetch and later increment" strategy is a race condition waiting to happen. It doesn't matter if it's "fully parallelized" because it's broken. You need it to work before wondering if it's going to be fast.

I would strongly advise getting familiar with atomics and threading on CPUs before trying to tackle GPU stuff. This is a common mistake made by novices working with atomics. You need to be a threading expert (or at least intermediate-level) if you want to use GLSL atomics and image load/store successfully.

2
votes

As Nicol Bolas suggested, if you want to ensure that the value you read from the atomic counter won't also be obtained by another invocation, you need to perform an atomic increment and use the value it returns. No other invocation will get that same value back from atomicCounterIncrement; it could only observe it via atomicCounter(AC1), which reads the counter without incrementing it. The moment you atomically increment the counter and get back the old value, you guarantee that everyone else who does the same will only get an already-incremented value.
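
To illustrate the difference (AC1 is just the placeholder counter name from the question):

    uint observed = atomicCounter(AC1);          // read-only: several invocations may observe the same value
    uint reserved = atomicCounterIncrement(AC1); // read-and-increment in one step: the returned value is handed out only once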

You seem to be doing an A-Buffer, I'm curious as to why you need the second counter. I assume uimage1D_A is your screen-sized map of pointers to the fragment list which is stored in uimage1D_B, am I right? You use AC2 to generate a pointer to a new unused memory part of uimage1D_B, but your AC1 suggests you are gradually acessing uimage1D_A so I might be completely wrong :)