Synchronise atomic counter across multiple GPUs

Question

I use an atomic counter in a compute shader, with an atomic_uint bound to a dynamic GL_ATOMIC_COUNTER_BUFFER (in a similar way to the Lighthouse3D OpenGL atomic counter tutorial).

I'm using the atomic counter in a particle system to check whether a condition has been reached for all particles; I expect to see counter == numParticles when all of the particles are in the correct place.

I map the buffer each frame and check if the atomic counter has counted all of the particles:

GLuint *ptr = (GLuint *) glMapBuffer( GL_ATOMIC_COUNTER_BUFFER, GL_READ_ONLY );
GLuint particleCount = ptr[ 0 ];
glUnmapBuffer( GL_ATOMIC_COUNTER_BUFFER );
if( particleCount == numParticles() ) {
    // do stuff
}

On a single-GPU host the code works fine and particleCount always reaches numParticles(), but on a multi-GPU host particleCount never reaches numParticles().

I can visually confirm that the condition has been reached, so the test should be true. However, particleCount changes each frame, going up and down, but never reaches numParticles().

I have tried an OpenGL memory barrier with GL_ATOMIC_COUNTER_BARRIER_BIT before mapping the buffer to read particleCount:

glMemoryBarrier(GL_ATOMIC_COUNTER_BARRIER_BIT);
GLuint *ptr = (GLuint *) glMapBuffer( GL_ATOMIC_COUNTER_BUFFER, GL_READ_ONLY );
GLuint particleCount = ptr[ 0 ];
glUnmapBuffer( GL_ATOMIC_COUNTER_BUFFER );
if( particleCount == m_particleSystem->numParticles() )
{
    // do stuff
}

and I've tried a GLSL barrier before incrementing the counter in the compute shader:

memoryBarrierAtomicCounter();
atomicCounterIncrement( particleCount );

but the atomic counter doesn't seem to synchronise across devices.

What is the correct way to synchronise so that the atomic counter works with multiple devices?

Andon M. Coleman · Accepted Answer · 2015-01-19T15:12:36

Your choice of memory barrier is actually inappropriate in this situation.

That barrier (GL_ATOMIC_COUNTER_BARRIER_BIT) would make changes to the atomic counter visible to later shader passes (e.g. flush caches and run shaders in a specific order), but it does not ensure that any concurrent shaders have completed before you map, read and unmap your buffer.

Since your buffer is being mapped and read back, you do not need that barrier; it is for coherency between shader passes. What you really need is to ensure that all shaders accessing your atomic counter have finished before you try to read the data using a GL command, and for this you need GL_BUFFER_UPDATE_BARRIER_BIT.

GL_BUFFER_UPDATE_BARRIER_BIT:

Reads/writes via glBuffer(Sub)Data, glCopyBufferSubData, glProgramBufferParametersNV, and glGetBufferSubData, or to buffer object memory mapped by glMapBuffer(Range) after the barrier will reflect data written by shaders prior to the barrier.

Additionally, writes via these commands issued after the barrier will wait on the completion of any shader writes to the same memory initiated prior to the barrier.

You may be thinking about barriers from the wrong perspective. The barrier you need depends on which type of operation the memory read needs to be coherent to.

I would suggest brushing up on the incoherent memory access use cases:

(1) Shader write/read between rendering commands

One rendering command writes incoherently, and the other reads. There is no need for coherent (the GLSL qualifier) here at all. Just use glMemoryBarrier before issuing the reading rendering command, using the appropriate access bit.

(2) Shader writes, other OpenGL operations read

Again, coherent is not necessary. You must use a glMemoryBarrier before performing the read, using a bitfield that is appropriate to the reading operation of interest.

In case (1), the barrier you want is in fact GL_ATOMIC_COUNTER_BARRIER_BIT, because it will force strict memory and execution order rules between different shader passes that share the same atomic counter.

In case (2), the barrier you want is GL_BUFFER_UPDATE_BARRIER_BIT. The "reading operation of interest" is glMapBuffer (...) and as shown above, that is covered under GL_BUFFER_UPDATE_BARRIER_BIT.

In your situation, you are reading the buffer back using the GL API. You need GL commands to wait for all pending shaders to finish writing (this does not happen automatically for incoherent memory access - image load/store, atomic counters, etc.). That is textbook case (2).
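The two cases can be condensed into a small lookup from "who reads the counter next" to the bitfield passed to glMemoryBarrier. This is only a sketch: the CounterReader enum, its READER_* flags and counter_barrier_bits() are hypothetical names made up for illustration, and the two bit values are copied from the standard OpenGL headers behind #ifndef guards so the snippet compiles with or without a GL header. Because glMemoryBarrier accepts a bitwise OR of barrier bits, a counter that is both read by a later shader pass and mapped by the application gets both bits.

```c
#include <assert.h>

/* Barrier bit values from the OpenGL 4.2+ headers (glcorearb.h); the
   guards avoid redefinition when a real GL header is already included. */
#ifndef GL_BUFFER_UPDATE_BARRIER_BIT
#define GL_BUFFER_UPDATE_BARRIER_BIT 0x00000200
#endif
#ifndef GL_ATOMIC_COUNTER_BARRIER_BIT
#define GL_ATOMIC_COUNTER_BARRIER_BIT 0x00001000
#endif

/* Hypothetical flags: who consumes the counter after the writing shader. */
enum CounterReader {
    READER_SHADER_ATOMIC = 1 << 0, /* case (1): a later shader pass reads the atomic_uint */
    READER_GL_COMMAND    = 1 << 1  /* case (2): glMapBuffer(Range), glGetBufferSubData, ... */
};

/* Hypothetical helper: returns the bitfield to pass to glMemoryBarrier()
   before the reads described by `readers` (an OR of CounterReader flags). */
static unsigned int counter_barrier_bits(unsigned int readers)
{
    unsigned int bits = 0;

    /* Reject flags this sketch does not know about (debug builds only). */
    assert((readers & ~(unsigned int)(READER_SHADER_ATOMIC | READER_GL_COMMAND)) == 0);

    if (readers & READER_SHADER_ATOMIC)
        bits |= GL_ATOMIC_COUNTER_BARRIER_BIT;  /* shader-to-shader coherency */
    if (readers & READER_GL_COMMAND)
        bits |= GL_BUFFER_UPDATE_BARRIER_BIT;   /* shader writes -> GL buffer reads/maps */
    return bits;
}
```

For the code in the question, this amounts to calling glMemoryBarrier(counter_barrier_bits(READER_GL_COMMAND)), i.e. glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT), after the compute dispatch that increments the counter and before glMapBuffer.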
