Memory barrier fails to sync between compute stage and data access by CUDA

Question

I have the following pipeline:

1. Render into a texture attachment of a custom FBO.
2. Bind that texture attachment as an image.
3. Run a compute shader, sampling from the image above using imageLoad/Store.
4. Write the results into an SSBO or image.
5. Map the SSBO (or image) as a CUDA CUgraphicsResource and process the data from that buffer using CUDA.

Now, the problem is synchronizing the data between stages 4 and 5. Here are the sync solutions I have tried:

glFlush: doesn't really work, as it doesn't guarantee that all submitted commands have finished executing.

glFinish: this one works, but it is not recommended because it blocks until every command submitted to the driver has completed.

ARB_sync: here it is said that it is not recommended because it heavily impacts performance.

glMemoryBarrier: this one is interesting, but it simply doesn't work.

Here is an example of the code:

glMemoryBarrier(GL_ALL_BARRIER_BITS);

I also tried:

glTextureBarrierNV()

The code execution goes like this:

  // rendered into the FBO...
  glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
  glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA8);
  glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
  glDispatchCompute(16, 16, 1);

  glFinish(); // <-- must sync here, otherwise the CUDA buffer doesn't receive all the data

  // CUDA maps the image to a CUDA buffer here...

Moreover, I tried unbinding the FBOs and unbinding the textures from the context before launching compute. I even tried launching one compute dispatch after another with a glMemoryBarrier between them, and fetching the target image of the first dispatch into CUDA. Still no sync. (Well, that makes sense, as two compute dispatches also run out of sync with each other.)

With the barrier placed after the compute shader stage, it doesn't sync! It only works when I replace it with glFinish, or with any other operation that completely stalls the pipeline, such as glMapBuffer().

~~So should I just use glFinish(), or am I missing something here? Why doesn't glMemoryBarrier() sync the compute shader work before CUDA takes over control?~~

UPDATE

I would like to rephrase the question a little, as the original one is pretty old. Nevertheless, even with the latest CUDA and Video Codec SDK (NVENC), the issue is still alive. So, I don't care why glMemoryBarrier doesn't sync. What I want to know is:

Is it possible to synchronize the completion of OpenGL compute shader execution with CUDA's use of the shared resource (in my case an OpenGL image) without stalling the whole rendering pipeline?
If the answer is 'yes', then how?

"ARB_sync Here it is said it is not recommended because it heavily impacts performance." No, it doesn't. And I quote, "Second, it is insufficient, because data may still be in a GPU cache. Sync objects don't ensure cache coherency. So don't do that." — Nicol Bolas
@NicolBolas I quote from the same place : " First, it's incredibly expensive, because it means having to wait to issue the second command until the first completed" ;) — Michael IV
There are two reasons listed for a reason. It's wrong to say that the Wiki doesn't recommend it solely because of performance. — Nicol Bolas

IGarFieldI · Accepted Answer · 2019-11-11T09:16:11

I know this is an old question, but if any poor soul stumbles upon this...

First, the reason glMemoryBarrier does not work: it requires the OpenGL driver to insert a barrier into the pipeline. CUDA does not care about the OpenGL pipeline at all.

Second, the only other way outside of glFinish is to use glFenceSync in combination with glClientWaitSync:

....
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA8);
glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
glDispatchCompute(16, 16, 1);
// Insert a fence after the dispatch; it signals once all prior GPU commands have completed.
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
... other work you might want to do that does not impact the buffer...
// Flush pending commands, then block the CPU until the fence signals or the timeout expires.
GLenum res = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeoutInNs);
if (res == GL_TIMEOUT_EXPIRED || res == GL_WAIT_FAILED) {
    ...handle timeouts and failures
}
glDeleteSync(fence); // release the sync object once it is no longer needed
cudaGraphicsMapResources(1, &gfxResource, stream);
...

This will cause the CPU to block until the GPU has finished all commands submitted before the fence. This includes memory transfers and compute operations.
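If even that blocking wait is too costly, one possible variation is to poll the fence with a zero timeout once per frame and hand the image to CUDA only after the fence reports completion, while OpenGL keeps rendering into a different image. The following is only a sketch under assumptions: `FrameQueue`, `PendingFrame`, `FenceStatus`, and the `slot` index into a ring of shared images are hypothetical names, not part of OpenGL or CUDA. The GL call is hidden behind a callback so the snippet compiles without a GL context; in real code the callback would presumably wrap `glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0)` and translate its return value.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Mirrors the four results glClientWaitSync can return.
enum class FenceStatus { AlreadySignaled, ConditionSatisfied, TimeoutExpired, WaitFailed };

// One frame whose compute output was submitted but not yet handed to CUDA.
struct PendingFrame {
    int slot;                           // hypothetical index into a ring of shared images
    std::function<FenceStatus()> poll;  // assumption: wraps a zero-timeout glClientWaitSync
};

class FrameQueue {
public:
    // Call right after glDispatchCompute + glFenceSync for this slot.
    void submit(int slot, std::function<FenceStatus()> poll) {
        assert(poll && "poll callback must be set");
        pending_.push_back({slot, std::move(poll)});
    }

    // Call once per frame. Passes every finished slot, oldest first, to
    // consume() and stops at the first frame still in flight. Never blocks
    // as long as poll() itself uses a zero timeout.
    template <class Consume>
    int drain(Consume consume) {
        int consumed = 0;
        while (!pending_.empty()) {
            FenceStatus status = pending_.front().poll();
            if (status == FenceStatus::TimeoutExpired) break;  // GPU still busy, retry next frame
            int slot = pending_.front().slot;
            pending_.pop_front();
            if (status == FenceStatus::WaitFailed) { ++failed_; continue; }  // drop this frame
            consume(slot);  // real code: cudaGraphicsMapResources for this slot's resource
            ++consumed;
        }
        return consumed;
    }

    std::size_t in_flight() const { return pending_.size(); }
    int failed() const { return failed_; }

private:
    std::deque<PendingFrame> pending_;
    int failed_ = 0;
};
```

The CPU still mediates the handoff, so this does not make CUDA wait on the GL fence directly; it only moves the wait off the render thread's critical path. It also assumes at least two output images, so that OpenGL never writes into the one CUDA is currently reading.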

Unfortunately, there is no way to tell CUDA to wait on an OpenGL memory barrier or fence. If you really require the extra bit of asynchronicity, you'll have to switch to DirectX 12, for which CUDA supports importing fences/semaphores, waiting on them, and signaling them from a CUDA stream via cuImportExternalSemaphore, cuWaitExternalSemaphoresAsync, and cuSignalExternalSemaphoresAsync.
