I generate frames in OpenCL 60 times per second, using one OpenCL kernel call per frame, and write them to an OpenGL texture so that I can display them on the screen. There's no performance problem; the frame rate is as expected. The problem is that it's very wasteful: it keeps at least one CPU core fully busy, even when it has very little to do, like drawing a blank frame at a very low resolution. For comparison, when I skip the OpenGL interop and instead write from the CL kernel to a generic buffer, copy that buffer back to the host, and display it another way, the frame rate drops a bit (due to the back-and-forth overhead that the interop makes unnecessary), but CPU usage is much lower when there's little to do.
This means that there's something wrong with the way I do the interop that I assume must create some sort of busy wait.
Here's the relevant code, i.e. the code that is present when I use the interop and absent when I don't. At one point in my loop I clear the GL texture and make OpenCL acquire it:
uint32_t z = 0;
glClearTexImage(fb.gltex, 0, GL_RGBA, GL_UNSIGNED_BYTE, &z);   // clear to transparent black
glFinish();   // wait until GL is done with the texture (glFinish() implies a flush, so a separate glFlush() is redundant)
clEnqueueAcquireGLObjects(fb.clctx.command_queue, 1, &fb.cl_srgb, 0, NULL, NULL);
Then I enqueue the execution of my OpenCL kernel, which writes to the texture through the cl_mem object fb.cl_srgb, and later I give control back to OpenGL in order to display the texture on the screen:
clEnqueueReleaseGLObjects(fb.clctx.command_queue, 1, &fb.cl_srgb, 0, 0, NULL);
clFinish(fb.clctx.command_queue); // this blocks until the kernel is done writing to the texture and releasing the texture
// setting GL texture coordinates, probably not relevant to this question
float hoff = 2. * (fb.h - fb.maxdim.y) / (double) fb.maxdim.y;
glLoadIdentity(); // Reset the projection matrix
glViewport(0, 0, fb.maxdim.x, fb.maxdim.y);
glBegin(GL_QUADS);
glTexCoord2f(0.f, 0.f); glVertex2f(-1., 1.+hoff);
glTexCoord2f(1.f, 0.f); glVertex2f(1., 1.+hoff);
glTexCoord2f(1.f, 1.f); glVertex2f(1., -1.+hoff);
glTexCoord2f(0.f, 1.f); glVertex2f(-1., -1.+hoff);
glEnd();
SDL_GL_SwapWindow(fb.window);
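One thing worth trying instead of draining the whole queue with clFinish() is to wait on just the release command via its event. This is only a sketch built from the code above, and whether it actually avoids the busy-wait depends on how the driver implements event waits:

```c
/* Sketch: wait only for the release of the texture, not for the entire
   command queue. Uses the same fb.clctx.command_queue and fb.cl_srgb
   as the code above; error handling omitted. */
cl_event release_done;
clEnqueueReleaseGLObjects(fb.clctx.command_queue, 1, &fb.cl_srgb,
                          0, NULL, &release_done);
clFlush(fb.clctx.command_queue);      /* submit the pending commands */
clWaitForEvents(1, &release_done);    /* block until the release completes */
clReleaseEvent(release_done);
```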
It's hard for me to tell what is causing it because the high CPU usage is in another thread, run by nvopencl64.dll (when I run it on my Windows 10 machine with an nVidia GPU; I have a similar problem on a laptop with an Intel iGPU, also on Windows 10).
Profiling tells me that most of the CPU time is taken by WaitForSingleObjectEx (42% exclusive CPU time) called from nvopencl64.dll, by WaitForMultipleObjects (21%) called from nvoglv64.dll's DrvPresentBuffers, and by the RtlUserThreadStart (16%) calls that originate the aforementioned WaitForMultipleObjects calls. That's for my nVidia GPU machine, but the situation looks pretty similar on a machine with only an Intel HD 5000 iGPU. So there's clearly something very inefficient going on, probably with lots of threads being started way too often.
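A minimal sketch of how to check whether the implementation expects the application to manage GL/CL synchronisation itself (here `device` stands for your interop cl_device_id, which is an assumption; the surrounding code never names it):

```c
/* Sketch: query whether the OpenCL implementation prefers user-managed
   synchronisation between OpenGL and OpenCL. */
cl_bool user_sync = CL_FALSE;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_INTEROP_USER_SYNC,
                sizeof(user_sync), &user_sync, NULL);
/* CL_FALSE means the implementation prefers to synchronise GL and CL
   itself; CL_TRUE means the application is expected to do it. */
```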
Comments:

Does clGetDeviceInfo(CL_DEVICE_PREFERRED_INTEROP_USER_SYNC) return CL_FALSE? – hidefromkgb

Try dropping clEnqueueAcquireGLObjects and clEnqueueReleaseGLObjects and see what happens. On a side note, are you using clCreateFromGLTexture to pass GL textures to CL? – hidefromkgb

Keeping the clCreateFromGLTexture() call with a single clEnqueueAcquireGLObjects() call and no further interop calls fixes the problem, giving me only about 2% CPU usage (instead of 1 or 2 full cores) while still running at 60 FPS, which seems optimal. I guess that's what has to be done when CL_DEVICE_PREFERRED_INTEROP_USER_SYNC gives us 0. Since you hinted at that solution and there's a bounty, you should write the answer. Thank you for your help! – Michel Rouzic
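The fix described in the last comment could be sketched as follows. This is an assumption-laden reconstruction, not the asker's actual code: the field name fb.clctx.context and the CL_MEM_WRITE_ONLY flag are guesses, and error handling is omitted.

```c
/* One-time setup: create the CL image from the GL texture and acquire it
   once, instead of acquiring/releasing it every frame. This relies on
   CL_DEVICE_PREFERRED_INTEROP_USER_SYNC being CL_FALSE, i.e. the driver
   synchronising GL and CL itself. */
cl_int err;
fb.cl_srgb = clCreateFromGLTexture(fb.clctx.context, CL_MEM_WRITE_ONLY,
                                   GL_TEXTURE_2D, 0, fb.gltex, &err);
clEnqueueAcquireGLObjects(fb.clctx.command_queue, 1, &fb.cl_srgb, 0, NULL, NULL);

/* Per frame: just enqueue the kernel and draw the textured quad;
   no acquire, release or clFinish() in the loop. */
```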