I generate frames in OpenCL 60 times per second, using one OpenCL kernel call per frame, and write them to an OpenGL texture so that I can display them on the screen. There's no performance problem; the frame rate is as expected. The problem is that it's very wasteful: it keeps at least one CPU core fully busy, even when it has very little to do, like drawing a blank frame at a very low resolution. For comparison, when I skip the OpenGL interop and instead write from the CL kernel to a generic buffer, copy that buffer back to the host, and display it another way, the frame rate drops a bit (due to the back-and-forth overhead that the interop makes unnecessary), but CPU usage is much lower when there's little to do.
This means that there's something wrong with the way I do the interop that I assume must create some sort of busy wait.
Here's the relevant code, i.e. the code that is present when I use the interop and absent when I don't. At one point in my loop I clear the GL texture and make OpenCL acquire it:
uint32_t z = 0;
glClearTexImage(fb.gltex, 0, GL_RGBA, GL_UNSIGNED_BYTE, &z);   // clear to transparent black
glFinish();   // wait until GL is done with the texture (glFinish() implies a flush, so a separate glFlush() is redundant)
clEnqueueAcquireGLObjects(fb.clctx.command_queue, 1, &fb.cl_srgb, 0, NULL, NULL);
Then I enqueue the execution of my OpenCL kernel, which writes to the texture through the cl_mem object fb.cl_srgb, and later I give control back to OpenGL in order to display the texture on the screen:
clEnqueueReleaseGLObjects(fb.clctx.command_queue, 1, &fb.cl_srgb, 0, 0, NULL);
clFinish(fb.clctx.command_queue); // this blocks until the kernel is done writing to the texture and releasing the texture
// setting GL texture coordinates, probably not relevant to this question
float hoff = 2. * (fb.h - fb.maxdim.y) / (double) fb.maxdim.y;
glLoadIdentity(); // Reset the projection matrix
glViewport(0, 0, fb.maxdim.x, fb.maxdim.y);
glBegin(GL_QUADS);
glTexCoord2f(0.f, 0.f); glVertex2f(-1., 1.+hoff);
glTexCoord2f(1.f, 0.f); glVertex2f(1., 1.+hoff);
glTexCoord2f(1.f, 1.f); glVertex2f(1., -1.+hoff);
glTexCoord2f(0.f, 1.f); glVertex2f(-1., -1.+hoff);
glEnd();
SDL_GL_SwapWindow(fb.window);
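One thing worth trying instead of draining the whole queue with clFinish() is to wait on just the release command via its event. This is only a sketch built from the code above, and whether it actually avoids the busy-wait depends on how the driver implements event waits:

```c
/* Sketch: wait only for the release of the texture, not for the entire
   command queue. Uses the same fb.clctx.command_queue and fb.cl_srgb
   as the code above; error handling omitted. */
cl_event release_done;
clEnqueueReleaseGLObjects(fb.clctx.command_queue, 1, &fb.cl_srgb,
                          0, NULL, &release_done);
clFlush(fb.clctx.command_queue);      /* submit the pending commands */
clWaitForEvents(1, &release_done);    /* block until the release completes */
clReleaseEvent(release_done);
```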
It's hard for me to tell what is causing it because the high CPU usage is in another thread, run by nvopencl64.dll (when I run it on my Windows 10 machine with an nVidia GPU; I have a similar problem on a laptop with an Intel iGPU, also on Windows 10).
Profiling tells me that most of the CPU time is taken by WaitForSingleObjectEx (42% exclusive CPU time) called from nvopencl64.dll, by WaitForMultipleObjects (21%) called from nvoglv64.dll's DrvPresentBuffers, and by the RtlUserThreadStart (16%) calls that originate the aforementioned WaitForMultipleObjects calls. That's for my nVidia GPU machine, but the situation looks pretty similar on a machine with only an Intel HD 5000 iGPU. So there's clearly something very inefficient going on, probably with lots of threads being started way too often.
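A minimal sketch of how to check whether the implementation expects the application to manage GL/CL synchronisation itself (here `device` stands for your interop cl_device_id, which is an assumption; the surrounding code never names it):

```c
/* Sketch: query whether the OpenCL implementation prefers user-managed
   synchronisation between OpenGL and OpenCL. */
cl_bool user_sync = CL_FALSE;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_INTEROP_USER_SYNC,
                sizeof(user_sync), &user_sync, NULL);
/* CL_FALSE means the implementation prefers to synchronise GL and CL
   itself; CL_TRUE means the application is expected to do it. */
```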
Comments:

Does clGetDeviceInfo(CL_DEVICE_PREFERRED_INTEROP_USER_SYNC) return CL_FALSE? – hidefromkgb

Try dropping clEnqueueAcquireGLObjects and clEnqueueReleaseGLObjects and see what happens. On a side note, are you using clCreateFromGLTexture to pass GL textures to CL? – hidefromkgb

Keeping the clCreateFromGLTexture() call with a single clEnqueueAcquireGLObjects() call and no further interop calls fixes the problem, giving me only about 2% CPU usage (instead of 1 or 2 full cores) while still running at 60 FPS, which seems optimal. I guess that's what has to be done when CL_DEVICE_PREFERRED_INTEROP_USER_SYNC gives us 0. Since you hinted at that solution and there's a bounty, you should write the answer. Thank you for your help! – Michel Rouzic
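The fix described in the last comment could be sketched as follows. This is an assumption-laden reconstruction, not the asker's actual code: the field name fb.clctx.context and the CL_MEM_WRITE_ONLY flag are guesses, and error handling is omitted.

```c
/* One-time setup: create the CL image from the GL texture and acquire it
   once, instead of acquiring/releasing it every frame. This relies on
   CL_DEVICE_PREFERRED_INTEROP_USER_SYNC being CL_FALSE, i.e. the driver
   synchronising GL and CL itself. */
cl_int err;
fb.cl_srgb = clCreateFromGLTexture(fb.clctx.context, CL_MEM_WRITE_ONLY,
                                   GL_TEXTURE_2D, 0, fb.gltex, &err);
clEnqueueAcquireGLObjects(fb.clctx.command_queue, 1, &fb.cl_srgb, 0, NULL, NULL);

/* Per frame: just enqueue the kernel and draw the textured quad;
   no acquire, release or clFinish() in the loop. */
```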