
The problem below is fixed in NVIDIA's new driver release 331.xx, currently available as a beta driver.

Thanks for all your comments!

I have a multi-platform application that does many fragment operations and GPGPU work on OpenGL textures. The application makes heavy use of GL/CL interop; each texture may be bound to an OpenCL image and manipulated using CL kernels.

The problem is that the application runs fast on AMD cards, on both Linux and Windows. On NVIDIA cards, it runs fast on Linux but very slowly on Windows 7. The problem seems to be enqueueAcquireGLObjects and enqueueReleaseGLObjects. I have created a minimal sample demonstrating the bad performance by simply:

  1. Creating 2 OpenGL textures (1600x1200 pixels, RGBA float)
  2. Creating 2 OpenCL images sharing the 2 textures
  3. Repeatedly (50 times) acquiring, releasing, and finishing

Results (mean time for executing one acquire, release, finish cycle):

  • AMD HD 6980, Linux: <0.1 ms
  • AMD HD 6980, Win7: <0.1 ms
  • NVIDIA GTX590, Linux: <0.1 ms
  • NVIDIA GTX590, Win7: 16.0 ms

I have tried several different NVIDIA drivers, from the older 295.73 to the current beta driver 326.80, all showing the same behaviour.

My question now is: is the NVIDIA driver seriously broken, or am I doing something wrong here? The code runs fast on Linux, so it can't be a general problem with NVIDIA's OpenCL support. The code runs fast on AMD+Windows, so it can't be a problem with my code being unoptimized for Windows. Optimizing the code by, for example, changing the CL images to read-only/write-only is pointless, since the performance hit is almost a factor of 30!

Below you can find the relevant code of my test case; I could provide the full source code, too.

relevant code for context creation

{ // initialize GLEW
  glewInit();
}

{ // initialize CL Context, sharing GL Context
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);
  cl_context_properties cps[] = {
             CL_GL_CONTEXT_KHR,(cl_context_properties)wglGetCurrentContext(),
             CL_WGL_HDC_KHR,(cl_context_properties)wglGetCurrentDC(),
             CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0]()),
             0};
  std::vector<cl::Device> devices;
  platforms[0].getDevices((cl_device_type)CL_DEVICE_TYPE_GPU, &devices);
  context_ = new cl::Context(devices, cps, NULL, this);
  queue_ = new cl::CommandQueue(*context_, devices[0]);
}
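(One of the comments below suggests the GL and CL devices might not be the same, which matters on a dual-GPU card like the GTX 590. As a hedged sketch, the cl_khr_gl_sharing extension provides clGetGLContextInfoKHR to query which CL device is driving the current GL context, so the queue could be created on that device instead of blindly using devices[0]. The cps array is the same properties list as above; error handling is omitted.)

#include <CL/cl.h>
#include <CL/cl_gl.h>

// Function pointer type for the extension entry point.
typedef CL_API_ENTRY cl_int (CL_API_CALL *clGetGLContextInfoKHR_fn)(
    const cl_context_properties *properties,
    cl_gl_context_info param_name,
    size_t param_value_size,
    void *param_value,
    size_t *param_value_size_ret);

cl_device_id interopDevice(cl_platform_id platform, cl_context_properties *cps) {
  // clGetGLContextInfoKHR lives behind an extension, so fetch its address first.
  clGetGLContextInfoKHR_fn getGLContextInfo =
      (clGetGLContextInfoKHR_fn)clGetExtensionFunctionAddressForPlatform(
          platform, "clGetGLContextInfoKHR");
  cl_device_id device = NULL;
  if (getGLContextInfo)
    getGLContextInfo(cps, CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                     sizeof(device), &device, NULL);
  return device;  // NULL if the query is unavailable
}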

relevant code for creating textures and sharing CL images

width_ = 1600;
height_ = 1200;

float *data = new float[ width_ * height_ * 4 ];

textures_.resize(2);
glGenTextures(2, textures_.data());

for (int i = 0; i < 2; i++) {
  glBindTexture(GL_TEXTURE_2D, textures_[i]);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
  // "data" holds uninitialized values; the contents do not matter for this example
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, width_,height_, 0, GL_RGBA, GL_FLOAT, data);
}

delete [] data;
{ // create shared CL Images
#ifdef CL_VERSION_1_2
  clImages_.push_back(cl::ImageGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[0]));
  clImages_.push_back(cl::ImageGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[1]));
#else
  clImages_.push_back(cl::Image2DGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[0]));
  clImages_.push_back(cl::Image2DGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[1]));
#endif
}

relevant code for one acquire, release, finish cycle

try {
  queue_->enqueueAcquireGLObjects( &clImages_ );
  queue_->enqueueReleaseGLObjects( &clImages_ );
  queue_->finish();
} catch (cl::Error &e) {
  std::cout << e.what() << std::endl;
}
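The mean over 50 iterations reported in the table above can be measured with a small timing harness like the following sketch. The CL calls are replaced by a placeholder callable here so the helper is self-contained; in the real test, the callable's body would be the acquire/release/finish block above.

```cpp
#include <chrono>
#include <numeric>
#include <vector>

// Run one cycle `iterations` times and return the mean wall-clock time in ms.
// In the real benchmark, `cycle` would execute:
//   queue_->enqueueAcquireGLObjects(&clImages_);
//   queue_->enqueueReleaseGLObjects(&clImages_);
//   queue_->finish();
template <typename F>
double meanMilliseconds(F cycle, int iterations) {
  std::vector<double> times;
  for (int i = 0; i < iterations; i++) {
    auto start = std::chrono::steady_clock::now();
    cycle();
    auto end = std::chrono::steady_clock::now();
    times.push_back(
        std::chrono::duration<double, std::milli>(end - start).count());
  }
  return std::accumulate(times.begin(), times.end(), 0.0) / times.size();
}
```

steady_clock is used rather than system_clock so the measurement is monotonic and unaffected by clock adjustments.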
I can boil it down even more: just enqueueAcquireGLObjects and enqueueReleaseGLObjects alone result in 16 ms on Win7. I have changed the question accordingly. - user2725937
Maybe the GL device is not the same as the CL device, which is causing a copy overhead? Just guessing... - DarkZeros
I assume that if you use OpenGL, you display something on the screen... For your test, were you still displaying something too? - CaptainObvious
For the test, I am not rendering anything at all; GL context creation and texture allocation are the only calls to GL. The GL and CL devices could be different, as the GTX590 is a dual-chip card, yes. But I tried with GeForce cards ranging from the 460 to the 680. - user2725937
The problem is fixed in the latest beta drivers 331.xx! Thanks for all your comments! - user2725937

1 Answer


I'm going to assume that, since you are using OpenGL, you display something on the screen after the OCL computation.

Based on that assumption, my first thought would be to check in the NVIDIA control panel whether VSync is enabled and, if so, to disable it and retest.

As far as I recall, the default vsync settings differ between AMD and NVIDIA, which would explain the difference between the two GPUs.

Just in case, here is a post that explains how vsync can slow down rendering.
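(Vsync can also be disabled from code on Windows via the WGL_EXT_swap_control extension, so the test does not depend on the control panel setting. This is a hedged sketch: a GL context must be current on the calling thread, and real code should check the extension string before use.)

#include <windows.h>
#include <GL/gl.h>

// wglSwapIntervalEXT comes from WGL_EXT_swap_control and must be loaded at runtime.
typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int interval);

void disableVsync() {
  PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
      (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
  if (wglSwapIntervalEXT)
    wglSwapIntervalEXT(0);  // 0 = present immediately, no vsync wait
}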