The problem described below is fixed in NVIDIA's new driver release 331.xx, currently available as a beta driver.
Thanks for all your comments!
I have a multi-platform application that performs many fragment operations and GPGPU computations on OpenGL textures. The application makes heavy use of GL/CL interop: each texture may be bound to an OpenCL image and manipulated using CL kernels.
The application runs fast on AMD cards on both Linux and Windows. On NVIDIA cards, it runs fast on Linux but very slowly on Windows 7. The cause seems to be enqueueAcquireGLObjects and enqueueReleaseGLObjects. I have created a minimal sample that demonstrates the bad performance by simply:
- Creating 2 OpenGL textures (1600x1200 pixel, RGBA float)
- Creating 2 OpenCL images, sharing the 2 textures
- Repeatedly (50 times) running acquire, release, finish
Results (mean time for executing one acquire, release, finish cycle):
- AMD HD 6980, Linux: <0.1 ms
- AMD HD 6980, Win7: <0.1 ms
- NVIDIA GTX590, Linux: <0.1 ms
- NVIDIA GTX590, Win7: 16.0 ms
I have tried several different NVIDIA drivers, from the older 295.73 to the current beta driver 326.80, all showing the same behaviour.
My question now is: is the NVIDIA driver seriously broken, or am I doing something wrong here? The code runs fast on Linux, so it can't be a general problem with NVIDIA's support for OpenCL. The code runs fast on AMD + Windows, so it can't be a problem with my code not being optimized for Windows. Optimizing the code by, for example, changing the CL images to read-only or write-only seems pointless, since the performance hit is almost a factor of 30!
Below you can find the relevant code of my test case; I can provide the full source code, too.
Relevant code for context creation:
{ // initialize GLEW
  glewInit();
}
{ // initialize CL context, sharing the GL context
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);
  cl_context_properties cps[] = {
    CL_GL_CONTEXT_KHR, (cl_context_properties)wglGetCurrentContext(),
    CL_WGL_HDC_KHR, (cl_context_properties)wglGetCurrentDC(),
    CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0]()),
    0};
  std::vector<cl::Device> devices;
  platforms[0].getDevices((cl_device_type)CL_DEVICE_TYPE_GPU, &devices);
  context_ = new cl::Context(devices, cps, NULL, this);
  queue_ = new cl::CommandQueue(*context_, devices[0]);
}
Relevant code for creating textures and sharing CL images:
width_ = 1600;
height_ = 1200;
float *data = new float[width_ * height_ * 4];
textures_.resize(2);
glGenTextures(2, textures_.data());
for (int i = 0; i < 2; i++) {
  glBindTexture(GL_TEXTURE_2D, textures_[i]);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
  // "data" holds uninitialized values; the contents do not matter in this example
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, width_, height_, 0, GL_RGBA, GL_FLOAT, data);
}
delete[] data; // allocated with new[], so it must be released with delete[]
{ // create shared CL images
#ifdef CL_VERSION_1_2
  clImages_.push_back(cl::ImageGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[0]));
  clImages_.push_back(cl::ImageGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[1]));
#else
  clImages_.push_back(cl::Image2DGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[0]));
  clImages_.push_back(cl::Image2DGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[1]));
#endif
}
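The host buffer in the texture setup only has to live until the glTexImage2D calls return. As a minimal sketch, assuming nothing beyond the C++ standard library, a std::vector sized from the texture dimensions can own that staging memory, so no manual release is needed. TextureExtent, element_count, byte_size and make_staging_buffer are hypothetical names introduced only for illustration; they are not part of my test case.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type describing one texture; not part of the test case.
struct TextureExtent {
    std::size_t width;
    std::size_t height;
    std::size_t channels;  // 4 for RGBA
};

// Number of float elements needed for one texture of the given extent.
inline std::size_t element_count(const TextureExtent& e) {
    return e.width * e.height * e.channels;
}

// Size in bytes of one float texture of the given extent.
inline std::size_t byte_size(const TextureExtent& e) {
    return element_count(e) * sizeof(float);
}

// Zero-initialized staging buffer; freed automatically when it goes out of scope.
inline std::vector<float> make_staging_buffer(const TextureExtent& e) {
    return std::vector<float>(element_count(e));
}
```

With TextureExtent{1600, 1200, 4} the buffer holds 7,680,000 floats, and the pointer returned by its data() member could be passed as the last argument of glTexImage2D in place of the raw pointer.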
Relevant code for one acquire, release, finish cycle:
try {
  queue_->enqueueAcquireGLObjects( &clImages_ );
  queue_->enqueueReleaseGLObjects( &clImages_ );
  queue_->finish();
} catch (cl::Error &e) {
  std::cout << e.what() << std::endl;
}
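The mean time per cycle reported in the results can be measured by wrapping the cycle in a wall-clock timer. The following is a minimal sketch, assuming std::chrono::steady_clock is precise enough at the sub-millisecond level; mean_cycle_ms is a hypothetical helper, and the callable passed to it stands in for the acquire, release, finish sequence.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs `cycle` `iterations` times and returns the
// mean wall-clock time per call in milliseconds (0.0 if iterations is 0).
template <typename Cycle>
double mean_cycle_ms(Cycle&& cycle, std::size_t iterations) {
    if (iterations == 0) return 0.0;
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        // The callable should block until its work is done,
        // e.g. by ending with queue_->finish().
        cycle();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

In the test case this would be called with a lambda containing the three calls from the cycle above and an iteration count of 50. Because the lambda ends with finish(), the measured time covers the actual execution of the enqueued commands and not just the cost of enqueueing them.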