18 votes

I am using CUDA 6.0 and the OpenCL implementation that comes bundled with the CUDA SDK. I have two identical kernels, one for each platform (they differ only in platform-specific keywords). They only read and write global memory, each thread at a different location. The launch configuration for CUDA is 200 blocks of 250 threads (1D), which corresponds directly to the OpenCL configuration: a global work size of 50,000 and a local work size of 250.
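
For concreteness, a minimal sketch of what such an equivalent pair of launch configurations looks like; the kernel body and names here are hypothetical stand-ins, not my actual code:

    // Hypothetical stand-in for the kernel under test: each thread reads
    // and writes one distinct global-memory location.
    __global__ void copyKernel(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // CUDA launch: 200 blocks x 250 threads = 50,000 threads in total.
    //   copyKernel<<<200, 250>>>(d_in, d_out);
    //
    // The equivalent OpenCL enqueue on the host side:
    //   size_t global = 50000, local = 250;
    //   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
    //                          0, NULL, NULL);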

The OpenCL code runs faster. Is this possible, or am I timing it wrong? My understanding is that NVIDIA's OpenCL implementation is built on top of its CUDA implementation, yet I get around 15% better performance with OpenCL.

It would be great if you could suggest why I might be seeing this, and perhaps point out some differences between CUDA and OpenCL as implemented by NVIDIA.

OpenCL and CUDA are completely different. They both use the same HW in the end, but just as with OpenGL and DirectX, one is not built on top of the other or vice versa. The main points supporting this: the libraries are different, the compilers are different, and the execution model is different as well. Some parts might be common, but most are not. - DarkZeros
If you are on a 64-bit platform, my first guess would be that the OpenCL kernel is benefiting from lower register pressure, since it can be 32-bit. If the OpenCL toolchain permits, you should decompile the two and compare the microcode. - ArchaeaSoftware
NVIDIA's OpenCL implementation is 32-bit and doesn't conform to the same function-call requirements as CUDA. CUDA runtime applications compile the kernel code with the same bitness as the application, so on a 64-bit platform, try compiling the CUDA application as a 32-bit application. Your use of double has nothing to do with the bitness of the application or the kernel code. It is possible to get the PTX code from an OpenCL kernel so you can compare it against the CUDA code (see the first sketch after these comments). At this time you cannot get the SASS code for OpenCL kernels. - Greg Smith
Are the numeric answers you get with OpenCL and CUDA identical? If not, the kernels aren't doing the same computations. - Tim Child
I know that this question is already a year old, and I am probably pointing out the obvious, but in case you are using the CUDA runtime API, beware that CUDA has to initialize the driver before running your kernel code, which may skew your timings compared to OpenCL. Try running some dummy iterations of your kernel before doing the timing (see the second sketch below). - jcxz
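
Following up on Greg Smith's comment, a sketch of how the PTX might be extracted from an OpenCL program on NVIDIA's implementation, where CL_PROGRAM_BINARIES returns PTX text (assumes a single-device program; error checking omitted):

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Dump the program binary; on NVIDIA's OpenCL this is PTX text, which
    // can then be diffed against the output of `nvcc -ptx kernel.cu`.
    void dumpPtx(cl_program program)
    {
        size_t size = 0;
        clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                         sizeof(size), &size, NULL);

        unsigned char *bin = (unsigned char *)malloc(size);
        unsigned char *bins[1] = { bin };
        clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                         sizeof(bins), bins, NULL);

        fwrite(bin, 1, size, stdout);   // PTX is plain text on NVIDIA
        free(bin);
    }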
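
And for jcxz's point about driver initialization, a minimal warm-up-then-time sketch using CUDA events, reusing the hypothetical copyKernel and launch dimensions from above:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void copyKernel(const float *in, float *out);  // from the sketch above

    void timeKernel(const float *d_in, float *d_out)
    {
        // Warm-up launch: absorbs one-time context/driver initialization
        // cost so it doesn't pollute the measurement.
        copyKernel<<<200, 250>>>(d_in, d_out);
        cudaDeviceSynchronize();

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        copyKernel<<<200, 250>>>(d_in, d_out);   // the timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // milliseconds on the GPU clock
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }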

1 Answer

24 votes

Kernels executing on a modern GPU are almost never compute bound; they are almost always memory-bandwidth bound, because there are so many compute cores relative to the available paths to memory.

This means that the performance of a given kernel usually depends largely on the memory access patterns exhibited by the algorithm.
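
As an illustration, here are two hypothetical kernels that move the same amount of data with very different access patterns; on NVIDIA hardware the coalesced version typically needs far fewer memory transactions:

    // Consecutive threads touch consecutive addresses: the accesses
    // coalesce into a small number of wide memory transactions.
    __global__ void coalescedRead(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Consecutive threads touch addresses `stride` elements apart: each
    // warp scatters its loads across many transactions, wasting bandwidth.
    __global__ void stridedRead(const float *in, float *out, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * stride];
    }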

In practice this makes it very difficult to predict (or even understand) what performance to expect ahead of time.

The differences you observed are likely due to subtle differences in the memory access patterns between the two kernels, resulting from different optimizations made by the OpenCL and CUDA toolchains.

To learn how to optimize your GPU kernels, it pays to learn the details of the memory caching hardware available to you and how to use it to best advantage (e.g., making strategic use of "local" memory vs. always going directly to "global" memory in OpenCL).
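
To make that concrete, a hedged sketch in CUDA terms, where "shared" memory is CUDA's analogue of OpenCL "local" memory: a block stages a tile of data once and then reuses it, instead of re-reading global memory. The 3-point stencil and all names here are hypothetical:

    // 3-point stencil using a shared-memory tile. Each input element is
    // read from global memory once per block instead of up to three times.
    // Assumes a block size of 250, as in the question's launch configuration.
    __global__ void stencil3(const float *in, float *out, int n)
    {
        __shared__ float tile[250 + 2];            // block width plus halo
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x + 1] = (i < n) ? in[i] : 0.0f;
        if (threadIdx.x == 0)                      // left halo element
            tile[0] = (i > 0) ? in[i - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1)         // right halo element
            tile[blockDim.x + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
        __syncthreads();

        // These three reads hit fast shared memory, not global memory.
        if (i < n)
            out[i] = tile[threadIdx.x] + tile[threadIdx.x + 1]
                   + tile[threadIdx.x + 2];
    }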