OpenCL AMD vs NVIDIA performance

Question

I implemented a simple kernel which is some sort of a convolution. I measured it on NVIDIA GT 240. It took 70 ms when written on CUDA and 100 ms when written on OpenCL. Ok, I thought, NVIDIA compiler is better optimized for CUDA (or I'm doing something wrong). I need to run it on AMD GPUs, so I migrated to AMD APP SDK. Exactly the same kernel code.

I made two tests and their results were discouraging for me: 200 ms at HD 6670 and 70 ms at HD 5850 (the same time as for GT 240 + CUDA). And I am very interested of the reasons of such strange behaviour.

All projects were built on VS2010 using settings from the sample projects of NVIDIA and AMD respectively.

Please, do not consider my post as NVIDIA advertisement. I fairly understand that HD 5850 is more powerful than GT 240. The only thing I wish to know is why such difference is and how to fix the problem.

Update. Below is the kernel code which looks for 6 equally sized template images in the base one. Every pixel of the base image is considered as a possible origin of one of the templates and is processed by a separate thread. The kernel compares R, G, B values of each pixel of the base image and of the template one, and if at least one difference exceeds diff parameter, the corresponding pixel is counted nonmatched. If the number of nonmatched pixels is less than maxNonmatchQt the corresponding template is hit.

__constant int tOffset = 8196; // one template size in memory (in bytes)
__kernel void matchImage6( __global unsigned char* image, // pointer to the base image
            int imgWidth, // base image width
            int imgHeight, // base image height
            int imgPitch, // base image pitch (in bytes)
            int imgBpp, // base image bytes (!) per pixel
            __constant unsigned char* templates, // pointer to the array of templates
            int tWidth, // templates width (the same for all)
            int tHeight, // templates height (the same for all)
            int tPitch, // templates pitch (in bytes, the same for all)
            int tBpp, // templates bytes (!) per pixel (the same for all)
            int diff, // max allowed difference of intensity
            int maxNonmatchQt, // max number of nonmatched pixels
            __global int* result, // results
                            ) {
int x0 = (int)get_global_id(0);
int y0 = (int)get_global_id(1);
if( x0 + tWidth > imgWidth || y0 + tHeight > imgHeight)
    return;
int nonmatchQt[] = {0, 0, 0, 0, 0, 0};
for( int y = 0; y < tHeight; y++) {
    int ind = y * tPitch;
    int baseImgInd = (y0 + y) * imgPitch + x0 * imgBpp;
    for( int x = 0; x < tWidth; x++) {
        unsigned char c0 = image[baseImgInd];
        unsigned char c1 = image[baseImgInd + 1];
        unsigned char c2 = image[baseImgInd + 2];
        for( int i = 0; i < 6; i++)
            if( abs( c0 - templates[i * tOffset + ind]) > diff || 
                            abs( c1 - templates[i * tOffset + ind + 1]) > diff || 
                            abs( c2 - templates[i * tOffset + ind + 2]) > diff)
                nonmatchQt[i]++;
        ind += tBpp;
        baseImgInd += imgBpp;
    }
    if( nonmatchQt[0] > maxNonmatchQt && nonmatchQt[1] > maxNonmatchQt && nonmatchQt[2] > maxNonmatchQt && nonmatchQt[3] > maxNonmatchQt && nonmatchQt[4] > maxNonmatchQt && nonmatchQt[5] > maxNonmatchQt)
        return;
}
for( int i = 0; i < 6; i++)
    if( nonmatchQt[i] < maxNonmatchQt) {
        unsigned int pos = atom_inc( &result[0]) * 3;
        result[pos + 1] = i;
        result[pos + 2] = x0;
        result[pos + 3] = y0;
    }
}

Kernel run configuration: Global work size = (1900, 1200) Local work size = (32, 8) for AMD and (32, 16) for NVIDIA.

Execution time: HD 5850 - 69 ms, HD 6670 - 200 ms, GT 240 - 100 ms.

Any remarks about my code are also highly appreciated.

There's nowhere near enough information here to answer this! Each of the NVidia and AMD cards are architecturally tricky beasts, and the performance you see on either depends a lot on the code; understanding the performance difference between the two is even trickier. Can you post your kernel and a driver? — Jonathan Dursi
What kind of algorithm are you running in your kernel? Memory access patterns? wavefront/warp size? Need more info to be able to advise. — mfa
How many threads are you launching? And are you vectorizing the array? — captain
thank you all for your answers. I will post my kernel code tomorrow. — AdelNick
What sort of ratio between maxNonmatchQt and the template size (tW*tH) should be expected? This info could help with optimization. Are you using 3 bytes per pixel, or 3*n bytes per pixel? I am assuming that each of the RGB channels are using the same number of bytes. Is this correct? Also, can you explain imgPitch and tPitch please? I can't figure it out entirely from the code alone. thanks — mfa

gudasergey gudasergey · Accepted Answer · 2013-10-27T17:48:51

The difference in execution times is caused by compilers. Your code can be easily vectorized. Consider image and templates as arrays of vector type char4 (forth coordinate of each char4 vector is always 0). Instead of 3 memory reads:

unsigned char c0 = image[baseImgInd];
unsigned char c1 = image[baseImgInd + 1];
unsigned char c2 = image[baseImgInd + 2];

use only one:

unsigned char4 c = image[baseImgInd];

Instead of bulky if:

    if( abs( c0 - templates[i * tOffset + ind]) > diff || 
               abs( c1 - templates[i * tOffset + ind + 1]) > diff || 
               abs( c2 - templates[i * tOffset + ind + 2]) > diff)
         nonmatchQt[i]++;

use fast:

    unsigned char4 t = templates[i * tOffset + ind];
    nonmatchQt[i] += any(abs_diff(c,t)>diff);

Thus you increase performance of your code up to 3 times (if compiler doesn't vectorize the code by itself). I suppose that AMD OpenCL compiler does not such vectorization and other optimizations. From my experience OpenCL on NVIDIA GPU usually can be made faster than CUDA, because it is more low-level.

OpenCL AMD vs NVIDIA performance

2 Answers