I implemented a simple kernel which is some sort of a convolution. I measured it on NVIDIA GT 240. It took 70 ms when written on CUDA and 100 ms when written on OpenCL. Ok, I thought, NVIDIA compiler is better optimized for CUDA (or I'm doing something wrong). I need to run it on AMD GPUs, so I migrated to AMD APP SDK. Exactly the same kernel code.
I made two tests and their results were discouraging for me: 200 ms at HD 6670 and 70 ms at HD 5850 (the same time as for GT 240 + CUDA). And I am very interested of the reasons of such strange behaviour.
All projects were built on VS2010 using settings from the sample projects of NVIDIA and AMD respectively.
Please, do not consider my post as NVIDIA advertisement. I fairly understand that HD 5850 is more powerful than GT 240. The only thing I wish to know is why such difference is and how to fix the problem.
Update. Below is the kernel code which looks for 6 equally sized template images in the base one. Every pixel of the base image is considered as a possible origin of one of the templates and is processed by a separate thread. The kernel compares R, G, B values of each pixel of the base image and of the template one, and if at least one difference exceeds diff parameter, the corresponding pixel is counted nonmatched. If the number of nonmatched pixels is less than maxNonmatchQt the corresponding template is hit.
__constant int tOffset = 8196; // one template size in memory (in bytes)
__kernel void matchImage6( __global unsigned char* image, // pointer to the base image
int imgWidth, // base image width
int imgHeight, // base image height
int imgPitch, // base image pitch (in bytes)
int imgBpp, // base image bytes (!) per pixel
__constant unsigned char* templates, // pointer to the array of templates
int tWidth, // templates width (the same for all)
int tHeight, // templates height (the same for all)
int tPitch, // templates pitch (in bytes, the same for all)
int tBpp, // templates bytes (!) per pixel (the same for all)
int diff, // max allowed difference of intensity
int maxNonmatchQt, // max number of nonmatched pixels
__global int* result, // results
) {
int x0 = (int)get_global_id(0);
int y0 = (int)get_global_id(1);
if( x0 + tWidth > imgWidth || y0 + tHeight > imgHeight)
return;
int nonmatchQt[] = {0, 0, 0, 0, 0, 0};
for( int y = 0; y < tHeight; y++) {
int ind = y * tPitch;
int baseImgInd = (y0 + y) * imgPitch + x0 * imgBpp;
for( int x = 0; x < tWidth; x++) {
unsigned char c0 = image[baseImgInd];
unsigned char c1 = image[baseImgInd + 1];
unsigned char c2 = image[baseImgInd + 2];
for( int i = 0; i < 6; i++)
if( abs( c0 - templates[i * tOffset + ind]) > diff ||
abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
abs( c2 - templates[i * tOffset + ind + 2]) > diff)
nonmatchQt[i]++;
ind += tBpp;
baseImgInd += imgBpp;
}
if( nonmatchQt[0] > maxNonmatchQt && nonmatchQt[1] > maxNonmatchQt && nonmatchQt[2] > maxNonmatchQt && nonmatchQt[3] > maxNonmatchQt && nonmatchQt[4] > maxNonmatchQt && nonmatchQt[5] > maxNonmatchQt)
return;
}
for( int i = 0; i < 6; i++)
if( nonmatchQt[i] < maxNonmatchQt) {
unsigned int pos = atom_inc( &result[0]) * 3;
result[pos + 1] = i;
result[pos + 2] = x0;
result[pos + 3] = y0;
}
}
Kernel run configuration: Global work size = (1900, 1200) Local work size = (32, 8) for AMD and (32, 16) for NVIDIA.
Execution time: HD 5850 - 69 ms, HD 6670 - 200 ms, GT 240 - 100 ms.
Any remarks about my code are also highly appreciated.