I'm implementing an algorithm to convert an image to grayscale using CUDA. It works, but I'm looking for ways to improve performance. Currently the entire color image is transferred to device memory, after which each thread computes one gray pixel value by reading the corresponding three (r, g, b) color values.
I have already made sure that the global memory accesses are coalesced, but this barely improved performance (a 36 MB image took only 0.003 s less once the accesses were coalesced). Now I'm wondering whether using shared memory could improve performance. Here's what I have right now:
My CUDA kernel:
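// MAX_BLOCKS and nrThreads are defined elsewhere in my code.
__global__ void darkenImage(const unsigned char * inputImage,
                            unsigned char * outputImage,
                            const int width, const int height, int iteration){
    // Byte offset of this thread's pixel in the interleaved RGB input;
    // iteration shifts the window when the image needs more than one launch.
    int x = ((blockIdx.x * blockDim.x) + (threadIdx.x + (iteration * MAX_BLOCKS * nrThreads))) * 3;
    if(x+2 < (3 * width*height)){
        float grayPix = 0.0f;
        float r = static_cast< float >(inputImage[x]);
        float g = static_cast< float >(inputImage[x+1]);
        float b = static_cast< float >(inputImage[x+2]);
        // Weighted sum of the three channels gives the gray value.
        grayPix = __fadd_rn(__fadd_rn(__fmul_rn(0.3f, r),__fmul_rn(0.59f, g)), __fmul_rn(0.11f, b));
        // Darken by a factor of 0.6; adding 0.5 rounds to nearest on truncation.
        grayPix = fma(grayPix,0.6f,0.5f);
        outputImage[(x/3)] = static_cast< unsigned char >(grayPix);
    }
}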
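A minimal host-side sketch of how a kernel like this might be launched and timed, so that the two copies and the kernel can be measured separately with CUDA events. The values of MAX_BLOCKS and nrThreads and the helper name grayscaleTimed are assumptions made for illustration, the kernel is a condensed copy of the one above, and error checking is left out for brevity:

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed values: the post does not show how these are defined.
#define MAX_BLOCKS 65535
constexpr int nrThreads = 256;

// Condensed copy of the kernel from the post so the sketch compiles on its own.
__global__ void darkenImage(const unsigned char *inputImage, unsigned char *outputImage,
                            const int width, const int height, int iteration) {
    int x = ((blockIdx.x * blockDim.x) + (threadIdx.x + (iteration * MAX_BLOCKS * nrThreads))) * 3;
    if (x + 2 < (3 * width * height)) {
        float r = inputImage[x], g = inputImage[x + 1], b = inputImage[x + 2];
        float grayPix = __fadd_rn(__fadd_rn(__fmul_rn(0.3f, r), __fmul_rn(0.59f, g)),
                                  __fmul_rn(0.11f, b));
        outputImage[x / 3] = static_cast<unsigned char>(fmaf(grayPix, 0.6f, 0.5f));
    }
}

// Hypothetical helper: converts one interleaved RGB image and stores the elapsed
// milliseconds of host-to-device copy, kernel launches, and device-to-host copy
// in timesMs[0], timesMs[1], and timesMs[2].
std::vector<unsigned char> grayscaleTimed(const std::vector<unsigned char> &rgb,
                                          int width, int height, float timesMs[3]) {
    const size_t pixels = static_cast<size_t>(width) * height;
    std::vector<unsigned char> gray(pixels);

    unsigned char *devIn = nullptr, *devOut = nullptr;
    cudaMalloc(&devIn, 3 * pixels);
    cudaMalloc(&devOut, pixels);

    cudaEvent_t ev[4];
    for (auto &e : ev) cudaEventCreate(&e);

    cudaEventRecord(ev[0]);
    cudaMemcpy(devIn, rgb.data(), 3 * pixels, cudaMemcpyHostToDevice);
    cudaEventRecord(ev[1]);

    // One launch covers at most MAX_BLOCKS * nrThreads pixels;
    // iteration shifts the window for the next launch.
    const size_t perLaunch = static_cast<size_t>(MAX_BLOCKS) * nrThreads;
    int iteration = 0;
    for (size_t done = 0; done < pixels; done += perLaunch, ++iteration) {
        const size_t count = (pixels - done < perLaunch) ? pixels - done : perLaunch;
        const unsigned int blocks = static_cast<unsigned int>((count + nrThreads - 1) / nrThreads);
        darkenImage<<<blocks, nrThreads>>>(devIn, devOut, width, height, iteration);
    }
    cudaEventRecord(ev[2]);

    cudaMemcpy(gray.data(), devOut, pixels, cudaMemcpyDeviceToHost);
    cudaEventRecord(ev[3]);
    cudaEventSynchronize(ev[3]);  // wait until all recorded work has finished

    for (int i = 0; i < 3; ++i) cudaEventElapsedTime(&timesMs[i], ev[i], ev[i + 1]);

    for (auto &e : ev) cudaEventDestroy(e);
    cudaFree(devIn);
    cudaFree(devOut);
    return gray;
}
```

Calling grayscaleTimed(rgb, width, height, times) and printing the three entries of times shows how much of the total is spent in the copies compared with the kernel itself. If the copies dominate, tuning the kernel's memory access pattern can only shave off a small part of the runtime.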
My question really is: since no data is shared between any two threads, using shared memory shouldn't really help here, should it? Or have I misunderstood?
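For comparison, this is roughly what a shared-memory variant could look like: each block first copies its contiguous run of input bytes into shared memory, with consecutive threads reading consecutive bytes, and only then does every thread pick out its own three values. It is a sketch under assumptions, not a tested optimization: BLOCK_SIZE, the kernel name darkenImageShared, and the host helper grayscaleShared are hypothetical, the iteration offset is dropped for brevity, and whether it is actually faster would have to be measured.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed block size; the kernel must be launched with exactly this many threads per block.
constexpr int BLOCK_SIZE = 256;

__global__ void darkenImageShared(const unsigned char *inputImage,
                                  unsigned char *outputImage, int numPixels) {
    // Staging buffer for the 3 * BLOCK_SIZE input bytes this block needs.
    __shared__ unsigned char tile[3 * BLOCK_SIZE];

    // As in the original kernel, 3 * numPixels is assumed to fit in an int.
    const int firstPixel = blockIdx.x * BLOCK_SIZE;
    const int firstByte = 3 * firstPixel;
    const int totalBytes = 3 * numPixels;

    // Cooperative load: in each pass, thread t reads byte t of a contiguous
    // BLOCK_SIZE-byte chunk, so neighbouring threads touch neighbouring bytes.
    for (int i = threadIdx.x; i < 3 * BLOCK_SIZE; i += BLOCK_SIZE) {
        if (firstByte + i < totalBytes) {
            tile[i] = inputImage[firstByte + i];
        }
    }
    __syncthreads();  // every thread reaches this, including out-of-range ones

    const int pixel = firstPixel + threadIdx.x;
    if (pixel < numPixels) {
        float r = tile[3 * threadIdx.x];
        float g = tile[3 * threadIdx.x + 1];
        float b = tile[3 * threadIdx.x + 2];
        float grayPix = __fadd_rn(__fadd_rn(__fmul_rn(0.3f, r), __fmul_rn(0.59f, g)),
                                  __fmul_rn(0.11f, b));
        outputImage[pixel] = static_cast<unsigned char>(fmaf(grayPix, 0.6f, 0.5f));
    }
}

// Hypothetical host helper for trying the kernel; error checking omitted.
std::vector<unsigned char> grayscaleShared(const std::vector<unsigned char> &rgb, int numPixels) {
    std::vector<unsigned char> gray(numPixels);
    unsigned char *devIn = nullptr, *devOut = nullptr;
    cudaMalloc(&devIn, 3 * static_cast<size_t>(numPixels));
    cudaMalloc(&devOut, numPixels);
    cudaMemcpy(devIn, rgb.data(), 3 * static_cast<size_t>(numPixels), cudaMemcpyHostToDevice);

    const int blocks = (numPixels + BLOCK_SIZE - 1) / BLOCK_SIZE;
    darkenImageShared<<<blocks, BLOCK_SIZE>>>(devIn, devOut, numPixels);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(gray.data(), devOut, numPixels, cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
    return gray;
}
```

Here shared memory is not used to share results between threads. It only changes which thread issues which global load, since in the original kernel neighbouring threads read addresses three bytes apart.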
Regards,
Linus