I am new to CUDA programming and am trying out RGB to greyscale conversion. But I can't figure out how to choose the block size and grid size. I came across this piece of code and it executed properly, but I can't understand how the gridSize was chosen. I am using a Tegra TK1 GPU, which has:
- 1 MP, 192 CUDA cores/MP.
- Max number of threads/block = 1024.
- Max number of resident warps/MP = 64.
- Max dimension size of a thread block = (1024, 1024, 64).
- Max dimensions of a grid = (2147483647, 65535, 65535).
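(For reference, the same limits can be read at runtime with cudaGetDeviceProperties; a minimal sketch, assuming device 0 is the TK1:)

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed to be the TK1
    printf("MPs: %d, max threads/block: %d, max resident threads/MP: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerBlock,
           prop.maxThreadsPerMultiProcessor);
    printf("max block dims: (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims: (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}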
My questions are:
- How do I determine the block size and grid size?
- If I change the block size from (16,16,1) to (32,32,1), the time taken increases. Why is that?
Can you also give links to any good papers/books related to this? Thank you in advance.
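(Side note: the runtime API also has cudaOccupancyMaxPotentialBlockSize, which suggests a block size per kernel. A minimal sketch against the rgba_to_greyscale kernel below; the suggestion is a 1D thread count, so it would still need to be factored into a 2D shape:)

int minGridSize = 0, blockSize = 0;
// Ask the occupancy calculator for the block size that maximizes
// occupancy for this specific kernel (it may return e.g. 256).
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                   rgba_to_greyscale, 0, 0);
printf("suggested threads/block: %d\n", blockSize);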
Here is the code:
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row

    // Threads in the partial blocks along the right/bottom edges fall
    // outside the image and must not read or write anything.
    if (i >= numCols || j >= numRows)
        return;

    int idx = j * numCols + i;  // linear pixel index (row-major)

    // Standard luma weights for the R, G and B channels.
    float channelSum = .299f * rgbaImage[idx].x
                     + .587f * rgbaImage[idx].y
                     + .114f * rgbaImage[idx].z;
    greyImage[idx] = channelSum;
}
void your_rgba_to_greyscale(const uchar4* const h_rgbaImage, uchar4* const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    const dim3 blockSize(16, 16, 1);
    // Round up so that partial blocks cover the right/bottom image edges.
    const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
                        (numRows + blockSize.y - 1) / blockSize.y,
                        1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}
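As far as I can tell, the grid size is a rounded-up division so that partial blocks cover the image edges. For a hypothetical 1000x700 image with 16x16 blocks: gridSize.x = (1000 + 15) / 16 = 63 and gridSize.y = (700 + 15) / 16 = 44, so 1008x704 threads are launched and the if (i >= numCols || j >= numRows) check makes the extra ones exit. Is that the only reasoning behind it, or is there more to choosing these numbers?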
EDIT: The code I used before the one above, to map a 2D array to a grid of blocks in CUDA, was:
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int col = threadIdx.x;  // one thread per pixel within a row
    int row = blockIdx.x;   // one block per image row
    int idx = col + row * numCols;

    int R = rgbaImage[idx].x;
    int G = rgbaImage[idx].y;
    int B = rgbaImage[idx].z;
    greyImage[idx] = 0.299f * R + 0.587f * G + 0.114f * B;
}
void your_rgba_to_greyscale(const uchar4* const h_rgbaImage, uchar4* const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    // One block per row, one thread per column: only works while
    // numCols fits within the 1024 threads/block limit.
    const dim3 blockSize(numCols, 1, 1);
    const dim3 gridSize(numRows, 1, 1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}
I understood the mistake in this code: since blockSize.x is set to numCols, the launch fails whenever numCols is more than 1024, because the maximum number of threads per block is 1024. So this version cannot handle images wider than 1024 pixels. I have now got the correct output with the first code (the topmost one), but I can't understand the logic behind it.
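(A minimal sketch of how that failure shows up, assuming a hypothetical image 2000 pixels wide: the launch is rejected before the kernel ever runs, and cudaGetLastError reports an invalid configuration:)

// Hypothetical width of 2000 columns: blockSize.x = 2000 > 1024.
const dim3 blockSize(2000, 1, 1);
const dim3 gridSize(numRows, 1, 1);
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, 2000);
cudaError_t err = cudaGetLastError();
printf("%s\n", cudaGetErrorString(err));  // expected: "invalid configuration argument"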