2 votes

I am new to CUDA programming and I am trying out RGB-to-greyscale conversion, but I can't figure out how to choose the block size and grid size. I came across this piece of code and it executed properly, but I can't understand how the gridSize was chosen. I am using a Tegra TK1 GPU, which has:

  1. 1 MP, 192 CUDA cores/MP.
  2. Maximum number of threads per block = 1024.
  3. Maximum number of resident warps per MP = 64.
  4. Maximum dimension size of a thread block = (1024, 1024, 64).
  5. Maximum dimensions of a grid = (2147483647, 65535, 65535).

My doubts are:

  1. How do I determine the block size and grid size?
  2. If I change the block size from (16,16,1) to (32,32,1), the execution time increases. Why is that?

Can you also give links to any good papers/books related to this? Thank you in advance.

Here is the code:

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; //Column
    int j = blockIdx.y * blockDim.y + threadIdx.y; //Row

    int idx = j * numCols + i;

    if(i>=numCols || j>=numRows) return;

    float channelSum = .299f * rgbaImage[idx].x + .587f * rgbaImage[idx].y + .114f * rgbaImage[idx].z;
    greyImage[idx] = channelSum;
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage, unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    const dim3 blockSize(16, 16, 1);
    const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x, (numRows + blockSize.y - 1) / blockSize.y, 1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}

EDIT: The code I used before the one above, to map a 2D array to a grid of blocks in CUDA, is:

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int col = threadIdx.x;
    int row = blockIdx.x;
    int idx = col+row*numCols;
    int R = rgbaImage[idx].x;
    int G = rgbaImage[idx].y;
    int B = rgbaImage[idx].z;
    greyImage[idx] = 0.299f*R + 0.587f*G + 0.114f*B;
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{

    const dim3 blockSize( numCols, 1, 1);
    const dim3 gridSize( numRows, 1, 1);
    rgba_to_greyscale<<<gridSize,blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());

}

I understood the mistake in this code. The mistake here is that if numRows or numCols is more than 1024, it shows an error, because the maximum number of threads per block is 1024. So I can use a maximum of 1024*1024 pixels; if an image has more pixels than that, I can't use this. Now I have got the correct output with the first code (the topmost one), but I can't understand the logic behind it.
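To make my question concrete, here is the arithmetic the first code seems to do for a hypothetical 1920 x 1080 image (the image size and variable names are only an example, not from my actual code):

#include <cstdio>

int main()
{
    // Hypothetical image size, chosen only for illustration.
    const int numCols = 1920;
    const int numRows = 1080;

    // Same ceiling division as in the first code: enough 16x16 blocks
    // to cover every pixel, even when the image dimensions are not
    // multiples of the block dimensions.
    const int blockX = 16, blockY = 16;
    const int gridX = (numCols + blockX - 1) / blockX;  // 120
    const int gridY = (numRows + blockY - 1) / blockY;  // 68 (1080/16 = 67.5, rounded up)

    // The grid launches slightly more threads than there are pixels; the
    // kernel's bounds check (i >= numCols || j >= numRows) makes the
    // extra threads return without doing any work.
    printf("grid = (%d, %d), threads launched = %d, pixels = %d\n",
           gridX, gridY, gridX * blockX * gridY * blockY, numCols * numRows);
    return 0;
}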

There are a lot of points that need to be explained in order to try to give you an answer. Note that your problem is not related to RGB-to-grey conversion but to how to map a 2D array to a grid of blocks in CUDA. You should start by learning the basics; Google will show you a bunch of quality documents, blogs and books. That said, you have reached a limiting factor by increasing the block size to 32x32 (starting from the beginning will help you understand this comment ;). Finally, the code is not enough to reproduce the behaviour in your question, and we do not recommend books on SO. Despite that, welcome. – pQB
I know that the title is not proper; it was a mistake. I have been going through the tutorials from Udacity and have also seen many other documents on this. I have understood how to use the block size and grid size, but I don't understand why this particular gridSize was used in this example. When I tried implementing a different code, it worked for images with fewer pixels, so while searching for my mistake I found the above-mentioned piece of code here: stackoverflow.com/questions/17119198/… – zwarrior
So, part of your doubts got solved in the aforementioned question/answer. – pQB
The first code I mentioned, I got from one of the answers in the above link, and it works perfectly. The second code is my own attempt, and it isn't correct. What I understood is this: consider, for example, an image with 1000 pixels. If we allocate fewer than 1000 threads, the conversion won't complete, because each thread converts one pixel. If we assign more than 1000 threads, the conversion completes, but the time taken is longer. So we should assign slightly more than 1000 threads. Am I wrong in this assumption? – zwarrior

1 Answer

2 votes

In the technical specification for CUDA devices with compute capability 3.2, such as the Tegra TK1, we can see some limiting factors related to the performance results you have described. See, for example:

Maximum number of threads per multiprocessor: 2048

Maximum number of threads per block: 1024

Maximum number of resident blocks per multiprocessor: 16

Maximum number of resident warps per multiprocessor: 64

Let us assume there is no limiting factor except the maximum number of threads (the kernel does not use shared memory, and I think the number of registers per thread will be fewer than 63).

Then, with a block of 16 x 16 threads, that is, 256 threads or 8 warps, we have a maximum of 8 concurrent blocks per SM (limited by the maximum number of resident warps per SM). If you change the block size to 32 x 32 (1024 threads or 32 warps), the maximum number of concurrent blocks will be 2. That is probably the main reason why the execution time is longer with the second configuration.
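As a quick sanity check of this arithmetic, here is a small host-side sketch that derives the number of resident blocks per SM from the compute capability 3.2 limits listed above (the constants are hard-coded from that list, not queried from the device):

#include <algorithm>
#include <cstdio>

int main()
{
    // Limits for compute capability 3.2, taken from the list above.
    const int maxThreadsPerSM = 2048;
    const int maxWarpsPerSM   = 64;
    const int maxBlocksPerSM  = 16;
    const int warpSize        = 32;

    const int blockSizes[] = {16 * 16, 32 * 32};  // 256 and 1024 threads per block
    for (int threadsPerBlock : blockSizes) {
        int warpsPerBlock = threadsPerBlock / warpSize;
        // Resident blocks per SM = the most restrictive of the three limits.
        int residentBlocks = std::min({maxThreadsPerSM / threadsPerBlock,
                                       maxWarpsPerSM / warpsPerBlock,
                                       maxBlocksPerSM});
        printf("%4d threads/block -> %2d warps/block, %2d resident blocks per SM\n",
               threadsPerBlock, warpsPerBlock, residentBlocks);
    }
    return 0;
}

For 16 x 16 blocks this prints 8 resident blocks per SM, and for 32 x 32 blocks it prints 2, matching the figures above.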

Choosing the best block size is usually a bit tricky and relies partly on trial and error. By default, I always start by maximizing occupancy, and then try other configurations.
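If you are on CUDA 6.5 or later, the runtime also provides cudaOccupancyMaxPotentialBlockSize, which can give you a starting point for that trial and error. Below is a minimal sketch of how it could be used with the kernel from the question; note that the API suggests a 1D block size, and turning that suggestion into a roughly square 2D block is my own heuristic, not something the API prescribes:

#include <cuda_runtime.h>
#include <cstdio>

// Kernel from the question, repeated here so the sketch is self-contained.
__global__ void rgba_to_greyscale(const uchar4* const rgbaImage,
                                  unsigned char* const greyImage,
                                  int numRows, int numCols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (i >= numCols || j >= numRows) return;
    int idx = j * numCols + i;
    greyImage[idx] = .299f * rgbaImage[idx].x + .587f * rgbaImage[idx].y
                   + .114f * rgbaImage[idx].z;
}

void launch_greyscale(const uchar4* d_rgbaImage, unsigned char* d_greyImage,
                      int numRows, int numCols)
{
    int minGridSize = 0;  // minimum grid size needed for full occupancy (1D)
    int blockSize   = 0;  // suggested threads per block (1D)
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       rgba_to_greyscale, 0, 0);

    // Heuristic: turn the 1D suggestion into a roughly square 2D block,
    // e.g. 1024 -> 32 x 32, 256 -> 16 x 16.
    int blockX = 1;
    while (blockX * blockX < blockSize) blockX <<= 1;
    dim3 block(blockX, blockSize / blockX, 1);
    dim3 grid((numCols + block.x - 1) / block.x,
              (numRows + block.y - 1) / block.y, 1);

    printf("suggested block: %u x %u\n", block.x, block.y);
    rgba_to_greyscale<<<grid, block>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
}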