
I have a sequential smoothing algorithm

void triangularSmooth(unsigned char *grayImage, unsigned char *smoothImage, const int width, const int height, const float *filter, NSTimer &timer, dim3 grid_size, dim3 block_size) {
    for ( int y = 0; y < height; y++ ) {
        for ( int x = 0; x < width; x++ ) {
            unsigned int filterItem = 0;
            float filterSum = 0.0f;
            float smoothPix = 0.0f;

            for ( int fy = y - 2; fy < y + 3; fy++ ) {
                for ( int fx = x - 2; fx < x + 3; fx++ ) {
                    if ( ((fy < 0) || (fy >= height)) || ((fx < 0) || (fx >= width)) ) {
                        filterItem++;
                        continue;
                    }

                    smoothPix += grayImage[(fy * width) + fx] * filter[filterItem];
                    filterSum += filter[filterItem];
                    filterItem++;
                }
            }

            smoothPix /= filterSum;
            smoothImage[(y * width) + x] = static_cast< unsigned char >(smoothPix);
        }
    }
}

I am implementing this in CUDA and wish to use shared memory to hold the pixels of grayImage. But before that, I'm trying to run it as-is. To this end I have the kernel code:

__global__ void smooth(unsigned char *grayImage, unsigned char *smoothImage, const int width, const int height, const float *filter)
{

        int x = blockIdx.x*blockDim.x + threadIdx.x;
        int y = blockIdx.y*blockDim.y + threadIdx.y;

        unsigned int filterItem = 0;
        float filterSum = 0.0f;
        float smoothPix = 0.0f;

        for ( int fy = y - 2; fy < y + 3; fy++ ) {
            for ( int fx = x - 2; fx < x + 3; fx++ ) {
                if ( ((fy < 0) || (fy >= height)) || ((fx < 0) || (fx >= width)) ) {
                    filterItem++;
                    continue;
                }

                smoothPix += grayImage[(fy * width) + fx] * filter[filterItem];
                filterSum += filter[filterItem];
                filterItem++;
            }
        }
        smoothPix /= filterSum;
        smoothImage[(y * width) + x] = static_cast< unsigned char >(smoothPix);
}

And calling with:

const float filter[] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 1.0f, 1.0f, 2.0f, 3.0f, 2.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};
dim3 gridSize((width*height)/1024,(width*height)/1024,1);
dim3 blockSize(256,256,1);
smooth <<< gridSize, blockSize >>> (grayImage, smoothImage, width, height, filter);
cudaDeviceSynchronize();

The problem is that the resulting smoothed image looks as if the pixels are all in the wrong order (mixed up). Is this caused by the dimensions of the grid and block? I've tried a LOT of other possible dimensions. What would be the right way?

I'm using a GTX480 (compute capability 2.x): maximum dimensionality of a grid of thread blocks is 3, maximum x-, y-, or z-dimension of a grid of thread blocks is 65535, and maximum number of threads per block is 1024.

Your kernel is never running because the block size is illegal. If you add some error checking to your code you will see the kernel launch fails with an invalid configuration error. — talonmies

I have a cudaGetLastError(); immediately after cudaDeviceSynchronize(); and it returns no errors. — Francis Saa-Dittoh

Please read this question and answer for the correct way to check for errors during a kernel launch. Note that in your question you have said your GPU has a limit of 1024 threads per block, and you are asking for 256*256*1 threads per block... — talonmies

I was not checking properly; you are right! However, I also get an error with (16,16,1) or (32,32,1). — Francis Saa-Dittoh

So that this question isn't a complete waste of everyone's time, please write your solution into an answer. You will later be able to accept that answer, which marks the question as answered. — talonmies

2 Answers

1 vote

First, the dimensions are totally invalid. The following should work in this case:

dim3 blockSize(16, 16, 1);
dim3 gridSize((width + blockSize.x - 1) / blockSize.x, (height + blockSize.y - 1) / blockSize.y, 1);
smooth <<< gridSize, blockSize >>> (grayImage, smoothImage, width, height, filter);

After the correction, running cuda-memcheck yielded results similar to:

========= Invalid __global__ read of size 4
=========     at 0x00000120 in cudaFilter
=========     by thread (4,1,0) in block (1,0,0)
=========     Address 0x05100190 is out of bounds

This shows that a value within the kernel code is out of bounds (most probably an array index). Checking the various variables led to the discovery that filter[] pointed to host memory and held nothing valid on the device.

Lastly, if filter[] is to be passed to the kernel, it should first be allocated on the device and the coefficients copied from CPU to GPU, using something like

float *filterGpu = NULL;
cudaMalloc(&filterGpu, 25 * sizeof(float));
cudaMemcpy(filterGpu, filter, 25 * sizeof(float), cudaMemcpyHostToDevice);

Alternatively, if the filter is not needed anywhere else (as is the case here), it can be declared within the kernel instead.
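A minimal sketch of that alternative (the kernel name and the per-thread guard are mine, not from the question; the guard also fixes the out-of-range accesses discussed below):

```cuda
__global__ void smooth(unsigned char *grayImage, unsigned char *smoothImage,
                       const int width, const int height)
{
    // Filter declared inside the kernel: no host-to-device copy needed.
    const float filter[25] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
                              1.0f, 2.0f, 2.0f, 2.0f, 1.0f,
                              1.0f, 2.0f, 3.0f, 2.0f, 1.0f,
                              1.0f, 2.0f, 2.0f, 2.0f, 1.0f,
                              1.0f, 1.0f, 1.0f, 1.0f, 1.0f};

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)   // guard threads past the image edge
        return;

    unsigned int filterItem = 0;
    float filterSum = 0.0f;
    float smoothPix = 0.0f;

    for (int fy = y - 2; fy < y + 3; fy++) {
        for (int fx = x - 2; fx < x + 3; fx++) {
            if (fy < 0 || fy >= height || fx < 0 || fx >= width) {
                filterItem++;
                continue;
            }
            smoothPix += grayImage[(fy * width) + fx] * filter[filterItem];
            filterSum += filter[filterItem];
            filterItem++;
        }
    }

    smoothImage[(y * width) + x] = static_cast<unsigned char>(smoothPix / filterSum);
}
```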

1
votes

Looking at this answer related to image filtering, I would recommend that you create the block and grid for the image like this:

dim3 blockSize(16,16,1);
dim3 gridSize((width + blockSize.x - 1)/blockSize.x,(height + blockSize.y - 1)/blockSize.y,1);

Another very common mistake you are making is that the filter array you are passing to the kernel is allocated on the host. Create an array of the same size on the device, copy the coefficients from host to device, and pass that device array to the kernel.
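A host-side sketch of that flow, with launch error checking added (it assumes the filter, image pointers, and gridSize/blockSize from the question are in scope; filterGpu is an illustrative name):

```cuda
float *filterGpu = NULL;
cudaMalloc(&filterGpu, 25 * sizeof(float));   // device-side copy of the 5x5 filter
cudaMemcpy(filterGpu, filter, 25 * sizeof(float), cudaMemcpyHostToDevice);

smooth<<<gridSize, blockSize>>>(grayImage, smoothImage, width, height, filterGpu);

cudaError_t err = cudaGetLastError();         // catches invalid launch configurations
if (err != cudaSuccess)
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
cudaDeviceSynchronize();

cudaFree(filterGpu);
```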

Also, it is highly recommended to calculate the sum of the filter coefficients on the host and pass it as an argument to the kernel, instead of recalculating the sum in every thread.

The boundary conditions may cause out-of-range memory accesses. Handle the boundary conditions explicitly in the kernel. An easier approach is to use a CUDA texture for the input image, so that the boundary conditions are handled automatically.
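Concretely, the missing explicit check is a per-thread guard at the top of the kernel (a sketch; the question's kernel already tests the 5x5 window against the image bounds, but never the thread's own coordinates):

```cuda
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

// Threads in the grid's edge blocks can land past the image; without this
// guard they read and write out of range, which is what cuda-memcheck reports.
if (x >= width || y >= height)
    return;
```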