
I have a sequential smoothing algorithm

void triangularSmooth(unsigned char *grayImage, unsigned char *smoothImage, const int width, const int height, const float *filter, NSTimer &timer, dim3 grid_size, dim3 block_size) {
    for ( int y = 0; y < height; y++ ) {
        for ( int x = 0; x < width; x++ ) {
            unsigned int filterItem = 0;
            float filterSum = 0.0f;
            float smoothPix = 0.0f;

            for ( int fy = y - 2; fy < y + 3; fy++ ) {
                for ( int fx = x - 2; fx < x + 3; fx++ ) {
                    if ( ((fy < 0) || (fy >= height)) || ((fx < 0) || (fx >= width)) ) {
                        filterItem++;
                        continue;
                    }

                    smoothPix += grayImage[(fy * width) + fx] * filter[filterItem];
                    filterSum += filter[filterItem];
                    filterItem++;
                }
            }

            smoothPix /= filterSum;
            smoothImage[(y * width) + x] = static_cast< unsigned char >(smoothPix);
        }
    }
}

I am implementing this in CUDA and wish to use shared memory to hold the pixels of grayImage. But before that, I'm trying to run it as-is. To this end I have the kernel code:

__global__ void smooth(unsigned char *grayImage, unsigned char *smoothImage, const int width, const int height, const float *filter)
{

        int x = blockIdx.x*blockDim.x + threadIdx.x;
        int y = blockIdx.y*blockDim.y + threadIdx.y;

        unsigned int filterItem = 0;
        float filterSum = 0.0f;
        float smoothPix = 0.0f;

        for ( int fy = y - 2; fy < y + 3; fy++ ) {
            for ( int fx = x - 2; fx < x + 3; fx++ ) {
                if ( ((fy < 0) || (fy >= height)) || ((fx < 0) || (fx >= width)) ) {
                    filterItem++;
                    continue;
                }

                smoothPix += grayImage[(fy * width) + fx] * filter[filterItem];
                filterSum += filter[filterItem];
                filterItem++;
            }
        }
        smoothPix /= filterSum;
        smoothImage[(y * width) + x] = static_cast< unsigned char >(smoothPix);
}

And calling with:

const float filter[] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 1.0f, 1.0f, 2.0f, 3.0f, 2.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};
dim3 gridSize((width*height)/1024,(width*height)/1024,1);
dim3 blockSize(256,256,1);
smooth <<< gridSize, blockSize >>> (grayImage, smoothImage, width, height, filter);
cudaDeviceSynchronize();

The problem is that the resulting smoothed image looks as if the pixels are all in the wrong order (mixed up). Is this caused by the dimensions of the grid and block? I've tried a LOT of other possible dimensions. What would be the right way?

I'm using a GTX480 (compute capability 2.x): maximum dimensionality of a grid of thread blocks is 3, maximum x-, y-, or z-dimension of a grid of thread blocks is 65535, and maximum number of threads per block is 1024.

Your kernel is never running because the block size is illegal. If you add some error checking to your code you will see the kernel launch fails with an invalid configuration error. — talonmies

I have a cudaGetLastError(); immediately after cudaDeviceSynchronize(); and it returns no errors. — Francis Saa-Dittoh

Please read this question and answer for the correct way to check for errors during a kernel launch. Note that in your question you have said your GPU has a limit of 1024 threads per block, and you are asking for 256*256*1 threads per block... — talonmies

I was not checking properly; you are right! However, I also get an error with (16,16,1) or (32,32,1). — Francis Saa-Dittoh

So that this question isn't a complete waste of everyone's time, please write your solution into an answer. You will later be able to accept that answer, which marks the question as answered. — talonmies

2 Answers

1 vote

First, the dimensions are totally invalid. The following should work in this case:

dim3 blockSize(16, 16, 1);
dim3 gridSize((width + blockSize.x - 1) / blockSize.x, (height + blockSize.y - 1) / blockSize.y, 1);
smooth <<< gridSize, blockSize >>> (grayImage, smoothImage, width, height, filter);

After the correction, running cuda-memcheck yielded results similar to:

========= Invalid __global__ read of size 4
=========     at 0x00000120 in cudaFilter
=========     by thread (4,1,0) in block (1,0,0)
=========     Address 0x05100190 is out of bounds

This shows that a value within the kernel code is out of bounds (most probably an array index). Checking the various variables led to the discovery that filter[] pointed to host memory and held nothing valid on the device.

Lastly, if filter[] is to be passed to the kernel, it should first be allocated on the device and the coefficients copied from CPU to GPU, using something like

float *filterGpu = NULL;
cudaMalloc(&filterGpu, 25 * sizeof(float));
cudaMemcpy(filterGpu, filter, 25 * sizeof(float), cudaMemcpyHostToDevice);

Alternatively, if the filter is not needed anywhere else (as is the case here), it can be declared within the kernel instead.
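A minimal sketch of that alternative (the kernel name and the per-thread guard are mine, not from the question; the guard also fixes the out-of-range accesses discussed below):

```cuda
__global__ void smooth(unsigned char *grayImage, unsigned char *smoothImage,
                       const int width, const int height)
{
    // Filter declared inside the kernel: no host-to-device copy needed.
    const float filter[25] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
                              1.0f, 2.0f, 2.0f, 2.0f, 1.0f,
                              1.0f, 2.0f, 3.0f, 2.0f, 1.0f,
                              1.0f, 2.0f, 2.0f, 2.0f, 1.0f,
                              1.0f, 1.0f, 1.0f, 1.0f, 1.0f};

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)   // guard threads past the image edge
        return;

    unsigned int filterItem = 0;
    float filterSum = 0.0f;
    float smoothPix = 0.0f;

    for (int fy = y - 2; fy < y + 3; fy++) {
        for (int fx = x - 2; fx < x + 3; fx++) {
            if (fy < 0 || fy >= height || fx < 0 || fx >= width) {
                filterItem++;
                continue;
            }
            smoothPix += grayImage[(fy * width) + fx] * filter[filterItem];
            filterSum += filter[filterItem];
            filterItem++;
        }
    }

    smoothImage[(y * width) + x] = static_cast<unsigned char>(smoothPix / filterSum);
}
```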

1
votes

Looking at this answer related to image filtering, I would recommend that you create the block and grid for the image like this:

dim3 blockSize(16,16,1);
dim3 gridSize((width + blockSize.x - 1)/blockSize.x,(height + blockSize.y - 1)/blockSize.y,1);

Another very common mistake you are making is that the filter array you are passing to the kernel is allocated on the host. Create an array of the same size on the device, copy the coefficients from host to device, and pass that device array to the kernel.
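A host-side sketch of that flow, with launch error checking added (it assumes the filter, image pointers, and gridSize/blockSize from the question are in scope; filterGpu is an illustrative name):

```cuda
float *filterGpu = NULL;
cudaMalloc(&filterGpu, 25 * sizeof(float));   // device-side copy of the 5x5 filter
cudaMemcpy(filterGpu, filter, 25 * sizeof(float), cudaMemcpyHostToDevice);

smooth<<<gridSize, blockSize>>>(grayImage, smoothImage, width, height, filterGpu);

cudaError_t err = cudaGetLastError();         // catches invalid launch configurations
if (err != cudaSuccess)
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
cudaDeviceSynchronize();

cudaFree(filterGpu);
```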

Also, it is highly recommended to calculate the sum of the filter coefficients on the host and pass it as an argument to the kernel, instead of recalculating the sum in every thread.

The boundary conditions may cause out-of-range memory accesses. Handle the boundary conditions explicitly in the kernel. An easier approach is to use a CUDA texture for the input image, so that the boundary conditions are handled automatically.
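Concretely, the missing explicit check is a per-thread guard at the top of the kernel (a sketch; the question's kernel already tests the 5x5 window against the image bounds, but never the thread's own coordinates):

```cuda
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

// Threads in the grid's edge blocks can land past the image; without this
// guard they read and write out of range, which is what cuda-memcheck reports.
if (x >= width || y >= height)
    return;
```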