This is a classic case of embarrassingly parallel image processing problem that can be very easily mapped to CUDA framework. The averaging filter is knows as Box Filter in image processing domains.
The easiest approach would be to use CUDA textures for the filtering process as the boundary conditions can be handled very easily by textures.
Assuming you have source and destination pointers allocated on the host. The procedure would be something like this.
- Allocate large enough memory to hold the source and destination images on device.
- Copy source image from host to device.
- Bind the source image device pointer to texture.
- Specify an appropriate block size and a grid large enough to cover every pixel of the image.
- Launch the filtering kernel using the specified grid and block size.
- Copy the results back to host.
- Unbind the texture
- Free device pointers.
Sample Implementation Of Box Filter
Kernel
texture<unsigned char, cudaTextureType2D> tex8u;
//Box Filter Kernel For Gray scale image with 8bit depth
__global__ void box_filter_kernel_8u_c1(unsigned char* output,const int width, const int height, const size_t pitch, const int fWidth, const int fHeight)
{
int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
const int filter_offset_x = fWidth/2;
const int filter_offset_y = fHeight/2;
float output_value = 0.0f;
//Make sure the current thread is inside the image bounds
if(xIndex<width && yIndex<height)
{
//Sum the window pixels
for(int i= -filter_offset_x; i<=filter_offset_x; i++)
{
for(int j=-filter_offset_y; j<=filter_offset_y; j++)
{
//No need to worry about Out-Of-Range access. tex2D automatically handles it.
output_value += tex2D(tex8u,xIndex + i,yIndex + j);
}
}
//Average the output value
output_value /= (fWidth * fHeight);
//Write the averaged value to the output.
//Transform 2D index to 1D index, because image is actually in linear memory
int index = yIndex * pitch + xIndex;
output[index] = static_cast<unsigned char>(output_value);
}
}
Wrapper Function:
void box_filter_8u_c1(unsigned char* CPUinput, unsigned char* CPUoutput, const int width, const int height, const int widthStep, const int filterWidth, const int filterHeight)
{
/*
* 2D memory is allocated as strided linear memory on GPU.
* The terminologies "Pitch", "WidthStep", and "Stride" are exactly the same thing.
* It is the size of a row in bytes.
* It is not necessary that width = widthStep.
* Total bytes occupied by the image = widthStep x height.
*/
//Declare GPU pointer
unsigned char *GPU_input, *GPU_output;
//Allocate 2D memory on GPU. Also known as Pitch Linear Memory
size_t gpu_image_pitch = 0;
cudaMallocPitch<unsigned char>(&GPU_input,&gpu_image_pitch,width,height);
cudaMallocPitch<unsigned char>(&GPU_output,&gpu_image_pitch,width,height);
//Copy data from host to device.
cudaMemcpy2D(GPU_input,gpu_image_pitch,CPUinput,widthStep,width,height,cudaMemcpyHostToDevice);
//Bind the image to the texture. Now the kernel will read the input image through the texture cache.
//Use tex2D function to read the image
cudaBindTexture2D(NULL,tex8u,GPU_input,width,height,gpu_image_pitch);
/*
* Set the behavior of tex2D for out-of-range image reads.
* cudaAddressModeBorder = Read Zero
* cudaAddressModeClamp = Read the nearest border pixel
* We can skip this step. The default mode is Clamp.
*/
tex8u.addressMode[0] = tex8u.addressMode[1] = cudaAddressModeBorder;
/*
* Specify a block size. 256 threads per block are sufficient.
* It can be increased, but keep in mind the limitations of the GPU.
* Older GPUs allow maximum 512 threads per block.
* Current GPUs allow maximum 1024 threads per block
*/
dim3 block_size(16,16);
/*
* Specify the grid size for the GPU.
* Make it generalized, so that the size of grid changes according to the input image size
*/
dim3 grid_size;
grid_size.x = (width + block_size.x - 1)/block_size.x; /*< Greater than or equal to image width */
grid_size.y = (height + block_size.y - 1)/block_size.y; /*< Greater than or equal to image height */
//Launch the kernel
box_filter_kernel_8u_c1<<<grid_size,block_size>>>(GPU_output,width,height,gpu_image_pitch,filterWidth,filterHeight);
//Copy the results back to CPU
cudaMemcpy2D(CPUoutput,widthStep,GPU_output,gpu_image_pitch,width,height,cudaMemcpyDeviceToHost);
//Release the texture
cudaUnbindTexture(tex8u);
//Free GPU memory
cudaFree(GPU_input);
cudaFree(GPU_output);
}
The good news is that you don't have to implement the filter yourself. The CUDA Toolkit comes with free signal and image processing library named NVIDIA Performance Primitives aka NPP, made by NVIDIA. NPP utilizes CUDA enabled GPUs to accelerate processing. The averaging filter is already implemented in NPP. The current version of NPP (5.0) has support for 8 bit, 1 channel and 4 channel images.
The functions are:
nppiFilterBox_8u_C1R
for 1 channel image.
nppiFilterBox_8u_C4R
for 4 channel image.