0 votes

I am writing a CUDA program that works with OpenCV. I have an empty Mat of a given size (e.g. 1000x800), which I explicitly uploaded to a GpuMat with datatype CV_16SC3. I want to manipulate the image in this format inside the CUDA kernel. However, manipulating the Mat does not seem to work correctly.

I am calling my CUDA kernel as follows:

    my_kernel <<< gridDim, blockDim >>>( (unsigned short*)img.data, img.cols, img.rows, img.step);

and my sample kernel looks like this

__global__ void my_kernel( unsigned short* img, int width, int height, int img_step)
{
    int x, y, pixel;
    y = blockIdx.y * blockDim.y + threadIdx.y;
    x = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (y >= height)
        return;
    
    if (x >= width)
        return;

    pixel = (y * (img_step)) + (3 * x);
    
    img[pixel] = 255; //I know 255 is basically an uchar, this is just part of my test
    img[pixel+1] = 255;
    img[pixel+2] = 255;

}

I expected this small sample kernel to set all pixels to white. However, after downloading the Mat from the GPU and visualizing it with imshow, not all pixels are white, and some odd black lines are present, which makes me believe that I am somehow writing to invalid memory addresses.

My guess is the following. The OpenCV documentation states that cv::Mat::data returns an uchar pointer. However, my Mat has the data type "16U" (unsigned short, to my knowledge). That is why, in the kernel launch, I cast the pointer to (unsigned short*). But apparently that is incorrect.

How should I correctly proceed to be able to read and write the Mat data as short in my kernel?


1 Answer

2 votes

First of all, the element type should be short rather than unsigned short, because the Mat's type is CV_16SC3 (not CV_16UC3).

Now, since the image step is in bytes while the data type is short, the pixel index (or address) should be calculated taking into account the difference in width between a byte and a short. There are two ways to fix this issue.

Method 1:

__global__ void my_kernel( short* img, int width, int height, int img_step)
{
    int x, y, pixel;
    y = blockIdx.y * blockDim.y + threadIdx.y;
    x = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (y >= height)
        return;
    
    if (x >= width)
        return;

    //Reinterpret the input pointer as char* to allow jumping in bytes instead of shorts
    char* imgBytes = reinterpret_cast<char*>(img);
    
    //Calculate row start address using the newly created pointer
    char* rowStartBytes = imgBytes + (y * img_step); // Jump in byte
    
    //Reinterpret the row start address back to required data type.
    short* rowStartShort = reinterpret_cast<short*>(rowStartBytes);
    
    short* pixelAddress = rowStartShort + ( 3 * x ); // Jump in short
    
    //Modify the image values
    pixelAddress[0] = 255; 
    pixelAddress[1] = 255;
    pixelAddress[2] = 255;

}

Method 2:

Divide the input image step by the size of the required data type (short). This can be done when passing the step as a kernel argument.

my_kernel<<<grid,block>>>( img, width, height, img_step/sizeof(short));

I used Method 2 for quite a long time. It is a convenient shortcut, but later, when I looked at the source code of some image processing libraries, I realized that Method 1 is actually more portable, since the size of a type can vary across platforms.