Access to each matrix point

Question

I have a problem accessing each element of a matrix in a CUDA kernel. I'm working with OpenCV and want to "do something" to each element of the matrix. First, I convert the uint8_t matrix to a float matrix like this:

for(int i=0; i<inputMatrix.rows; ++i){
    for(int j=0; j<inputMatrix.cols * cn; j+=cn){                           
        examMatrix[i*inputMatrix.cols*cn + j + 0] = pixelPtr[i*inputMatrix.cols*cn + j + 0]; // B
        examMatrix[i*inputMatrix.cols*cn + j + 1] = pixelPtr[i*inputMatrix.cols*cn + j + 1]; // G
        examMatrix[i*inputMatrix.cols*cn + j + 2] = pixelPtr[i*inputMatrix.cols*cn + j + 2]; // R               
    }
}

This works for a 3-channel image: if I create an output image from this matrix (after converting back to uint8_t), it looks the same as the input. But I want to make some changes using CUDA.

I set block size and grid size like this:

dim3 dimBlock(count, 3);
dim3 dimGrid( frameHeight/count, frameWidth/count);

Here count is the number of threads, 3 is the number of channels, and frameHeight and frameWidth are the frame dimensions.

So I allocated GPUexamMatrix and GPUresultMatrix and tried to access each element in the kernel. My kernel looks like this:

resultMatrix[(blockIdx.x * blockIdx.y) + (threadIdx.x * threadIdx.y)] = examMatrix[(blockIdx.x * blockIdx.y) + (threadIdx.x * threadIdx.y)];

As you can see, I simply tried to copy the matrix. After this operation, when I copied the result back to the host and printed it, the matrix contained very small or very large float values, not the values from the exam matrix.

I suppose I'm doing something wrong inside the kernel. Any ideas?

That looks like uninitialized values, or broken values due to a race condition. You should run cuda-memcheck on your application to narrow it down. — Pavel
frameHeight/count - you sure it's fine with integer division? — Pavel
I'm pretty sure it's fine, because I'm testing with count = 1. — caro
In cuda-memcheck I've got:
========= CUDA-MEMCHECK
========= Program hit error 11 on CUDA API call to cudaMemcpy
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so [0x311b30]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.6.0 (cudaMemcpy + 0x1df) [0x33f9f]
=========     Host Frame:./gauss_gpu [0x2381]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21de5]
=========     Host Frame:./gauss_gpu [0x1c79]
=========
— caro

Robert Crovella · Accepted Answer · 2014-06-04T19:27:58

Your usage of the built-in variables can't possibly be correct. To give just one simple example, threadIdx.x = 0 with threadIdx.y = 2 accesses the same element as threadIdx.x = 2 with threadIdx.y = 0, because both products are 0. You don't have unique indexing. I could try to give you something that will work, but I'm a little confused by your dimBlock variable. In short, this is not how I would set up the grid, block, and indexing to handle a 2D array. I wouldn't use 3, the channel count, in my threadblock dimensions.
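To make the collision concrete, here is a minimal host-side sketch in plain C++ (no GPU needed) that walks every block/thread pair of a launch and counts how many distinct offsets each indexing expression yields. The function names are hypothetical and only for illustration; the second function uses the usual row/column scheme for a 2D array and assumes the array width equals gridX * blockX.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper: counts the distinct offsets produced by the
// question's expression
//   (blockIdx.x * blockIdx.y) + (threadIdx.x * threadIdx.y)
// over a gridX x gridY grid of blockX x blockY blocks.
std::size_t countDistinctProductOffsets(int gridX, int gridY, int blockX, int blockY) {
    std::set<long> offsets;
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < blockY; ++ty)
                for (int tx = 0; tx < blockX; ++tx)
                    offsets.insert(static_cast<long>(bx) * by + static_cast<long>(tx) * ty);
    return offsets.size();
}

// Hypothetical helper: same walk, but with a row/column index
//   row = threadIdx.y + blockIdx.y * blockDim.y
//   col = threadIdx.x + blockIdx.x * blockDim.x
//   offset = row * width + col
// Assumption: width is exactly gridX * blockX.
std::size_t countDistinctRowColOffsets(int gridX, int gridY, int blockX, int blockY) {
    const long width = static_cast<long>(gridX) * blockX;
    std::set<long> offsets;
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < blockY; ++ty)
                for (int tx = 0; tx < blockX; ++tx) {
                    const long row = ty + static_cast<long>(by) * blockY;
                    const long col = tx + static_cast<long>(bx) * blockX;
                    offsets.insert(row * width + col);
                }
    return offsets.size();
}
```

For a 4x4 grid of dimBlock(1, 3) blocks (the count = 1 case mentioned in the comments), 48 threads run, but countDistinctProductOffsets(4, 4, 1, 3) reports only 7 distinct offsets, so most threads write to the same few elements and the rest of the output is never written. countDistinctRowColOffsets(4, 4, 1, 3) reports 48, one per thread.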

Try something like this:

// keep count small, e.g. 16: count*count must not exceed the per-block thread limit (512 or 1024, depending on the GPU)
dim3 dimBlock(count, count);
dim3 dimGrid( frameWidth/dimBlock.x, frameHeight/dimBlock.y);

and in your kernel:

int row = threadIdx.y + blockIdx.y*blockDim.y;
int col = threadIdx.x + blockIdx.x*blockDim.x;

resultMatrix[3*(row*frameWidth + col) + 0] = examMatrix[3*(row*frameWidth + col) + 0];
resultMatrix[3*(row*frameWidth + col) + 1] = examMatrix[3*(row*frameWidth + col) + 1];
resultMatrix[3*(row*frameWidth + col) + 2] = examMatrix[3*(row*frameWidth + col) + 2];

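The above assumes frameWidth and frameHeight are evenly divisible by count. Otherwise the integer division in dimGrid rounds down and the right and bottom edges of the frame are not covered.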
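If the frame size is not a multiple of count, a common pattern is to round the grid size up and let out-of-range threads skip their work. The sketch below emulates that on the host in plain C++ so it can be run without a GPU. blocksToCover, pixelOffset, and copyWithGuard are hypothetical names, not part of CUDA or of the code above, and the 3-channel interleaved layout is assumed from the question. In a real kernel the four loops disappear, row and col come from threadIdx, blockIdx, and blockDim as shown earlier, and the guard becomes an early return.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of blocks needed to cover `extent` elements
// with blocks of `blockSize` threads (ceiling division).
int blocksToCover(int extent, int blockSize) {
    assert(extent >= 0 && blockSize > 0);
    return (extent + blockSize - 1) / blockSize;
}

// Offset of one channel of pixel (row, col) in an interleaved, row-major
// buffer: channels * (row * frameWidth + col) + channel.
std::size_t pixelOffset(int row, int col, int channel, int frameWidth, int channels) {
    return static_cast<std::size_t>(channels) *
               (static_cast<std::size_t>(row) * frameWidth + col) + channel;
}

// Host emulation of a guarded copy kernel launched with
//   dimBlock(count, count), dimGrid(ceil(width/count), ceil(height/count)).
// Returns how many emulated threads fell outside the frame and did nothing.
int copyWithGuard(const std::vector<float>& examMatrix, std::vector<float>& resultMatrix,
                  int frameWidth, int frameHeight, int count) {
    const int channels = 3;  // assumption: BGR image, as in the question
    const int gridX = blocksToCover(frameWidth, count);
    const int gridY = blocksToCover(frameHeight, count);
    int skipped = 0;
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < count; ++ty)
                for (int tx = 0; tx < count; ++tx) {
                    const int row = ty + by * count;
                    const int col = tx + bx * count;
                    // Bounds guard: in a kernel this would be `return;`.
                    if (row >= frameHeight || col >= frameWidth) {
                        ++skipped;
                        continue;
                    }
                    for (int c = 0; c < channels; ++c) {
                        const std::size_t i = pixelOffset(row, col, c, frameWidth, channels);
                        resultMatrix[i] = examMatrix[i];
                    }
                }
    return skipped;
}
```

With a 5x3 frame and count = 2, the rounded-up grid is 3x2 blocks, so 24 threads are emulated for 15 pixels and 9 of them are skipped by the guard, while every element of the frame is still copied.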
