
I am trying to learn CUDA. I started with matrix multiplication on the GPU with the help of this article. My main problem is that I am unable to understand how to access a 2D array in the kernel, since accessing a 2D array is a bit different from the conventional method (matrix[i][j]). This is the part where I am stuck:

for (int i = 0; i < N; i++) {
    tmpSum += A[ROW * N + i] * B[i * N + COL];
}
C[ROW * N + COL] = tmpSum;

I could understand how ROW and COL were derived:

int ROW = blockIdx.y*blockDim.y+threadIdx.y;
int COL = blockIdx.x*blockDim.x+threadIdx.x;

Any explanation with an example is highly appreciated. Thanks!

A single loop in the kernel means each work item is doing a dot product (between one row of m1 and one column of m2) to find the (i, j)th element of the result matrix. Since the kernel takes a 1D array, it can only see the matrix as stacked rows. That's why ROW * N + i means the ith element of the ROWth row, but that is the first matrix. The second matrix does not seem to be transposed before this kernel, so it scans through a single column instead of a row. – huseyin tugrul buyukisik
Why multiply ROW * N and again i * N? This logic seems tricky; I am unable to visualize it. @huseyintugrulbuyukisik – uttejh
2D to 1D means the first row followed by the second row followed by the third row, and so on. So if you multiply ROW by N, you land on the first element of the ROWth row, since N is the row length and the 1D array has length N * M, where M is the number of rows (in other words, M rows of N elements each are stacked one after another in 1D). – huseyin tugrul buyukisik
If you are trying to learn OpenCL, why is this question tagged with CUDA, and why is the code you show CUDA code? – talonmies
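
To make the flattening described in the comments above concrete, here is a minimal host-side sketch (plain C++, no GPU required; the 3 x 3 matrix and its values are illustrative assumptions, not taken from the question) showing that flat[row * N + col] reads exactly the same element as the conventional matrix[row][col]:

#include <cstdio>

int main() {
    const int N = 3;                      // assumed 3 x 3 matrix, purely for illustration
    int matrix[N][N] = { {0, 1, 2},
                         {3, 4, 5},
                         {6, 7, 8} };
    const int *flat = &matrix[0][0];      // the same memory viewed as one contiguous 1D array

    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            // matrix[row][col] and flat[row * N + col] address the same element
            printf("(%d,%d): 2D = %d, flat index %d = %d\n",
                   row, col, matrix[row][col], row * N + col, flat[row * N + col]);
        }
    }
    return 0;
}

For example, element (1, 2) sits at flat index 1 * 3 + 2 = 5, which holds the value 5 in this layout.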

1 Answer


Matrices are stored contiguously, i.e. every row is laid out right after the previous one at consecutive memory locations. What you see here is called flat addressing, i.e. turning the two-element index (row, column) into a single offset from the first element.
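
As a minimal sketch of that idea, assuming square N x N matrices of float stored row-major (the kernel name matMulKernel, the bounds check, and the comments are illustrative assumptions, not the article's exact code), the question's loop fits into a complete kernel like this:

__global__ void matMulKernel(const float *A, const float *B, float *C, int N) {
    int ROW = blockIdx.y * blockDim.y + threadIdx.y;   // row of C this thread computes
    int COL = blockIdx.x * blockDim.x + threadIdx.x;   // column of C this thread computes

    if (ROW < N && COL < N) {                          // skip threads that fall outside the matrix
        float tmpSum = 0.0f;
        for (int i = 0; i < N; i++) {
            // A[ROW][i] lives at flat offset ROW * N + i  (walk along row ROW of A)
            // B[i][COL] lives at flat offset i * N + COL  (walk down column COL of B)
            tmpSum += A[ROW * N + i] * B[i * N + COL];
        }
        C[ROW * N + COL] = tmpSum;                     // store C[ROW][COL] in flat form
    }
}

A matching launch would use a 2D grid, for example dim3 threadsPerBlock(16, 16); dim3 blocks((N + 15) / 16, (N + 15) / 16); matMulKernel<<<blocks, threadsPerBlock>>>(dA, dB, dC, N); where the 16 x 16 block size and the device pointers dA, dB, dC are assumptions for the example.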