
I'm trying to understand how this sample code from CUDA SDK 8.0 works:

template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;

    ....
    ....

This part of the kernel is quite tricky for me. I know that the matrices A and B are represented as flat arrays (float *), and I also understand the concept of using shared memory to compute the dot product in chunks.

My problem is that I don't understand the beginning of the code, in particular three specific variables (aBegin, aEnd and bBegin). Could someone give me an example drawing of a possible execution, to help me understand how the indexes work in this specific case? Thank you.


1 Answer


Here is a drawing to help understand the values assigned to the first variables of the CUDA kernel and the overall computation performed:

[figure: CUDA matrix computation]

Matrices are stored using row-major ordering. The CUDA code assumes the matrix sizes are divisible by BLOCK_SIZE.
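As a quick reminder of what row-major means here: cell (row, col) of a matrix of width w lives at flat index row * w + col. A minimal host-side sketch (the function name rowMajorIndex is just for illustration):

#include <cstdio>

// Row-major ordering: cell (row, col) of a width-w matrix sits at
// flat index row * w + col in the underlying array.
int rowMajorIndex(int row, int col, int w)
{
    return row * w + col;
}

int main()
{
    // Example: in a 4 x 8 matrix, cell (2, 3) sits at 2 * 8 + 3 = 19.
    printf("%d\n", rowMajorIndex(2, 3, 8)); // prints 19
    return 0;
}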

The matrices A, B and C are virtually split into blocks according to the CUDA kernel grid. All blocks of C can be computed in parallel. For a given dark-grey block of C, the main loop walks through the several light-grey blocks of A and B (in lockstep). Each block is computed in parallel using BLOCK_SIZE * BLOCK_SIZE threads.
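For context, such a grid is typically set up on the host roughly as follows. This is a hedged sketch, not the question's code: hA (the height of A) and the wrapper launchMatrixMul are assumed names, and 16 is just one possible BLOCK_SIZE.

// Hypothetical host-side launch: one thread per cell of C, one block per
// BLOCK_SIZE x BLOCK_SIZE tile of C. hA (height of A) is an assumed name.
void launchMatrixMul(float *d_C, float *d_A, float *d_B,
                     int hA, int wA, int wB)
{
    const int blockSize = 16;                    // must match the template argument
    dim3 threads(blockSize, blockSize);          // BLOCK_SIZE * BLOCK_SIZE threads per block
    dim3 grid(wB / blockSize, hA / blockSize);   // one block per tile of C
    matrixMulCUDA<16><<<grid, threads>>>(d_C, d_A, d_B, wA, wB);
}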

bx and by are the block-based position of the current block within the CUDA grid. tx and ty are the cell-based position, within the current block, of the cell computed by the current thread.
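Concretely, inside the kernel the global coordinates of the C cell owned by a thread combine the two (a fragment, assuming the variables defined above):

// Global position of this thread's C cell: block offset plus in-block offset.
int row = by * BLOCK_SIZE + ty;  // global row in C
int col = bx * BLOCK_SIZE + tx;  // global column in C
// With row-major storage, that cell is C[row * wB + col].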

Here is a detailed analysis of the aBegin variable: aBegin refers to the memory location of the first cell of the first computed block of the matrix A. It is set to wA * BLOCK_SIZE * by because each block contains BLOCK_SIZE * BLOCK_SIZE cells, there are wA / BLOCK_SIZE blocks horizontally, and there are by rows of blocks above the current computed block of A. Thus, (BLOCK_SIZE * BLOCK_SIZE) * (wA / BLOCK_SIZE) * by = BLOCK_SIZE * wA * by.
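For instance, with illustrative values wA = 8, BLOCK_SIZE = 2 and by = 1: aBegin = 8 * 2 * 1 = 16, which is exactly the row-major index of cell (2, 0) of A (2 * 8 + 0 = 16), i.e. the top-left corner of the second row of blocks.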

The same logic applies to bBegin: it is set to BLOCK_SIZE * bx because there are bx blocks of size BLOCK_SIZE in memory before the first cell of the first computed block of the matrix B.
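Continuing the same illustrative values (BLOCK_SIZE = 2, bx = 1): bBegin = 2 * 1 = 2, the row-major index of cell (0, 2) of B, i.e. the top-left corner of the second column of blocks in the first block row.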

a is incremented by aStep = BLOCK_SIZE in the loop, so that the next computed block of A is the one directly to its right (on the drawing). b is incremented by bStep = BLOCK_SIZE * wB in the same loop, so that the next computed block of B is the one directly below (on the drawing).
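Putting it all together, the part elided in the question (the main loop) has roughly the following shape in the SDK sample. Treat this as a sketch from the standard matrixMul example rather than the asker's exact code:

// Accumulator for the single C cell computed by this thread.
float Csub = 0;

for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
{
    // One shared-memory tile of A and one of B per iteration.
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Each thread loads one cell of each tile.
    As[ty][tx] = A[a + wA * ty + tx];
    Bs[ty][tx] = B[b + wB * ty + tx];
    __syncthreads();               // wait until both tiles are fully loaded

    // Partial dot product over this pair of tiles.
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += As[ty][k] * Bs[k][tx];
    __syncthreads();               // keep tiles alive until everyone is done
}

// Write the accumulated result to this thread's cell of C.
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;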