I'm trying to understand how this sample code from CUDA SDK 8.0 works:
template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;
....
....
This part of the kernel is quite tricky for me. I know that the matrices A and B are stored as flat 1-D arrays (float *) in row-major order, and I also understand the idea of loading tiles into shared memory so the dot product can be computed chunk by chunk.

My problem is that I don't understand the beginning of the code, in particular three specific variables (aBegin, aEnd and bBegin). Could someone walk me through an example (a drawing would help) of a possible execution, so I can see how the indexes work in this specific case? Thank you
