I have a question about the computation of global memory load transactions in CUDA kernel because the profiled value does not match my expected value. Take the following simple matrix multiplication code for example:
__global__ void matmul_kernel(float *A, float* B, float *C, int n)
{
int i, j, k;
float c;
i = blockIdx.x;
for(i=i; i < n; i += gridDim.x){
j = threadIdx.x;
c = 0.0;
for(k = 0; k < n; k++)
c += A[i*n + k]*B[k*n + j];
C[i*n + j] = c;
}
}
dim3 grid(1,1,1);
dim3 block(128,1,1);
n = 128;
matmul_kernel<<<grid, block>>>(A, B, C, n);
I use the simplest matrix multiplication as example. In the above CUDA implementation, I map the i
loop iteration to block index and j
loop to thread index in each thread block. Both the thread block and grid are one dimensional.
Please do not focus on the performance of this implementation. I know it is not efficient as I just use it for experimental purpose.
In this implementation, since I assigned 128 thread in each block, so j
loop can be completely parallelized. But I only assigned 1 block for i
loop, so it will loop n
times. The following figure shows the status of the execution when k=0
. In this status, 128 threads access the first element of A
and 128 first row elements of B
. I execute this CUDA code on Quadro K6000 which uses Kepler 40 architecture and I turned L1 cache on. Since the 128 accesses to B
is coalesced, the number of loads is 128*4/128 = 4
(the 1st 128 is 128 elements, the 2nd 128 is the size of L1 cache line size in bytes, 4 is the bytes for float type). For the 128 accesses to A
, since they access the same element, 1 cache line load should be enough. So the number of global loads is 4+1=5
. But this is only the number of loads when k=0
. k
will be looped for 128 times, and i
will also be looped for 128 times, so the total number of loads is 5*128*128=81920
. However, the profiled global loads is 131072
. This value equals to (4+4)*128*128
. It seems the number of loads to A
at k=0
is 4 not 1. Can anyone explain why the profiled value does not match my expected value?