I tried to count the number of DRAM (global memory) accesses for simple vector add kernel.
__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
int blockStartIndex = blockIdx.x * blockDim.x * N;
int threadStartIndex = blockStartIndex + threadIdx.x;
int threadEndIndex = threadStartIndex + ( N * blockDim.x );
int i;
for( i=threadStartIndex; i<threadEndIndex; i+=blockDim.x ){
C[i] = A[i] + B[i];
}
}
Grid Size = 180 Block size = 128
size of array = 180 * 128 * N floats where N is input parameter (elements per thread)
when N = 1, size of array = 180 * 128 * 1 floats = 90KB
All arrays A, B and C should be read from DRAM.
Therefore theoretically,
DRAM writes (C) = 2880 (32 byte accesses) DRAM reads (A,B) = 2880 + 2880 = 5760 (32 byte accesses)
But when I used nvprof
DRAM writes = fb_subp0_write_sectors + fb_subp1_write_sectors = 1440 + 1440 = 2880 (32 byte accesses) DRAM reads = fb_subp0_read_sectors + fb_subp1_read_sectors = 23 + 7 = 30 (32 byte accesses)
Now this is the problem. Theoretically there should be 5760 DRAM reads, but nvprof only reports 30, for me this looks impossible. Further more, if you double the size of the vector (N = 2), still the reported DRAM accesses remains at 30.
It would be great, if someone can shed some light.
I have disabled the L1 cache by using compiler option "-Xptxas -dlcm=cg
"
Thanks, Waruna