I was able to understand the "l2_subp0_read_sector_misses" and "l2_subp1_read_sector_misses" events by going through this post. Now I have a similar question about the events "l2_subp0_write_sector_misses" and "l2_subp1_write_sector_misses".
Let's first take the same example as in the linked post (vector add).
Kernel code:
__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    int blockStartIndex = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x;
    int threadEndIndex = threadStartIndex + ( N * blockDim.x );
    int i;
    for( i=threadStartIndex; i<threadEndIndex; i+=blockDim.x ){
        C[i] = A[i] + B[i];
    }
}
Here, I also copy array C from host to device. Therefore, C should already be in the L2 cache (the total size of the 3 arrays is less than the size of the L2 cache). But according to the nvprof results, all the write accesses to C are still counted as L2 cache misses.
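For comparison with the counters, the number of sectors the kernel writes to C can be estimated on the host. This is only a minimal sketch: countWrittenSectors is a hypothetical helper name, and it assumes 32-byte L2 sectors and that C starts on a sector boundary. It replays the kernel's index pattern and counts the distinct sectors touched by the stores.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper (not part of CUDA or nvprof): replays the store
// indices of AddVectors on the host and counts the distinct sectors of C
// that are written.
// Assumptions: 32-byte sectors by default, and C begins on a sector boundary.
std::size_t countWrittenSectors(int gridDimX, int blockDimX, int N,
                                std::size_t sectorBytes = 32)
{
    std::set<std::size_t> sectors;
    for (int block = 0; block < gridDimX; ++block) {
        // Same index arithmetic as the kernel, with block/thread as loop variables.
        int blockStartIndex = block * blockDimX * N;
        for (int thread = 0; thread < blockDimX; ++thread) {
            int threadStartIndex = blockStartIndex + thread;
            int threadEndIndex = threadStartIndex + N * blockDimX;
            for (int i = threadStartIndex; i < threadEndIndex; i += blockDimX) {
                // Byte offset of C[i] divided by the sector size gives the sector id.
                sectors.insert(static_cast<std::size_t>(i) * sizeof(float) / sectorBytes);
            }
        }
    }
    return sectors.size();
}
```

For example, countWrittenSectors(2, 64, 4) covers 512 floats (2048 bytes), so it returns 64. If the sum of "l2_subp0_write_sector_misses" and "l2_subp1_write_sector_misses" is close to this estimate for the real launch configuration, then essentially every written sector is being reported as a miss.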
Is this the expected behavior? Are there any situations where we can expect L2 write hits, or do L2 write accesses always count as misses?
Thanks.