7 votes

While profiling my kernels in Visual Profiler on Kepler hardware, I've noticed that the profiler shows global loads and stores as cached in L1. I'm confused because the programming guide and the Kepler tuning manual state that:

L1 caching in Kepler GPUs is reserved only for local memory accesses, such as register spills and stack data. Global loads are cached in L2 only (or in the Read-Only Data Cache).

There are no register spills (the profiler shows L1 caching even for a primitive, 2-line 'add' kernel), and I'm not sure what 'stack data' means here.
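(For illustration, my understanding is that 'local memory' means per-thread data like the array in the following sketch, which the compiler typically cannot keep in registers; this example is mine, not from the guides.)

// Sketch: per-thread 'stack-like' data. The data-dependent index into
// 'buf' typically forces the compiler to place it in local memory, and
// on Kepler such local memory accesses are the ones cached in L1.
__global__ void localMemKernel(const int *in, int *out)
{
    int buf[64];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < 64; ++k)
        buf[k] = in[i] + k;
    out[i] = buf[in[i] & 63];   // dynamic index -> likely local memory
}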

The GK110 whitepaper shows that global accesses go through the L1 cache in all but one case: loads through the read-only data cache (__ldg). Does this mean that while global accesses go through the L1 hardware, they are not actually cached there? And does it mean that if I have spilled-register data cached in L1, that data can be evicted as a result of a gmem access?
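(To be concrete, by 'loads through read-only cache' I mean something like this hypothetical variant of an add kernel using __ldg, which requires CC 3.5:)

// Sketch: loads routed through the read-only data cache via __ldg
// (compute capability 3.5+ only, e.g. GK110).
__global__ void addKernelLdg(float *c, const float *a, const float *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = __ldg(&a[i]) + __ldg(&b[i]);
}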

UPDATE: I've realized that I might be misreading the information the profiler is giving me, so here are the kernel code and the profiler results (I've tried both on a Titan and a K40 with the same results).

template<typename T>
__global__ void addKernel(T *c, const T *a, const T *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

...
// Kernel call
float* x;
float* y;
float* d;
// ... (allocation and initialization elided; see the reconstruction sketch below)
addKernel<<<1024, 1024>>>(d, x, y);
cudaError_t cudaStatus = cudaDeviceSynchronize();
assert(cudaSuccess == cudaStatus);
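(For completeness, here is my reconstruction of the elided setup, assuming 1024 * 1024 floats per array to match the <<<1024, 1024>>> launch; the sizes are inferred, not taken from the original code.)

// Assumed setup: 1024 blocks * 1024 threads = 1M floats (4 MB) per array.
const size_t n = 1024 * 1024;
float *x, *y, *d;
assert(cudaSuccess == cudaMalloc(&x, n * sizeof(float)));
assert(cudaSuccess == cudaMalloc(&y, n * sizeof(float)));
assert(cudaSuccess == cudaMalloc(&d, n * sizeof(float)));
// ... initialize x and y, e.g. cudaMemcpy from host buffers ...
addKernel<<<1024, 1024>>>(d, x, y);
assert(cudaSuccess == cudaDeviceSynchronize());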

Visual Profiler output:

[Visual Profiler screenshot: per-kernel memory statistics, including L1 global load/store transaction counts]

The L1 numbers make perfect sense if the L1 cache is enabled for gmem accesses. For the loads, 65536 transactions of 128 bytes each account exactly for reading the two input arrays of 1024 * 1024 floats:

65536 * 128 == 2 * 4 * 1024 * 1024

UPDATE 2: added SASS and PTX code. The SASS code is very simple: it contains reads from constant memory (the kernel parameters) and generic loads/stores from/to global memory (LD.E/ST.E instructions, not LDG).

Function : _Z9addKernelIfEvPT_PKS0_S3_
.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                             /* 0x088cb0a0a08c1000 */
/*0008*/                MOV R1, c[0x0][0x44];                /* 0x64c03c00089c0006 */
/*0010*/                S2R R0, SR_CTAID.X;                  /* 0x86400000129c0002 */
/*0018*/                MOV32I R5, 0x4;                      /* 0x74000000021fc016 */
/*0020*/                S2R R3, SR_TID.X;                    /* 0x86400000109c000e */
/*0028*/                IMAD R2, R0, c[0x0][0x28], R3;       /* 0x51080c00051c000a */
/*0030*/                IMAD R6.CC, R2, R5, c[0x0][0x148];   /* 0x910c1400291c081a */
/*0038*/                IMAD.HI.X R7, R2, R5, c[0x0][0x14c]; /* 0x93181400299c081e */
                                                             /* 0x08a0a4b0809c80b0 */
/*0048*/                IMAD R8.CC, R2, R5, c[0x0][0x150];   /* 0x910c14002a1c0822 */
/*0050*/                IMAD.HI.X R9, R2, R5, c[0x0][0x154]; /* 0x931814002a9c0826 */
/*0058*/                LD.E R3, [R6];                       /* 0xc4800000001c180c */
/*0060*/                LD.E R0, [R8];                       /* 0xc4800000001c2000 */
/*0068*/                IMAD R4.CC, R2, R5, c[0x0][0x140];   /* 0x910c1400281c0812 */
/*0070*/                IMAD.HI.X R5, R2, R5, c[0x0][0x144]; /* 0x93181400289c0816 */
/*0078*/                FADD R0, R3, R0;                     /* 0xe2c00000001c0c02 */
                                                             /* 0x080000000000b810 */
/*0088*/                ST.E [R4], R0;                       /* 0xe4800000001c1000 */
/*0090*/                EXIT ;                               /* 0x18000000001c003c */
/*0098*/                BRA 0x98;                            /* 0x12007ffffc1c003c */
/*00a0*/                NOP;                                 /* 0x85800000001c3c02 */
/*00a8*/                NOP;                                 /* 0x85800000001c3c02 */
/*00b0*/                NOP;                                 /* 0x85800000001c3c02 */
/*00b8*/                NOP;                                 /* 0x85800000001c3c02 */

PTX:

.visible .entry _Z9addKernelIfEvPT_PKS0_S3_(
.param .u64 _Z9addKernelIfEvPT_PKS0_S3__param_0,
.param .u64 _Z9addKernelIfEvPT_PKS0_S3__param_1,
.param .u64 _Z9addKernelIfEvPT_PKS0_S3__param_2
)
{
.reg .s32 %r<5>;
.reg .f32 %f<4>;
.reg .s64 %rd<11>;

ld.param.u64 %rd1, [_Z9addKernelIfEvPT_PKS0_S3__param_0];
ld.param.u64 %rd2, [_Z9addKernelIfEvPT_PKS0_S3__param_1];
ld.param.u64 %rd3, [_Z9addKernelIfEvPT_PKS0_S3__param_2];
cvta.to.global.u64 %rd4, %rd1;
.loc 1 22 1
mov.u32 %r1, %ntid.x;
mov.u32 %r2, %ctaid.x;
mov.u32 %r3, %tid.x;
mad.lo.s32 %r4, %r1, %r2, %r3;
cvta.to.global.u64 %rd5, %rd2;
mul.wide.s32 %rd6, %r4, 4;
add.s64 %rd7, %rd5, %rd6;
cvta.to.global.u64 %rd8, %rd3;
add.s64 %rd9, %rd8, %rd6;
.loc 1 23 1
ld.global.f32 %f1, [%rd9];
ld.global.f32 %f2, [%rd7];
add.f32 %f3, %f2, %f1;
add.s64 %rd10, %rd4, %rd6;
.loc 1 23 1
st.global.f32 [%rd10], %f3;
.loc 1 24 2
ret;
}
Have a look at this post. The L1 cache and the read-only cache are different. In the comments, Robert mentioned that "Kepler does not normally have L1 enabled for ordinary global loads". – Farzad
Thank you for the link. I do understand that the L1 and texture caches are different. I'm interested in what Robert meant by "normally", beyond what's mentioned in the programming guides (i.e. register spills and stack data). – Alexey Kamenev
Can you show your PTX output? I'm thinking there might be low-level instructions resulting from your short kernel that exploit the L1 cache to save intermediate results. – Farzad
On the Fermi and Kepler architectures, all generic, global, local, and shared memory accesses are handled by the L1 cache. Shared memory accesses do not require a tag lookup and do not invalidate a cache line. All local and global memory accesses require a tag lookup. Uncached global memory stores and reads will invalidate a cache line. On compute capability 3.0 and 3.5, all global accesses through L1 are uncached (LDG on CC 3.5 goes through TEX instead). – Greg Smith
Thank you Greg, this answers my question! If you promote your comment to an answer, I will mark the question as answered. – Alexey Kamenev

1 Answer

5 votes

On the Fermi and Kepler architectures, all generic, global, local, and shared memory operations are handled by the L1 cache. Shared memory accesses do not require a tag lookup and do not invalidate a cache line. All local and global memory accesses require a tag lookup. Uncached global memory stores and reads will invalidate a cache line. On compute capability 3.0 and 3.5, all global memory reads are uncached in L1, with the exception of LDG on CC 3.5; the LDG instruction goes through the texture cache.
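For what it's worth, here is a sketch of how one might opt into that LDG path on CC 3.5 (this variant is mine, not part of the original answer): either call __ldg() explicitly, or qualify the input pointers as const __restrict__ so the compiler can prove the loads are read-only.

// Sketch: with const __restrict__ inputs, nvcc targeting sm_35 can emit
// LDG.E loads, which go through the read-only (texture) cache path.
template<typename T>
__global__ void addKernelRO(T * __restrict__ c,
                            const T * __restrict__ a,
                            const T * __restrict__ b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];   // may compile to LDG.E on CC 3.5
}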