Accelerating global memory random access: bypassing the L1 cache
Fermi and Kepler architectures support two types of loads from global memory. Full caching is the default mode: a load attempts to hit in L1, then L2, then global memory (GMEM), and the load granularity is a 128-byte cache line. An L2-only load attempts to hit in L2, then GMEM, and the load granularity is 32 bytes. For certain random access patterns, memory efficiency can be increased by bypassing L1 and exploiting the finer granularity of L2, because each scattered access then fetches a 32-byte segment instead of a full 128-byte line. This can be done by passing the -Xptxas -dlcm=cg option to nvcc.
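As an illustration of the kind of access pattern meant here, the following is a minimal sketch of a gather kernel whose read addresses are determined by an index array. The names gather, run_gather and the file name gather.cu are hypothetical and not taken from any library; error checking is omitted for brevity. Whether L2-only loads actually help depends on the access pattern and the device, so both builds should be timed.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Build variants to compare (gather.cu is a hypothetical file name):
//   nvcc -Xptxas -dlcm=ca -c gather.cu   // global loads cached in L1 and L2
//   nvcc -Xptxas -dlcm=cg -c gather.cu   // global loads cached in L2 only

// Gather: out[i] = in[idx[i]]. When idx is effectively random, neighbouring
// threads read unrelated addresses, so most of each 128-byte line is unused.
__global__ void gather(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[idx[i]];
    }
}

// Hypothetical host helper: copies the inputs to the device, runs the kernel,
// and returns the gathered values. Every idx entry must be a valid index into in.
std::vector<float> run_gather(const std::vector<float>& in,
                              const std::vector<int>& idx)
{
    const int n = static_cast<int>(idx.size());
    std::vector<float> out(n, 0.0f);
    if (n == 0 || in.empty()) {
        return out;
    }

    float* d_in = nullptr;
    float* d_out = nullptr;
    int* d_idx = nullptr;
    cudaMalloc(&d_in, in.size() * sizeof(float));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    gather<<<blocks, threads>>>(d_in, d_idx, d_out, n);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_idx);
    cudaFree(d_out);
    return out;
}
```

The source code is identical in both builds; only the ptxas option changes which cache levels serve the loads of in and idx.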
General guidelines for accelerating global memory access: disabling ECC support
Fermi and Kepler GPUs support Error Correcting Code (ECC), and ECC is enabled by default. ECC reduces peak memory bandwidth; it is intended to protect data integrity in applications such as medical imaging and large-scale cluster computing. If it is not needed, it can be disabled for improved performance using the nvidia-smi utility on Linux (see the link), or via the Control Panel on Microsoft Windows systems. Note that toggling ECC on or off requires a reboot to take effect.
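Before changing the setting, it can be useful to check from within a program whether ECC is currently active. The sketch below reads the ECCEnabled field of cudaDeviceProp through the CUDA runtime; the helper names ecc_enabled and print_ecc_status are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Returns 1 if ECC is enabled on the given device, 0 if it is disabled,
// and -1 if the device properties could not be queried.
int ecc_enabled(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.ECCEnabled ? 1 : 0;
}

// Prints the ECC state of every CUDA device visible to the process.
void print_ecc_status()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        count = 0;
    }
    for (int d = 0; d < count; ++d) {
        std::printf("device %d: ECC state %d\n", d, ecc_enabled(d));
    }
}
```

This only reports the state. The setting itself is changed outside the program, for example with the ECC configuration option of nvidia-smi (nvidia-smi -e 0 to disable, typically with administrator rights), followed by the reboot mentioned above.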
General guidelines for accelerating global memory access on Kepler: using read-only data cache
Kepler features a 48 KB cache for data that is known to be read-only for the duration of the function. Use of the read-only path is beneficial because it offloads the Shared/L1 cache path and it supports full-speed unaligned memory access. Use of the read-only path can be managed automatically by the compiler (qualify the pointer with const and __restrict__) or explicitly by the programmer (use the __ldg() intrinsic).
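The two options can be sketched as follows. The kernel and helper names (scale_restrict, scale_ldg, run_scale) are hypothetical, error checking is omitted, and the __ldg() variant assumes a device of compute capability 3.5 or higher (for example, compiled with -arch=sm_35 on a Kepler-era toolkit).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Compiler-managed: const plus __restrict__ tells the compiler that in is
// read-only and not aliased by out, so it may use the read-only data cache.
__global__ void scale_restrict(const float* __restrict__ in,
                               float* __restrict__ out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a * in[i];
    }
}

// Explicit: __ldg() requests a load through the read-only data cache.
__global__ void scale_ldg(const float* in, float* out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a * __ldg(&in[i]);
    }
}

// Hypothetical host helper: returns a * in[i] for every element, computed
// with either kernel so that the two variants can be compared.
std::vector<float> run_scale(const std::vector<float>& in, float a, bool use_ldg)
{
    const int n = static_cast<int>(in.size());
    std::vector<float> out(n, 0.0f);
    if (n == 0) {
        return out;
    }

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    if (use_ldg) {
        scale_ldg<<<blocks, threads>>>(d_in, d_out, a, n);
    } else {
        scale_restrict<<<blocks, threads>>>(d_in, d_out, a, n);
    }

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

With the qualifier approach the compiler decides whether to use the read-only path, so it is not guaranteed for every load; __ldg() makes the request explicit, but it is the programmer's responsibility that the data is not written during the kernel.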