The short answer is that there is a 128 B limit on the size of a single warp transaction (due to the bus width, I believe). So if your warp needs 256 B of coalesced data, the instruction has to be replayed to service the second 128 B.
In general, transactions only move data in 32 B, 64 B, and 128 B segments. If your warp's request doesn't fit in one of those, the instruction will be replayed at least once. Coalesced access patterns can't avoid this, but they do minimize the number of transactions. For example, coalesced 1 B accesses within a warp get you a single 32 B transaction, and coalesced 4 B accesses (int or float) within a warp get you a single 128 B transaction.
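As a rough back-of-the-envelope check (assuming a full 32-thread warp and fully coalesced, aligned accesses), the transaction count is just the warp's total request size split into 128 B segments:

32 threads x  1 B (char)    =  32 B  ->  one 32 B transaction
32 threads x  4 B (float)   = 128 B  ->  one 128 B transaction
32 threads x  8 B (double)  = 256 B  ->  two 128 B transactions (one replay)
32 threads x 16 B (double2) = 512 B  -> four 128 B transactions (three replays)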
Consider the following kernel:
__global__ void
gmemtest(const double* const src, double* const dest, const int size,
         const int eleMoved){
    // first element this block / this thread is responsible for
    int block_fst     = blockIdx.x*blockDim.x*eleMoved;
    size_t thread_fst = block_fst + threadIdx.x*eleMoved;
    #pragma unroll
    for(size_t i = 0; i < eleMoved; i++){
        if( thread_fst + i < size )
            dest[thread_fst + i] = src[thread_fst + i];
    }
}
Now run it with eleMoved set to 1, 2, 4, and 8. You'll find that the instruction replay for the kernel increases as eleMoved gets larger: each thread's elements sit eleMoved slots away from its neighbours', so a single warp-wide load touches more 128 B segments. The following host-side loop hits all of those cases at block sizes of 128 and 256.
for(size_t j = 1; j < 3; j++){            // block sizes 128 and 256
    for(size_t i = 1; i <= 8; i *= 2){    // 1, 2, 4, and 8 elements per thread
        size_t n_threads      = j*128;
        size_t ele_per_thread = i;
        size_t tot_threads    = ((SIZE - 1)/ele_per_thread) + 1;
        size_t n_blocks       = ((tot_threads - 1)/n_threads) + 1;
        gmemtest<<<n_blocks,n_threads>>>(d_src, d_dest, SIZE, ele_per_thread);
    }
}
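Here SIZE, d_src, and d_dest are whatever you'd normally set up; a minimal sketch (with an arbitrary problem size) looks something like this:

#define SIZE (1 << 24)   // assumed problem size: ~16M doubles

double *d_src, *d_dest;
cudaMalloc(&d_src,  SIZE*sizeof(double));
cudaMalloc(&d_dest, SIZE*sizeof(double));
// ... optionally fill d_src from a host buffer with cudaMemcpy ...

/* launch loop from above goes here */

cudaDeviceSynchronize();
cudaFree(d_src);
cudaFree(d_dest);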
Running nvprof --print-gpu-trace --metrics inst_replay_overhead
we see:
==22053== Profiling result:
Device Context Stream Kernel Instruction Replay Overhead
Tesla K20c (0) 1 2 gmemtest(double cons 0.191697
Tesla K20c (0) 1 2 gmemtest(double cons 0.866548
Tesla K20c (0) 1 2 gmemtest(double cons 3.472359
Tesla K20c (0) 1 2 gmemtest(double cons 7.444514
Tesla K20c (0) 1 2 gmemtest(double cons 0.175090
Tesla K20c (0) 1 2 gmemtest(double cons 0.912531
Tesla K20c (0) 1 2 gmemtest(double cons 4.067719
Tesla K20c (0) 1 2 gmemtest(double cons 7.576686
In practice you might run into this if you're moving something like a warp's worth of double2 data.
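For instance, a plain vectorized copy like the sketch below (my own illustration, not one of the profiled kernels) requests 32 threads x 16 B = 512 B per warp for each load, so every load/store instruction gets replayed to cover the extra 128 B segments:

__global__ void
copy_double2(const double2* const src, double2* const dest, const int n){
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if(idx < n)
        dest[idx] = src[idx];   // 16 B per thread -> 512 B per warp -> four 128 B transactions
}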
If you really want to get into performance-related issues, I can't recommend this talk enough: Micikevicius, "Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them".