The short answer is that there is a 128 B limit on the size of a single warp transaction (due to the bus width, I believe). So if your warp needs 256 B of coalesced data, the instruction has to be replayed to service the second 128 B.
In general, transactions only move data in 32 B, 64 B, and 128 B segments. If your warp's request doesn't fit in one of those, the instruction will be replayed at least once. Coalesced access patterns can't avoid this, but they do minimize the number of transactions. For example, coalesced 1 B accesses within a warp get you a single 32 B transaction, and coalesced 4 B accesses (int or float) within a warp get you a single 128 B transaction.
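As a rough back-of-the-envelope check (assuming a full 32-thread warp and fully coalesced, aligned accesses), the transaction count is just the warp's total request size split into 128 B segments:

32 threads x  1 B (char)    =  32 B  ->  one 32 B transaction
32 threads x  4 B (float)   = 128 B  ->  one 128 B transaction
32 threads x  8 B (double)  = 256 B  ->  two 128 B transactions (one replay)
32 threads x 16 B (double2) = 512 B  -> four 128 B transactions (three replays)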
Consider the following kernel:
__global__ void
gmemtest(const double* const src, double* const dest, const int size,
         const int eleMoved){
    // first element this block / this thread is responsible for
    int block_fst     = blockIdx.x*blockDim.x*eleMoved;
    size_t thread_fst = block_fst + threadIdx.x*eleMoved;
    #pragma unroll
    for(size_t i = 0; i < eleMoved; i++){
        if( thread_fst + i < size )
            dest[thread_fst + i] = src[thread_fst + i];
    }
}
Now run it with eleMoved set to 1, 2, 4, and 8. You'll find that the instruction replay for the kernel increases as eleMoved gets larger: each thread's elements sit eleMoved slots away from its neighbours', so a single warp-wide load touches more 128 B segments. The following host-side loop hits all of those cases at block sizes of 128 and 256.
for(size_t j = 1; j < 3; j++){            // block sizes 128 and 256
    for(size_t i = 1; i <= 8; i *= 2){    // 1, 2, 4, and 8 elements per thread
        size_t n_threads      = j*128;
        size_t ele_per_thread = i;
        size_t tot_threads    = ((SIZE - 1)/ele_per_thread) + 1;
        size_t n_blocks       = ((tot_threads - 1)/n_threads) + 1;
        gmemtest<<<n_blocks,n_threads>>>(d_src, d_dest, SIZE, ele_per_thread);
    }
}
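Here SIZE, d_src, and d_dest are whatever you'd normally set up; a minimal sketch (with an arbitrary problem size) looks something like this:

#define SIZE (1 << 24)   // assumed problem size: ~16M doubles

double *d_src, *d_dest;
cudaMalloc(&d_src,  SIZE*sizeof(double));
cudaMalloc(&d_dest, SIZE*sizeof(double));
// ... optionally fill d_src from a host buffer with cudaMemcpy ...

/* launch loop from above goes here */

cudaDeviceSynchronize();
cudaFree(d_src);
cudaFree(d_dest);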
Running nvprof --print-gpu-trace --metrics inst_replay_overhead
we see:
==22053== Profiling result:
Device Context Stream Kernel Instruction Replay Overhead
Tesla K20c (0) 1 2 gmemtest(double cons 0.191697
Tesla K20c (0) 1 2 gmemtest(double cons 0.866548
Tesla K20c (0) 1 2 gmemtest(double cons 3.472359
Tesla K20c (0) 1 2 gmemtest(double cons 7.444514
Tesla K20c (0) 1 2 gmemtest(double cons 0.175090
Tesla K20c (0) 1 2 gmemtest(double cons 0.912531
Tesla K20c (0) 1 2 gmemtest(double cons 4.067719
Tesla K20c (0) 1 2 gmemtest(double cons 7.576686
In practice you might run into this if you're moving something like a warp's worth of double2 data.
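For instance, a plain vectorized copy like the sketch below (my own illustration, not one of the profiled kernels) requests 32 threads x 16 B = 512 B per warp for each load, so every load/store instruction gets replayed to cover the extra 128 B segments:

__global__ void
copy_double2(const double2* const src, double2* const dest, const int n){
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if(idx < n)
        dest[idx] = src[idx];   // 16 B per thread -> 512 B per warp -> four 128 B transactions
}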
If you really want to get into performance-related issues, I can't recommend this talk enough: Micikevicius, "Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them".