Cuda coalesced memory load behavior

Question

I am working with an array of structure, and I want for each block to load in shared memory one cell of the array. For example : block 0 will load array[0] in shared memory and block 1 will load array[1].

In order to do that I cast the array of structure in float* in order to try to coalesce memory access.

I have two version of the code

Version 1

__global__ 
void load_structure(float * label){

  __shared__ float shared_label[48*16];
  __shared__ struct LABEL_2D* self_label;


  shared_label[threadIdx.x*16+threadIdx.y] = 
          label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float) +threadIdx.x*16+threadIdx.y];
  shared_label[(threadIdx.x+16)*16+threadIdx.y] = 
          label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float) + (threadIdx.x+16)*16+threadIdx.y];
  if((threadIdx.x+32)*16+threadIdx.y < sizeof(struct LABEL_2D)/sizeof(float))  {
    shared_label[(threadIdx.x+32)*16+threadIdx.y] = 
          label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float) +(threadIdx.x+32)*16+threadIdx.y];
   }

  if(threadIdx.x == 0){
    self_label = (struct LABEL_2D *) shared_label;
  }
  __syncthreads();
  return;
}

...

dim3 dimBlock(16,16);
load_structure<<<2000,dimBlock>>>((float*)d_Label;

Computation time : 0.740032 ms

Version 2

__global__ 
void load_structure(float * label){

  __shared__ float shared_label[32*32];
  __shared__ struct LABEL_2D* self_label;

  if(threadIdx.x*32+threadIdx.y < *sizeof(struct LABEL_2D)/sizeof(float))
    shared_label[threadIdx.x*32+threadIdx.y] = 
              label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float)+threadIdx.x*32+threadIdx.y+];


  if(threadIdx.x == 0){
      self_label = (struct LABEL_2D *) shared_label;
    }
  __syncthreads();
  return;
}

dim3 dimBlock(32,32);
load_structure<<<2000,dimBlock>>>((float*)d_Label);

Computation time : 2.559264 ms

In both version I used the nvidia profiler and the global load efficiency is 8%.

I have two problems : 1 - I don't understand why there is a difference of timings. 2 - Are my calls coalesced?

I am using a video card with 2.1 compute capability (32 thread/wraps)

Compiler might do the optimizations in sense of eliminating of useless code. Hence, as your threads actually have no effect on global memory, compiler could eliminate the code and as you execute four times more threads in the second, solution you get app. four times greater computation time. Check the ptx-output of compiler to confirm my assumption. — stuhlo

Robert Crovella Robert Crovella · Accepted Answer · 2013-03-25T22:14:12

Your global loads are not coalesced. 8% is pretty low, the worst you can possibly do is 3%.

I believe the principal reason for this is the way you are indexing based on threadIdx.x and threadIdx.y. Let's consider this line of code from the 2nd kernel (the first kernel has similar issues):

shared_label[threadIdx.x*32+threadIdx.y] =  label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float)+threadIdx.x*32+threadIdx.y];

In particular, consider this indexing:

threadIdx.x*32+threadIdx.y

CUDA warps are grouped in the order of X,Y,Z. That means the rapidly varying indices in a warp will tend to be on the X index first, then on Y, then on Z. So if I have a 16x16 threadblock, for example, the first warp will have threads that have threadIdx.x spanning from 0 to 15 and threadIdx.y spanning only 0 to 1. Adjacent threads in this case will mostly have adjacent threadIdx.x indices.

The result for your code is that you have broken coalescing due to your indexing. If you can restructure your loading and storage to use this type of indexing:

threadIdx.y*32+threadIdx.x

You will suddenly see a significant improvement in your global load efficiency. (Your shared memory usage may be better as well.)

I realize you have 2 questions and I'm puzzled when I think about the first one. You've told us the "computation time" is approx. 4 times longer for the second implementation, but presumably you are referring to the compute_interpolation kernel, for which you haven't shown any details at all, except in the 2nd case you are launching 4 times as many threads. Perhaps there is no mystery here. You haven't shown any code. And using a kernel to load a bunch of stuff in shared memory and then exit makes no sense either. Shared memory contents do not persist from one kernel call to the next.

Cuda coalesced memory load behavior

2 Answers