2 votes

I am working with an array of structures, and I want each block to load one element of the array into shared memory. For example, block 0 will load array[0] into shared memory and block 1 will load array[1].

To do that, I cast the array of structures to float* to try to coalesce the memory accesses.
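(For reference, the definition of LABEL_2D is not shown anywhere. A minimal sketch consistent with the index bounds used in the kernels below, i.e. a size between 513 and 768 floats, and with the padded example in the second answer, would be something like the following; the field name and exact size are assumptions.)

// Hypothetical definition, for reference only: the kernels below assume that
// sizeof(struct LABEL_2D)/sizeof(float) lies between 513 and 768.
struct LABEL_2D{
  float data[745];
};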

I have two versions of the code.

Version 1

__global__ 
void load_structure(float * label){

  __shared__ float shared_label[48*16];
  __shared__ struct LABEL_2D* self_label;


  shared_label[threadIdx.x*16+threadIdx.y] = 
          label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float) +threadIdx.x*16+threadIdx.y];
  shared_label[(threadIdx.x+16)*16+threadIdx.y] = 
          label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float) + (threadIdx.x+16)*16+threadIdx.y];
  if((threadIdx.x+32)*16+threadIdx.y < sizeof(struct LABEL_2D)/sizeof(float))  {
    shared_label[(threadIdx.x+32)*16+threadIdx.y] = 
          label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float) +(threadIdx.x+32)*16+threadIdx.y];
   }

  if(threadIdx.x == 0){
    self_label = (struct LABEL_2D *) shared_label;
  }
  __syncthreads();
  return;
}

...

dim3 dimBlock(16,16);
load_structure<<<2000,dimBlock>>>((float*)d_Label);

Computation time : 0.740032 ms

Version 2

__global__ 
void load_structure(float * label){

  __shared__ float shared_label[32*32];
  __shared__ struct LABEL_2D* self_label;

  if(threadIdx.x*32+threadIdx.y < sizeof(struct LABEL_2D)/sizeof(float))
    shared_label[threadIdx.x*32+threadIdx.y] = 
              label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float)+threadIdx.x*32+threadIdx.y];


  if(threadIdx.x == 0){
      self_label = (struct LABEL_2D *) shared_label;
    }
  __syncthreads();
  return;
}

dim3 dimBlock(32,32);
load_structure<<<2000,dimBlock>>>((float*)d_Label);

Computation time : 2.559264 ms

In both versions I used the NVIDIA profiler, and the global load efficiency is 8%.

I have two questions: 1 - I don't understand why there is a difference in timings. 2 - Are my accesses coalesced?

I am using a video card with compute capability 2.1 (32 threads/warp).

2
The compiler might optimize by eliminating useless code. Since your threads actually have no effect on global memory, the compiler could eliminate the code, and because you execute four times more threads in the second solution, you get approximately four times greater computation time. Check the PTX output of the compiler to confirm my assumption. – stuhlo
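(One way to test this hypothesis, besides inspecting the PTX, is to give the kernel an observable side effect so the loads cannot be removed as dead code. A minimal sketch only: the out parameter is hypothetical and not part of the original code, and the LABEL_2D sketch above is assumed.)

__global__ 
void load_structure_check(float * label, float * out){

  __shared__ float shared_label[48*16];

  // Same kind of load as in Version 1, shown once here for brevity.
  shared_label[threadIdx.x*16+threadIdx.y] = 
          label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float) + threadIdx.x*16+threadIdx.y];

  __syncthreads();

  // One dependent store per block keeps the shared-memory loads "live".
  if(threadIdx.x == 0 && threadIdx.y == 0)
    out[blockIdx.x] = shared_label[0];
}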

2 Answers

2 votes

Your global loads are not coalesced. 8% is pretty low; the worst you can possibly do is 3%.

I believe the principal reason for this is the way you are indexing based on threadIdx.x and threadIdx.y. Let's consider this line of code from the 2nd kernel (the first kernel has similar issues):

shared_label[threadIdx.x*32+threadIdx.y] =  label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float)+threadIdx.x*32+threadIdx.y];

In particular, consider this indexing:

threadIdx.x*32+threadIdx.y

CUDA threads are grouped into warps in X, Y, Z order. That means the rapidly varying index within a warp will be the X index first, then Y, then Z. So if I have a 16x16 threadblock, for example, the first warp will have threads with threadIdx.x spanning from 0 to 15 and threadIdx.y spanning only 0 to 1. Adjacent threads in this case will mostly have adjacent threadIdx.x indices.
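To make that mapping explicit, here is a small sketch of how a 2-D block is linearized into warps (the helper names are not from the question):

// threadIdx.x varies fastest in the linear thread ID, so the 32 lanes of a
// warp mostly differ in their x index.
__device__ unsigned int linear_tid(void){
  return threadIdx.y * blockDim.x + threadIdx.x;
}

__device__ unsigned int warp_id(void){
  return linear_tid() / warpSize;   // warpSize is 32 on this hardware
}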

The result for your code is that you have broken coalescing due to your indexing. If you can restructure your loading and storage to use this type of indexing:

threadIdx.y*32+threadIdx.x

You will suddenly see a significant improvement in your global load efficiency. (Your shared memory usage may be better as well.)
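For example, the load line from the second kernel, rewritten with this indexing, would look like the sketch below (it keeps the original names and the size guard from Version 2):

// Consecutive threads in a warp now differ in threadIdx.x, so they read
// consecutive floats from global memory and the loads can coalesce.
if(threadIdx.y*32+threadIdx.x < sizeof(struct LABEL_2D)/sizeof(float))
  shared_label[threadIdx.y*32+threadIdx.x] = 
          label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float)+threadIdx.y*32+threadIdx.x];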

I realize you have two questions, and I'm puzzled when I think about the first one. You've told us the "computation time" is approximately 4 times longer for the second implementation, but presumably you are referring to the compute_interpolation kernel, for which you haven't shown any details at all, except that in the second case you are launching 4 times as many threads. Perhaps there is no mystery here, but you haven't shown that code. Also, using a kernel just to load a bunch of stuff into shared memory and then exit makes no sense: shared memory contents do not persist from one kernel call to the next.

0 votes

I solved my problem: the memory access pattern was not correct in the previous versions. After reading section 6.2.1 of the CUDA Best Practices Guide, I discovered that accesses are faster if they are aligned.

To align my access pattern, I added a "fake" padding array to the structure so that the structure size is a multiple of 128 bytes (the cache line size).

With this strategy I obtain good performance: loading 2000 structures into 2000 blocks takes only 0.16 ms.

Here is this version of the code:

struct TEST_ALIGNED{
  float data[745];
  float aligned[23];
}; 
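// Sanity check (not part of the original code; assumes nvcc with C++11 support):
// 745 payload floats + 23 padding floats = 768 floats = 3072 bytes,
// i.e. 24 full 128-byte cache lines per structure.
static_assert(sizeof(struct TEST_ALIGNED) % 128 == 0,
              "TEST_ALIGNED size must be a multiple of the 128-byte cache line");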


__global__
void load_structure_v4(float * structure){

  // Shared structure within a block
  __shared__ float s_structure[768];
  __shared__ struct TEST_ALIGNED * shared_structure;

  s_structure[threadIdx.x] = 
    structure[blockIdx.x*sizeof(struct TEST_ALIGNED)/sizeof(float) + threadIdx.x];
  s_structure[threadIdx.x + 256] = 
    structure[blockIdx.x*sizeof(struct TEST_ALIGNED)/sizeof(float) + threadIdx.x + 256];
  if(threadIdx.x < 745)
    s_structure[threadIdx.x + 512] = 
      structure[blockIdx.x*sizeof(struct TEST_ALIGNED)/sizeof(float) + threadIdx.x + 512];

  if(threadIdx.x == 0)
    shared_structure = (struct TEST_ALIGNED*) s_structure;

  __syncthreads();

  return;
}

dim3 dimBlock(256);
load_structure_v4<<<2000,dimBlock>>>((float*)d_test_aligned);

I am still looking for optimizations, and I will post them here if I find any.