14
votes

I've always worked with linear shared memory (load, store, access neighbours) but I've made a simple test in 2D to study bank conflicts which results have confused me.

The next code read data from one dimensional global memory array to shared memory and copy it back from shared memory to global memory.

__global__ void update(int* gIn, int* gOut, int w) {

// shared memory space
__shared__ int shData[16][16];
// map from threadIdx/BlockIdx to data position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
// calculate the global id into the one dimensional array
int gid = x + y * w;

// load shared memory
shData[threadIdx.x][threadIdx.y] = gIn[gid];
// synchronize threads not really needed but keep it for convenience
__syncthreads();
// write data back to global memory
gOut[gid] = shData[threadIdx.x][threadIdx.y];
}

The visual profiler reported conflicts in shared memory. The next code avoid thouse conflicts (only show the differences)

// load shared memory
shData[threadIdx.y][threadIdx.x] = gIn[gid];

// write data back to global memory
gOut[gid] = shData[threadIdx.y][threadIdx.x];

This behavior has confused me because in Programming Massively Parallel Processors. A Hands-on approach we can read:

matrix elements in C and CUDA are placed into the linearly addressed locations according to the row major convention. That is, the elements of row 0 of a matrix are first placed in order into consecutive locations.

Is this related to shared memory arragment? or with threads indexes? Maybe am I missing something?

The kernel configuration is as follow:

// kernel configuration
dim3 dimBlock  = dim3 ( 16, 16, 1 );
dim3 dimGrid   = dim3 ( 64, 64 );
// Launching a grid of 64x64 blocks with 16x16 threads -> 1048576 threads
update<<<dimGrid, dimBlock>>>(d_input, d_output, 1024);

Thanks in advance.

1
Could you add the block dimensions you are using. I presume it is (16,16,1), but it is good to have it confirmed before answering.talonmies
@talonmies I added the kernel configuration / launch to the question. As you commented I'm using a (16, 16, 1) blockpQB

1 Answers

19
votes

Yes, shared memory is arranged in row-major order as you expected. So your [16][16] array is stored row wise, something like this:

       bank0 .... bank15
row 0  [ 0   .... 15  ]
    1  [ 16  .... 31  ]
    2  [ 32  .... 47  ]
    3  [ 48  .... 63  ]
    4  [ 64  .... 79  ]
    5  [ 80  .... 95  ]
    6  [ 96  .... 111 ]
    7  [ 112 .... 127 ]
    8  [ 128 .... 143 ]
    9  [ 144 .... 159 ]
    10 [ 160 .... 175 ]
    11 [ 176 .... 191 ]
    12 [ 192 .... 207 ]
    13 [ 208 .... 223 ]
    14 [ 224 .... 239 ]
    15 [ 240 .... 255 ]
       col 0 .... col 15

Because there are 16 32 bit shared memory banks on pre-Fermi hardware, every integer entry in each column maps onto one shared memory bank. So how does that interact with your choice of indexing scheme?

The thing to keep in mind is that threads within a block are numbered in the equivalent of column major order (technically the x dimension of the structure is the fastest varying, followed by y, followed by z). So when you use this indexing scheme:

shData[threadIdx.x][threadIdx.y]

threads within a half-warp will be reading from the same column, which implies reading from the same shared memory bank, and bank conflicts will occur. When you use the opposite scheme:

shData[threadIdx.y][threadIdx.x]

threads within the same half-warp will be reading from the same row, which implies reading from each of the 16 different shared memory banks, no conflicts occur.