I'm running into (what I believe are) shared-memory bank conflicts in a CUDA kernel. The code itself is fairly complex, but I have reproduced the problem in the simple example attached below. In this case it is simplified to a plain copy from global -> shared -> global memory of a 2D array of size 16x16, using a shared-memory array that may be padded on the right side (variable ng).
If I compile the code with ng=0 and examine the shared-memory access pattern with NVVP, it tells me that there are "no issues". With e.g. ng=2 I get "Shared Store Transactions/Access = 2, Ideal Transactions/Access = 1" at the lines marked with "NVVP warning". I don't understand why (or, more specifically, why the padding causes the warnings).
EDIT: as mentioned by Greg Smith below, on Kepler there are 32 banks that are 8 bytes wide (http://gpgpu.org/wp/wp-content/uploads/2013/09/08-opti-smem-instr.pdf, slide 18), but I don't see how that changes the problem.
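In case it matters for reproducing this: as far as I know, the bank size on Kepler can be queried and selected explicitly through the runtime API. A minimal sketch (cudaSharedMemBankSizeFourByte should be the default; I have not checked whether switching modes changes the NVVP numbers below):

#include <cuda_runtime.h>

// Sketch: select Kepler's 8-byte shared-memory bank mode before launching the kernel.
void use_eight_byte_banks()
{
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);                            // current bank-size setting
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);  // request 8-byte banks
}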
If I understand things correctly, with 32 banks (B00, B01, ..) of 4 bytes each, doubles (D00, D01, ..) are stored as:

B00 B01  B02 B03  B04 B05  ..  B30 B31
----------------------------------------
  D00      D01      D02    ..    D15
  D16      D17      D18    ..    D31
  D32      D33      D34    ..    D47
Without the padding, half warps write (as[ijs] = in[ij]) to shared-memory locations D00 .. D15, D16 .. D31, etc. With padding (of size 2), the first half warp writes to D00 .. D15 and the second, after the padding, to D18 .. D33, which still shouldn't cause a bank conflict?
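To make the index arithmetic above concrete, here is a tiny host-side sketch (not part of the kernel) that evaluates the kernel's indexing for the first two half warps and prints the bank I would expect each double to start in, both for 4-byte banks (bank = 2*index mod 32, since a double spans two banks) and for Kepler's 8-byte banks (bank = index mod 32). These formulas are my own assumption about the mapping, not something taken from the documentation:

#include <cstdio>

int main()
{
    const int ni = 16;  // block width, as in the example below
    const int ng = 2;   // padding, as in the example below

    for (int y = 0; y < 2; ++y)          // first two rows of the block (= one full warp)
        for (int x = 0; x < ni; ++x)
        {
            const int ijs   = x + y*(ni+ng);   // padded shared-memory index, as in the kernel
            const int bank4 = (2*ijs) % 32;    // first 4-byte bank touched by this double
            const int bank8 =  ijs    % 32;    // bank in the 8-byte mode
            std::printf("y=%d x=%2d ijs=%2d bank4=%2d bank8=%2d\n", y, x, ijs, bank4, bank8);
        }

    return 0;
}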
Any idea what might be going wrong here?
Simplified example (tested with CUDA 6.5.14):
// Compiled with nvcc -O3 -arch=sm_35 -lineinfo
#include <cstdlib>

__global__ void copy(double * const __restrict__ out, const double * const __restrict__ in, const int ni, const int nj, const int ng)
{
    extern __shared__ double as[];

    const int ij  = threadIdx.x + threadIdx.y*blockDim.x;        // index into the unpadded global arrays
    const int ijs = threadIdx.x + threadIdx.y*(blockDim.x+ng);   // index into the padded shared-memory array

    as[ijs] = in[ij];   // NVVP warning
    __syncthreads();
    out[ij] = as[ijs];  // NVVP warning
}
int main()
{
    const int itot = 16;
    const int jtot = 16;
    const int ng   = 2;
    const int ncells = itot * jtot;

    double *in  = new double[ncells];
    double *out = new double[ncells];
    double *tmp = new double[ncells];

    for (int n=0; n<ncells; ++n)
        in[n] = 0.001 * (std::rand() % 1000) - 0.5;

    double *ind, *outd;
    cudaMalloc((void **)&ind,  ncells*sizeof(double));
    cudaMalloc((void **)&outd, ncells*sizeof(double));

    cudaMemcpy(ind,  in,  ncells*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(outd, out, ncells*sizeof(double), cudaMemcpyHostToDevice);

    dim3 gridGPU (1,  1,  1);
    dim3 blockGPU(16, 16, 1);

    copy<<<gridGPU, blockGPU, (itot+ng)*jtot*sizeof(double)>>>(outd, ind, itot, jtot, ng);

    cudaMemcpy(tmp, outd, ncells*sizeof(double), cudaMemcpyDeviceToHost);

    return 0;
}
D00..D31 are stored in different banks, and D00 and D32 share a bank? I can't seem to find any detailed info on that. – Bart
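(For completeness, the mapping I have in mind for the 8-byte mode is simply bank = double index mod 32, which would indeed put D00 .. D31 in 32 distinct banks and wrap D32 back onto D00's bank; this is only my reading of the slides linked above. A trivial sketch:)

#include <cstdio>

// Sketch of the 8-byte bank mapping I am assuming: double i lives in bank i % 32,
// so D00..D31 land in distinct banks and D32 wraps around to D00's bank.
int main()
{
    for (int i = 0; i < 34; ++i)
        std::printf("D%02d -> bank %2d\n", i, i % 32);
    return 0;
}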