
I am loading data from global memory into shared memory and would like to know whether this load causes a bank conflict. Here is the setup:

In global memory: g_array, a 2D matrix of size (256, 64).

This is how I load the array data from global memory to shared memory. I called the kernel with gridDim (4, 1) and blockDim (16, 16).

real, shared :: s_array(0:15,0:15)
integer :: d_j, d_l, tIdx, tIdy

! CUDA Fortran thread and block indices are 1-based; shift to 0-based
d_j = (blockIdx%x-1) * blockDim%x + threadIdx%x - 1
d_l = (blockIdx%y-1) * blockDim%y + threadIdx%y - 1
tIdx = threadIdx%x - 1
tIdy = threadIdx%y - 1

! Each thread copies one element into the 16x16 shared tile
s_array(tIdx,tIdy) = g_array(d_j,d_l)
call doSomethingWithMySharedMemoryData()
.....
You're loading 10 different values into the same location in shared memory. The code doesn't make sense to me. – Robert Crovella
Robert, thank you. I edited and modified it so that it makes sense. – Adjeiinfo
In CUDA C, one way to check whether bank conflicts occur is to use the Visual Profiler. You may wish to check whether a similar possibility also exists for Fortran. I (fortunately :-) ) abandoned Fortran programming some years ago, but I think in 2010 we had an experience with PGI Fortran and we found the transpose example useful. You can take a look at these two documents: CUDA Fortran for Scientists and Engineers and CUDA Fortran Device Kernels. – Vitality
They discuss the occurrence of bank conflicts and point out possible remedies. You may also wish to take a look at the recent document An Efficient Matrix Transpose in CUDA Fortran, which discusses the same topic. Of course, this comment just adds a very small piece of information to what Robert Crovella has already answered. – Vitality
The standalone Visual Profiler (nvvp) can be used on PGI Fortran programs as well. As @JackOLantern points out, that would be a useful double-check on anything I might say. – Robert Crovella
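
Following up on the profiling suggestion above, here is a minimal command-line sketch, assuming the PGI compiler and a CUDA toolkit whose profiler supports shared-memory metrics (the file and program names are hypothetical):

pgfortran -Mcuda -o bankcheck bankcheck.cuf
nvprof --metrics shared_load_transactions_per_request,shared_store_transactions_per_request ./bankcheck

A value of roughly 1 transaction per request for these metrics indicates conflict-free shared-memory access; larger values indicate replays caused by bank conflicts.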

1 Answer


I haven't actually run your code, and my Fortran is not as good as my C/C++, but generally speaking your code should coalesce well (on the global memory accesses) and not have bank conflicts (on the shared memory accesses).

The important factor is that you have matched the threadIdx%x index with the rapidly varying matrix subscript, which in Fortran is the first index (Fortran arrays are stored in column-major order), whereas in C/C++ it is the second (or last) index (C/C++ matrices are stored in row-major order).
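
To illustrate, here is a minimal sketch (not your exact kernel; the module, kernel, and array names are placeholders) of a copy kernel that is coalesced in Fortran precisely because threadIdx%x indexes the first subscript:

module coalesce_demo_m
  use cudafor
contains
  ! Toy kernel copying a (256,64) matrix, one element per thread
  attributes(global) subroutine coalesce_demo(g_array, out)
    real :: g_array(256,64), out(256,64)
    integer :: d_j, d_l
    d_j = (blockIdx%x-1) * blockDim%x + threadIdx%x
    d_l = (blockIdx%y-1) * blockDim%y + threadIdx%y
    ! Coalesced: consecutive threadIdx%x values touch consecutive
    ! elements of the first (fastest-varying) subscript, so a warp
    ! reads and writes contiguous words of global memory.
    out(d_j,d_l) = g_array(d_j,d_l)
    ! For contrast, g_array(d_l,d_j) would make consecutive threads
    ! stride by 256 words, destroying coalescing.
  end subroutine coalesce_demo
end module coalesce_demo_m

Launched as, e.g., call coalesce_demo<<<dim3(16,4,1), dim3(16,16,1)>>>(d_in, d_out) so the grid covers the whole (256, 64) matrix.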

Since you're not doing anything with the subscripts other than using the thread indices directly, there should be no issue.

In general, with accesses like this one, the same rules you follow to achieve coalesced access to global memory will also let you avoid bank conflicts in shared memory, as the sketch below illustrates.
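
Concretely, a minimal sketch (using the question's names; assuming 4-byte elements and 16x16 blocks) of what does and does not conflict, plus the padding remedy described in the transpose documents linked in the comments:

real, shared :: s_array(0:15,0:15)

! Conflict-free: consecutive threadIdx%x values write consecutive
! 32-bit words, which land in consecutive shared-memory banks.
s_array(tIdx,tIdy) = g_array(d_j,d_l)

! Conflicting, for contrast: swapping the shared-memory subscripts
! makes the threads of a warp write words 16 apart, so several
! threads hit the same bank and the access is serialized.
! s_array(tIdy,tIdx) = g_array(d_j,d_l)

! Common remedy: pad the first dimension by one element, so the
! column stride (17 words) is no longer a multiple of the bank count.
! real, shared :: s_array(0:16,0:15)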