In my app, each thread needs its own matrix of data. Let's say I have T threads, and each thread works with a different matrix D[M][N].
My question: how to organize the data structure?
My solution: I define an array A of T*M*N elements. To avoid bank conflicts, I first store element D[0][0] T times, once per thread, then D[0][1] ... D[0][N-1], then D[1][0], and so on (if you view this array as an (M*N) × T matrix, each thread gets its own column). This way, the same element of different threads' matrices lands in different memory banks. Correspondingly, I access element D[i][j] for thread x in the following way: D[i][j](x) == A[T * (N * i + j) + x].
My problem: calculating such complicated indices is computationally expensive.
P.S. I have an Nvidia Tesla C2075 (compute capability 2.0).
… const qualifier in the kernel argument list to help handle bank conflicts. In general, duplicating values to avoid conflicts may be counter-productive, as it renders L1 and L2 caching less efficient. Only consider more complicated solutions after having verified with the profiler that the simplest solution is not optimal. It could be that your algorithm is compute bound, rendering how you address memory a moot point. – Roger Dahl

… column dimension. In that case I suspect that neither constant nor shared memory will help you. If the number of elements per column is bigger than the number per row, you should think about another approach to your problem. – pQB