You say that M and N can be a few hundred. At that size you won't be able to use shared memory much (if at all), and you will need to watch global memory consumption carefully too (although Tesla cards have a lot of memory)!
200x200 x 3584 threads (the bare minimum for a C2075 in my opinion) x sizeof(int) - that already makes 547 MiB of data.
Global memory access patterns work differently. Global memory is divided into segments of 32, 64 and 128 B, and the cost of a read is approximately the number of different segments touched per warp. In short it usually boils down to this: the more spread out your accesses are, the worse.
So, unless every thread accesses its own matrix at the same index (at least most of the time), no memory organization will be efficient. However, if the above is true, then the layout you are describing may work.
Also, if you have spread-out access patterns, disabling L1 caching may help. This is because an L1 cache line is 128 B, while an L2 transaction is only 32 B, so you may reduce over-fetching (on Fermi, if I remember correctly, you can do this with nvcc's -Xptxas -dlcm=cg flag). At least - try it :)
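To see the segment argument in concrete terms, here is a minimal sketch contrasting the two layouts; the kernel names and the hard-coded 200x200/int sizes are illustrative assumptions, not something from your question:

//Interleaved layout: element (x,y) of thread t sits at (y*M+x)*numThreads+t.
//For a fixed (x,y), the 32 threads of a warp read 32 consecutive ints,
//i.e. a single 128 B segment.
__global__ void readInterleaved(const int* data, int* out, int numThreads) {
    int t = threadIdx.x + blockDim.x * blockIdx.x;
    out[t] = data[(4 * 200 + 2) * numThreads + t]; //coalesced
}

//Per-thread-contiguous layout: each thread owns its own 200x200 slab,
//so neighbouring threads read addresses 160000 ints apart - up to 32
//different segments per warp for the same logical access.
__global__ void readContiguous(const int* data, int* out) {
    int t = threadIdx.x + blockDim.x * blockIdx.x;
    out[t] = data[t * 200 * 200 + (4 * 200 + 2)]; //spread out
}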
To ease the pain of accessing the array I would do something like this:
//let the kernel dimensions be known at compile time - you can save some computation and registers
//assuming one-dimensional kernels
static const int blockSize = ...; //equivalent to blockDim
static const int gridSize = ...; //equivalent to gridDim
static const int rowSize = blockSize * gridSize; //total number of threads
template <typename T, int M, int N>
class MyMatrix {
private:
    T* data; //flattened array in global memory
    int tid;
public:
    __device__ inline MyMatrix(T* dataIsHere) : data(dataIsHere) {
        tid = threadIdx.x + blockDim.x * blockIdx.x;
    }
    __device__ inline T& operator()(int x, int y) {
        //interleaved layout: for a fixed (x,y), consecutive threads hit
        //consecutive addresses, so the access is coalesced within a warp
        return data[(y*M+x)*rowSize + tid];
    }
};
//assuming the matrix size is 200x200 and consists of ints:
__global__ void myKernel(int* flattenedMatrices) {
    MyMatrix<int,200,200> matrix(flattenedMatrices);
    ...
    matrix(2,4) = ...; // happily access the matrix for both loads and stores
}
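On the host side you then allocate one flat buffer for all threads. A minimal sketch, assuming the 3584-thread / 200x200-int configuration from above (the 28x128 grid/block split is just one way to reach 3584 threads):

#include <cuda_runtime.h>

int main() {
    const int totalThreads = 3584; //blockSize * gridSize, e.g. 28 blocks x 128 threads
    const size_t bytes = (size_t)200 * 200 * totalThreads * sizeof(int); //~547 MiB
    int* d_matrices;
    cudaMalloc(&d_matrices, bytes);
    cudaMemset(d_matrices, 0, bytes); //start from zeroed matrices
    myKernel<<<28, 128>>>(d_matrices);
    cudaDeviceSynchronize();
    cudaFree(d_matrices);
    return 0;
}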
Comments:

"... const qualifier in the kernel argument list to help handle bank conflicts. In general, duplicating values to avoid conflicts may be counter-productive as it renders L1 and L2 caching less efficient. Only consider more complicated solutions after having verified with the profiler that the most simple solution is not optimal. It could be that your algorithm is compute bound, rendering how you address memory a moot point." – Roger Dahl

"... column dimension. In that case I suspect that neither constant nor shared memory will help you. If the number of elements per column is bigger than the number per row, you should think of another approach to compute your problem." – pQB