CUDA. Data Organization

Question

In my app each thread needs it's own matrix of data. Let's say, I have T threads, and each thread works with different matrix D[M][N].

My question: how to organize the data structure?

My solution: I define an array A of T*M*N elements. To avoid bank conflicts, I store firstly element D[0][0] T times for each thread, then D[0][1] ... D[0][M-1], D[1][0] and so on (if you look at this array like at matrix T * (M*N), you'll have one column for each thread). In this way I have the same elements for different threads in different memory banks. Correspondingly, I access element D[i][j] for thread x in the following way: D[i][j](x) == A[T * (M * i + j) + x].

My problem: it's computationally expensive to calculate complicated indexes.

P.S. I have Nvidia Tesla C2075 (CUDA 2.0).

You mention bank conflicts. Are you intending to store each D[M][N] for each thread in shared memory? What are the approximate sizes of M and N? Since you have already flattened things into A[], I'm not sure how computationally expensive it will be. Have you looked at the PTX code to see what the compiler is doing? If each thread is accessing a successive element, it seems like there is just a fixed offset that needs to be added to go from one element to the next. At most it may be one multiply. And if adjacent threads elements are adjacent, you will have opportunities for coalescing. — Robert Crovella
You may be over-engineering your solution. Are the D matrixes read-only? If so, you can use the const qualifier in the kernel argument list to help handle bank conflicts. In general, duplicating values to avoid conflicts may be counter-productive as it renders L1 and L2 caching less efficient. Only consider more complicated solutions after having verified with the profiler that the most simple solution is not optimal. It could be that your algorithm is compute bound, rendering how you address memory a moot point. — Roger Dahl
Are the elements of a matrix need for computing the current matrix of a thread? If no, your problem is basically a pixel-wise operation done sequentially in the column dimension. In that case I suspect that neither constant, nor shared memory will help you. If the number of elements in per column is bigger than the number per row you should think in another approach to compute your problem. — pQB
M and N can be up to few hundreds, so I use global memory. And yes, I have this flattening, but I don't like it, can I avoid it somehow? I'm not very familiar with PTX code, could you please give me some links, where I can read about that and tell me, how looking into PTX code can help me? And I have one multiply to calculate index and one multiply in addition to calculate offset. — Andrew
Roger, it looks like over-engineering to me too, but I don't see the better way. D is not read-only, and I'm not duplicating values, each matrix is different. I don't like complicated solutions, so what is the most simple solution on your opinion? — Andrew

CygnusX1 CygnusX1 · Accepted Answer · 2012-12-05T22:33:19

You say that M and N can be few hundred. For that you won't be able to use shared memory much (if at all). You may watch global memory consumption carefully too (although Tesla have a lot of memory)! 200x200 x 3584threads (bare minimum for C2075 in my opinion) x sizeof(int) - that makes 547MB of data.

Global memory access patterns work differently. Global memory is divided into segments of 32, 64 and 128B. The cost of read is approximately the number of different segment access per warp. In short it usually boils down to - the more spread-out your access is - the worse.

So, unless every thread access its own matrix at the same index (at least most of the time), no memory organization will be efficient. However, if the above is true - then the layout you are describing may work.

Also, if you have spread-out access patterns, disabling L1 caching may help. This is because L1 cache line is 128B, but L2 only 32B - so you may reduce over-fetching. At least - try it :)

To ease the pain of accessing the array I would do something like this:

//let the kernel dimentions be known at compile time - you can safe some computation and registers
//assuming one-dimentional kernels

static const int blockSize = ...; //equivalent to blockDim
static const int gridSize = ...; //equivalent to gridDim
static const int rowSize = blockSize * gridSize;

template <typename T, int M, int N>
class MyMatrix {
private:
  T* data; //flattened array in global memory
  int tid;
public:
  __device__ inline MyMatrix(T* dataIsHere) : data(dataIsHere) {
    tid = threadIdx.x+blockDim.x*blockIdx.x;
  }
  __device__ inline T& operator()(int x, int y) {
    return data[(y*M+x)*rowSize+tid];
  }
}

//assuming the matrix size is 200x200 and consists of ints:

__global__ void myKernel(int* flattenedMatrices) {
  MyMatrix<int,200,200> matrix(flattenedMatrices);

  ...

  matrix(2,4) = .... // happily access the matrix for both loads and stores
}

CUDA. Data Organization

1 Answers