1 vote

I want to compute a row-sum of an m x n matrix A, or equivalently the column-sum of its transpose A' (I have both in memory, so using A' costs me nothing extra). I plan to launch m threads, each of which can either loop over the n columns of A or the n rows of A'. Which approach will be faster if we assume the matrices are stored in column-major format (i.e. as in CUBLAS)?

My thinking so far (on coalesced memory access):

If I row-sum, then at each iteration the threads in a block read from adjacent memory locations. If I column-sum instead, each thread iterates over its own contiguous block of memory. So for threads 1, 2 and 3 of the same block, the memory accesses look like this (assuming column-major storage):

1 2 3 ... 1 2 3 ... 1 2 3 ... for row-sums
1 1 1 ... 2 2 2 ... 3 3 3 ... for column-sums
  • But this doesn't tell me which will be faster.
  • It also doesn't take into account behavior at the block level (i.e. if the first block launched sums rows 1-32, is the second block guaranteed to sum rows 33-64?)
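
For concreteness, here is a minimal sketch of the two kernels I'm comparing (the kernel names, float type and launch configuration are illustrative only), with A stored column-major so element (i, j) is A[i + j*m], and A' stored column-major so element (j, i) is At[j + i*n]:

// Option 1: thread i sums row i of A (m x n, column-major).
// At each j, consecutive threads read consecutive addresses.
__global__ void row_sums(const float *A, float *out, int m, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    float s = 0.0f;
    for (int j = 0; j < n; ++j)
        s += A[i + (size_t)j * m];    // element (i, j)
    out[i] = s;
}

// Option 2: thread i sums column i of A' (n x m, column-major),
// so each thread walks its own contiguous block of n floats.
__global__ void col_sums(const float *At, float *out, int m, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    float s = 0.0f;
    for (int j = 0; j < n; ++j)
        s += At[j + (size_t)i * n];   // element (j, i)
    out[i] = s;
}

Both would be launched the same way, e.g. row_sums<<<(m + 255) / 256, 256>>>(d_A, d_out, m, n);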

2 Answers

4 votes

"threads in the same block will read from adjacent memory locations at each iteration"

Is faster. That is pretty much the definition of coalesced access.
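
To make that concrete with the kernels sketched in the question (assuming 4-byte floats and the usual 32-thread warps): at iteration j of row_sums, threads t = 0..31 of a warp read A[t + j*m] — 32 consecutive 4-byte words, which the hardware can service as a single coalesced memory transaction. At the same iteration of col_sums the warp reads At[j + t*n], addresses 4n bytes apart, so for any reasonably large n each thread's load lands in a different memory segment and the warp needs a separate transaction per thread.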

-1 votes

For a single thread:

for (i = 0; i < size; i++)
   for (j = 0; j < size; j++)
      sum += array[i][j];   /* row-major: consecutive addresses */

Will be faster than

for (i = 0; i < size; i++)
   for (j = 0; j < size; j++)
      sum += array[j][i];   /* stride of 'size' between accesses */

because with C's row-major layout each row of the array is stored contiguously in memory, so the first version's inner loop walks consecutive addresses.
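
One rough way to check this is to time both traversal orders over the same array; a minimal, self-contained sketch (the SIZE constant and the clock()-based timing are illustrative only):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 4096   /* 4096 x 4096 doubles = 128 MB, well past any cache */

int main(void)
{
    double *array = malloc((size_t)SIZE * SIZE * sizeof *array);
    if (!array) return 1;
    for (size_t k = 0; k < (size_t)SIZE * SIZE; k++)
        array[k] = 1.0;

    double sum = 0.0;
    clock_t t = clock();
    for (int i = 0; i < SIZE; i++)        /* row-wise: consecutive addresses */
        for (int j = 0; j < SIZE; j++)
            sum += array[(size_t)i * SIZE + j];
    printf("row-wise:    %.2fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (int i = 0; i < SIZE; i++)        /* column-wise: stride of SIZE */
        for (int j = 0; j < SIZE; j++)
            sum += array[(size_t)j * SIZE + i];
    printf("column-wise: %.2fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    free(array);
    return sum == 0.0;  /* use sum so the loops aren't optimized away */
}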

But for multiple threads it is not as clear-cut. If you spawn M threads on an M-core CPU, who knows what will happen. Your L1 cache will be core-specific, but your shared L3 cache will probably not be very helpful, assuming the overall matrix is larger than the cache. I think it's fair to say that there are too many possibilities to give a definitive answer. A few thoughts: