
I use cuSPARSE and cuBLAS to compute a sparse-dense multiplication: C = A' * B.

A is an M*N sparse matrix

B is an M*S dense matrix

M = 9,633,792, N = 617,004, nnz = 28,901,376, S = 3

I have tried different methods to make it faster:

  1. A is stored in CSR format; use cusparseScsrmm to compute A'*B. It takes 180 ms.

  2. A' = At is stored in CSR format; use cusparseScsrmm2 to compute At*(B')', transposing B first to improve memory access to B. According to the documentation, if op(B) = B^T then only op(A) = A is supported, which is why I stored At in CSR format in advance. It takes 8 ms to transpose B and 4 ms to compute At*(B')', 12 ms altogether (see the call sketch after this list).

  3. A' = At is stored in CSR format; use cusparseScsrmm to compute At*B (= A'*B). It takes 8 ms.
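For reference, a minimal sketch of the method-2 call, assuming At and the transposed B are already on the device; the handle, descriptor, and pointer names (dAtVal, dBt, dC, etc.) are placeholders, not from the original post:

    // C = At * (Bt)^T = A' * B, with At (N x M) in CSR and Bt = B' (S x M)
    // stored column-major with ldb = S. Per the cuSPARSE docs, op(B) = B^T
    // requires op(A) = A, which is why At must be stored explicitly.
    const float alpha = 1.0f, beta = 0.0f;
    cusparseScsrmm2(handle,
                    CUSPARSE_OPERATION_NON_TRANSPOSE,  // op(At) = At
                    CUSPARSE_OPERATION_TRANSPOSE,      // op(Bt) = Bt^T = B
                    N, S, M, nnz,
                    &alpha, descrA,
                    dAtVal, dAtRowPtr, dAtColInd,
                    dBt, S,      // Bt is S x M, ldb = S
                    &beta,
                    dC, N);      // C = A'*B is N x S, ldc = N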

A is constant across iterations, so time spent preparing A need not be counted, but time spent operating on B must be. More specifically, A is a binary matrix with 3 non-zero values in every row.

So I'm wondering: is there any method that could speed this up? 4 ms might be acceptable, for example by improving the memory access pattern on matrix B without the extra transpose cost. I also considered using constant memory to store A, but CUDA seems to offer only 64 KB of constant memory; or texture memory to store B, but it is read-only, which may not be suitable.

/**** supplement ****/

The GPU I used is a GTX TITAN X, and I use cublasSgeam to transpose matrix B (sketched below).
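For reference, the transpose call looks like this (a sketch; the handle and pointer names dB, dBt are placeholders):

    // Bt = B': B is M x S column-major (ld = M), Bt is S x M (ld = S).
    // geam computes C = alpha*op(A) + beta*op(B); with beta = 0 the
    // second input matrix is ignored.
    const float one = 1.0f, zero = 0.0f;
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                S, M,             // dimensions of the result Bt
                &one,  dB,  M,    // op(A) = B'
                &zero, dBt, S,    // unused because beta = 0
                dBt, S);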

Comments:

How do you transpose B? geam()? It seems much slower than the achievable peak. – kangshiyin

Yes, I will try to write a kernel following your advice @kangshiyin. – zjhthu

1 Answer


I don't know which device you are using, but that transpose speed is quite slow compared with even a plain device-to-device copy of B on a high-end GPU:

M * S * sizeof(float) / 8e-3 / 1e9 = 9,633,792 * 3 * 4 B / 8 ms ≈ 14 GB/s

That is far below the peak memory bandwidth of a high-end card (about 336 GB/s on a GTX TITAN X).

If you use cublasSgeam() to transpose matrix B, the problem is probably that the matrix is too thin (3 columns) and the routine is not well optimized for that case. You could implement your own transpose kernel and optimize it for a 3-column matrix. General matrix-transpose kernels work well only on dimensions that are a multiple of 32, or much larger than 32, but their code is a good starting point for your own:

https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/
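As a concrete starting point, here is a minimal sketch of such a specialized kernel, assuming B is M x 3 column-major (ld = M) and Bt is 3 x M column-major (ld = 3), as cuBLAS would lay them out; the kernel name and launch configuration are illustrative:

    __global__ void transposeMx3(const float* __restrict__ B,
                                 float* __restrict__ Bt, int M)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // row index of B
        if (i < M) {
            // Reads from each column of B are fully coalesced across a
            // warp; the 3 writes per thread form a contiguous 96-float
            // span per warp, which also coalesces well.
            float x = B[i];
            float y = B[i + M];
            float z = B[i + 2 * M];
            Bt[3 * i + 0] = x;
            Bt[3 * i + 1] = y;
            Bt[3 * i + 2] = z;
        }
    }

    // launch, e.g.: transposeMx3<<<(M + 255) / 256, 256>>>(dB, dBt, M);

This touches each element exactly once for a read and once for a write, so it should run close to memory bandwidth rather than the ~14 GB/s measured above.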