I use cusparse and cublas to compute a sparse-dense multiplication: C = A’ * B.
A is an M*N sparse matrix.
B is an M*S dense matrix.
M = 9,633,792, N = 617,004, nnz = 28,901,376, S = 3.
I have tried several methods to make it faster:
1. A is stored in CSR format, and cusparseScsrmm computes A’*B. This takes 180 ms.
2. A’ = At is stored in CSR format in advance, and cusparseScsrmm2 computes At*(B’)’. B is transposed here to improve its memory access pattern. According to the documentation, when op(B) = B^T only op(A) = A is supported, which is why At is stored explicitly. Transposing B takes 8 ms and the multiplication takes 4 ms, so 12 ms altogether.
3. A’ = At is stored in CSR format, and cusparseScsrmm computes At*B directly. This takes 8 ms.
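To make the layout point in method 2 concrete, here is a small CPU-only sketch of what transposing B changes. It assumes B is held column-major (which, as far as I understand, is what the cuSPARSE dense arguments expect), so the S = 3 values belonging to one row of B sit M elements apart; after the transpose they are contiguous. The helper name `to_row_major` is hypothetical and only illustrates the index mapping, it is not a replacement for cublasSgeam.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Re-lay out an M x S matrix from column-major storage,
// where element (i, s) lives at index i + s * M,
// to row-major storage, where element (i, s) lives at index i * S + s.
// Hypothetical helper for illustration only (CPU, out-of-place).
std::vector<float> to_row_major(const std::vector<float>& b_col_major,
                                std::size_t M, std::size_t S) {
    assert(b_col_major.size() == M * S);
    std::vector<float> b_row_major(M * S);
    for (std::size_t s = 0; s < S; ++s) {
        for (std::size_t i = 0; i < M; ++i) {
            // After this copy, the S values of row i are adjacent in memory.
            b_row_major[i * S + s] = b_col_major[i + s * M];
        }
    }
    return b_row_major;
}
```

With the row-major copy, every non-zero of At that refers to row i of B reads S consecutive floats instead of S values strided by M, which is the access pattern method 2 is paying the 8 ms transpose for.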
A is constant across iterations, so the time spent operating on A does not need to be counted, but the time spent operating on B does. More specifically, A is a binary matrix with exactly 3 non-zero values in every row.
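Because A is binary with 3 non-zeros per row, the product needs no stored values and no multiplications, only additions of rows of B. The sketch below is a CPU reference of that idea, not a tuned GPU kernel. It assumes a hypothetical fixed-width index array `cols` (3 column indices per row of A) and row-major B and C; the function name `binary_at_times_b` is likewise made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C = A' * B for an M x N binary matrix A with exactly 3 ones per row.
// cols[i] holds the column indices of the 3 ones in row i of A (assumed layout).
// B is M x S row-major, the returned C is N x S row-major.
std::vector<float> binary_at_times_b(const std::vector<std::array<int, 3>>& cols,
                                     const std::vector<float>& B,
                                     std::size_t N, std::size_t S) {
    const std::size_t M = cols.size();
    assert(B.size() == M * S);
    std::vector<float> C(N * S, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {
        for (int j : cols[i]) {
            assert(j >= 0 && static_cast<std::size_t>(j) < N);
            // A(i, j) == 1, so row i of B is simply added into row j of C.
            for (std::size_t s = 0; s < S; ++s) {
                C[static_cast<std::size_t>(j) * S + s] += B[i * S + s];
            }
        }
    }
    return C;
}
```

This is the scatter form of the product. A parallel version of it would have several threads adding into the same row of C, so it would need atomic adds; the gather form over the rows of At (methods 2 and 3) avoids that, at the cost of rows with varying numbers of non-zeros.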
So I’m wondering whether there is any method that could speed this up. 4 ms may be acceptable. For example, is there a way to improve the memory access pattern of matrix B that is not itself time-consuming? I also considered storing A in constant memory, but CUDA seems to have only 64 KB of constant memory. Another idea was storing B in texture memory, but that is read-only, so it may not be suitable.
/**** supplement ***/
The GPU I use is a GTX TITAN X, and I use cublasSgeam to transpose matrix B.