8 votes

Recently, while using cuSPARSE and cuBLAS from CUDA Toolkit 6.5 to do sparse matrix multiplication, I found cuSPARSE to be much slower than cuBLAS in every case I tried!

In all my experiments I used cusparseScsrmm from cuSPARSE and cublasSgemm from cuBLAS. In the sparse matrix, half of the elements are zero. The GPU is an NVIDIA Titan Black, and all timings were obtained with the nvvp profiler provided by NVIDIA. Below are some of the results:
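For reference, this is roughly how I invoke the two routines (a minimal sketch, not my exact code: handle creation, error checking, device allocation and the dense-to-CSR conversion are omitted, and all dense matrices are column-major as both libraries expect):

    #include <cublas_v2.h>
    #include <cusparse_v2.h>

    /* A is m x k (dense in dA, CSR in dA_val/dA_rowPtr/dA_colInd),
       B is k x n, C is m x n; all pointers are device pointers.
       Each call is timed separately with nvvp.                    */
    void run_both(cublasHandle_t blas, cusparseHandle_t sparse,
                  cusparseMatDescr_t descrA,
                  int m, int n, int k, int nnz,
                  const float *dA,
                  const float *dA_val, const int *dA_rowPtr, const int *dA_colInd,
                  const float *dB, float *dC)
    {
        const float alpha = 1.0f, beta = 0.0f;

        /* Dense path: C = A * B */
        cublasSgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);

        /* Sparse path: C = A * B, with A stored in CSR */
        cusparseScsrmm(sparse, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       m, n, k, nnz, &alpha, descrA,
                       dA_val, dA_rowPtr, dA_colInd,
                       dB, k, &beta, dC, m);
    }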

Experiment A:

  1. sparse matrix size: 192x2400
  2. dense matrix size: 2400x256
  3. cusparse time: 1.4ms
  4. cublas time: 0.21ms

Experiment B:

  1. sparse matrix size: 192x75
  2. dense matrix size: 75x1024
  3. cusparse time: 0.27ms
  4. cublas time: 0.04ms

These results look very odd to me. Since cuSPARSE is designed specifically for sparse matrix manipulation, how can it be even slower than cuBLAS? If that is the case, there would be no need to use cuSPARSE at all. Could you explain these results? Also, could you suggest any other ways to speed up sparse matrix multiplication?


1 Answer

16 votes

I don't think you can classify a matrix with half of its elements zero as "sparse": the timings you have found are reasonable (actually, the sparse algorithm is behaving pretty well!).

Sparse algorithms are efficient only when most of the elements are zero (for example, matrices arising from finite element problems).

This holds true for CPUs, not only for GPUs: there is an important overhead in treating the matrix as sparse, and it only becomes worthwhile to use sparse algorithms when... most of the elements are zero (typically ten or fewer non-zeros per row, for matrices of rank thousands to hundreds of thousands, or even millions).

There are other matrix shapes with efficient solution algorithms that you can try if they apply to your problem, e.g. band matrices. I don't know whether they have been ported to cuBLAS, though.

About the overheads

Dense linear algebra algorithms can perform optimally because processors have been designed to solve such systems as efficiently as possible. Consider the DGEMM operation (matrix-matrix multiply): for large matrices (i.e., matrices that do not fit in any cache of the system) it lets you use the processor at more than 95% of its theoretical peak floating-point performance. How? Mainly through the following (a toy illustration follows the list):

  • prefetching
  • optimal cache usage
  • vectorization (SSE, AVX)
  • pipelining
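As a rough illustration of the cache-usage point (my own toy sketch, not how an actual BLAS is written): a blocked matrix multiply keeps small tiles of A, B and C hot in cache and walks memory with unit stride, which is exactly the access pattern prefetchers and vector units are built for.

    /* Toy cache-blocked SGEMM: C += A * B, row-major n x n matrices
       (C must be zeroed by the caller). Real BLAS libraries add register
       blocking, vectorization and prefetching on top of this idea.      */
    #define BS 64   /* tile size chosen so the working tiles fit in cache */

    void sgemm_blocked(int n, const float *A, const float *B, float *C)
    {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    for (int i = ii; i < ii + BS && i < n; ++i)
                        for (int k = kk; k < kk + BS && k < n; ++k) {
                            float a = A[i * n + k];               /* reused across j */
                            for (int j = jj; j < jj + BS && j < n; ++j)
                                C[i * n + j] += a * B[k * n + j]; /* unit stride     */
                        }
    }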

In a sparse LA algorithm only the non-zero elements and their corresponding indices are stored in memory, so memory accesses are indirect. As a result, the sparse algorithm cannot exploit the hardware at the same level of optimization: I don't have specific figures for this context, but 10 to 20% of peak would not be unusual.

The gain, of course, is that operations on zeros (on non-stored elements) are simply not performed, resulting in orders of magnitude fewer operations and much less required storage.
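To make this concrete with the numbers from the question (my own back-of-the-envelope arithmetic, assuming 4-byte floats and 4-byte indices for CSR):

    #include <stdio.h>

    int main(void)
    {
        /* Experiment A from the question: 192 x 2400 matrix, ~50% zeros */
        long rows = 192, cols = 2400;
        long nnz  = rows * cols / 2;            /* ~230,400 stored non-zeros */

        long dense_bytes = rows * cols * 4;      /* one float per element     */
        long csr_bytes   = nnz * 4               /* float values              */
                         + nnz * 4               /* int column indices        */
                         + (rows + 1) * 4;       /* int row pointers          */

        /* Prints about 1.84 MB for both: at 50% density the CSR form is no
           smaller than the dense one, the FLOPs are only halved, and every
           access becomes indirect, so there is very little left to win.     */
        printf("dense: %ld bytes, CSR: %ld bytes\n", dense_bytes, csr_bytes);
        return 0;
    }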

There are further overheads from integer logic and conditionals. Modern CPUs are quite good at overlapping integer and FP operations and at speculative execution, but these extra operations can also prevent vectorization, so they add further cost relative to the dense case.

What about GPUs?

Dense LA algorithms are just as good a fit for GPUs as they are for CPUs: in this case you get optimal use of:

  • coalescing
  • shared memory
  • memory access patterns

Again, the indirect access to matrix elements in sparse LA algorithms prevents exploiting the same level of optimization.
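To see what "indirect access" means in kernel code, here is a deliberately naive CSR sparse matrix-vector kernel (my illustration only; cuSPARSE's csrmm is far more sophisticated than this):

    /* One thread per row of a CSR matrix: y = A * x */
    __global__ void csr_spmv_naive(int m, const int *rowPtr, const int *colInd,
                                   const float *val, const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < m) {
            float sum = 0.0f;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                sum += val[j] * x[colInd[j]];   /* indirect load: address depends
                                                   on data read at run time      */
            y[row] = sum;
        }
    }

The load of x[colInd[j]] depends on data read at run time, so neighbouring threads hit scattered addresses and the memory system cannot coalesce them the way it does for a dense tile.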

References

I can't remember which one I used when I ran into sparse problems... I think it was PSBLAS: http://people.uniroma2.it/salvatore.filippone/psblas/

But here you will be overwhelmed by choices: http://www.netlib.org/utk/people/JackDongarra/la-sw.html