0 votes

I would like to write, in C++ TensorFlow, a sparse matrix dense vector (SpMV) multiplication: y = Ax

The sparse matrix, A, is stored in CSR format. The usual sparsity of A is between 50% and 90%. The goal is to reach a better or similar time than that of dense matrix dense vector (DMV) multiplication.
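For concreteness, this is the layout I mean by CSR (a small illustrative example, 0-based indices):

```cpp
// The 3x4 matrix
//   [10  0  0  2]
//   [ 0  3  0  0]
//   [ 0  0  7  8]
// in CSR form:
double values[]  = {10, 2, 3, 7, 8};  // the nonzeros, row by row
int    col_idx[] = { 0, 3, 1, 2, 3};  // column index of each nonzero
int    row_ptr[] = { 0, 2, 3, 5};     // row i occupies [row_ptr[i], row_ptr[i+1])
```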

Please note that I have already viewed the following posts: Q1 Q2 Q3. However, I am still wondering about the following:

  1. How does SpMV compare to DMV in terms of time? Since the sparsity is relatively high, I assume that SpMV should be faster given the reduction in the number of operations. Is that right?
  2. What should I take into account to make SpMV match or beat DMV in terms of time? Why do people say that DMV will perform better than SpMV? Does the storage representation make a difference?
  3. Are there any recommended libraries that do SpMV in C++, for either CPU or GPU?

This question is relevant to my other question here: (CSCC: Convolution Split Compression Calculation Algorithm for Deep Neural Network)

Sparsity of 50% isn't worth using any sparse matrix format. To make it worthwhile, it needs to be much lower, like 1%. – ALX23z
@HossamAmer Sparse formats require storing at least a column or row index per entry, as well as the entry itself. So assuming 32-bit indices and values, that is a factor > 2 for storage. And the data is less regular and harder on the cache (you also probably lose vectorization). – Joe
@Joe Thanks! I edited my question; let's focus only on time. I would like my execution time to be better than DMV. I have seen the following link that explains CSR multiplication. Even with the algorithm therein, will the time be worse? I think that memory access is contiguous and cache-friendly, plus there is a smaller number of (add & multiply) ops, no? – Hossam Amer
Welcome to SO; regarding your 3rd question, please do take some time to read What topics can I ask about here?, and notice that questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for SO. – desertnaut
Okay, noted. Thank you! :) – Hossam Amer

1 Answer

2 votes

To answer the edited question:

  1. Unless the matrix is very sparse (<10% nonzeros on CPU, probably <1% on GPU), you will likely not benefit from the sparsity. While the number of floating point operations is reduced, the amount of storage is at least double (column or row index + value), memory access is irregular (you have an indirection via the index for the right-hand side), it becomes far more difficult to vectorize (or to achieve coalescing on the GPU), and if you parallelize you have to deal with the fact that rows are of varying length, so a static schedule is likely to be suboptimal. (See the CSR kernel sketch after this list.)
  2. Beyond the points above, yes, the storage representation matters. For example, a COO matrix stores two indices plus the value per entry, while CSR/CSC store only one but require an additional offset array, which makes them more complex to build on the fly (see the COO/CSR sketch after this list). Especially on the GPU, storage formats matter if you want to achieve at least some coalescing. This paper looks into how storage formats affect performance on the GPU: https://onlinelibrary.wiley.com/doi/full/10.1111/cgf.13957
  3. For something generic, try Eigen, or cuSPARSE on the GPU (see the Eigen example after this list). There are plenty of others that perform better for specific use cases, but this part of the question isn't clearly answerable.
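To make point 1 concrete, here is a minimal CSR SpMV sketch (hypothetical struct and field names, not from any particular library). Note the indirect load x[A.col_idx[j]] and the varying row lengths; with OpenMP, a dynamic or guided schedule usually copes with the imbalance better than a static one:

```cpp
#include <vector>

// Hypothetical CSR container: one value and one column index per nonzero,
// plus a row-offset array of length rows + 1.
struct CsrMatrix {
    int rows = 0;
    std::vector<int>    row_ptr;  // size rows + 1
    std::vector<int>    col_idx;  // size nnz
    std::vector<double> values;   // size nnz
};

// y = A * x. The x[A.col_idx[j]] access is the irregular, cache-unfriendly
// indirection mentioned in point 1; the compiler usually cannot vectorize it
// as well as the dense inner loop.
void spmv_csr(const CsrMatrix& A, const std::vector<double>& x,
              std::vector<double>& y) {
    #pragma omp parallel for schedule(dynamic, 64)  // rows have varying length
    for (int i = 0; i < A.rows; ++i) {
        double sum = 0.0;
        for (int j = A.row_ptr[i]; j < A.row_ptr[i + 1]; ++j)
            sum += A.values[j] * x[A.col_idx[j]];
        y[i] = sum;
    }
}
```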
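For point 2, the difference between the formats is easiest to see in the data layout. Here is a sketch of COO versus CSR, and of the counting pass that building the CSR offset array requires (illustrative types; assumes no duplicate entries):

```cpp
#include <cstddef>
#include <vector>

// COO: two full index arrays; trivial to append to while assembling.
struct CooMatrix {
    std::vector<int>    row_idx, col_idx;  // both size nnz
    std::vector<double> values;            // size nnz
};

// CSR: one index array plus a compressed offset array; cheaper to store and
// traverse row by row, but needs an extra pass to build.
struct CsrMatrix {
    std::vector<int>    row_ptr;           // size rows + 1
    std::vector<int>    col_idx;           // size nnz
    std::vector<double> values;            // size nnz
};

CsrMatrix coo_to_csr(const CooMatrix& A, int rows) {
    CsrMatrix out;
    out.row_ptr.assign(rows + 1, 0);
    for (int r : A.row_idx) ++out.row_ptr[r + 1];  // count nonzeros per row
    for (int i = 0; i < rows; ++i)                 // prefix sum -> offsets
        out.row_ptr[i + 1] += out.row_ptr[i];
    out.col_idx.resize(A.values.size());
    out.values.resize(A.values.size());
    std::vector<int> next(out.row_ptr.begin(), out.row_ptr.end() - 1);
    for (std::size_t k = 0; k < A.values.size(); ++k) {
        const int dst = next[A.row_idx[k]]++;      // next free slot in that row
        out.col_idx[dst] = A.col_idx[k];
        out.values[dst]  = A.values[k];
    }
    return out;
}
```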
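And for point 3, SpMV with Eigen is a one-liner once the matrix is assembled. A toy example (Eigen stores column-major/CSC by default; Eigen::RowMajor gives a CSR-like layout):

```cpp
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <vector>

int main() {
    const int n = 1000;

    // Assemble from (row, col, value) triplets; here a toy diagonal matrix.
    std::vector<Eigen::Triplet<double>> entries;
    for (int i = 0; i < n; ++i)
        entries.emplace_back(i, i, 2.0);

    Eigen::SparseMatrix<double, Eigen::RowMajor> A(n, n);
    A.setFromTriplets(entries.begin(), entries.end());

    Eigen::VectorXd x = Eigen::VectorXd::Random(n);
    Eigen::VectorXd y = A * x;  // the SpMV
    return 0;
}
```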

Beyond the matrix format itself, even the ordering of entries in your matrix can have a massive impact on performance, which is why the Cuthill-McKee algorithm is often used to reduce matrix bandwidth (and thereby improve cache performance); a hand-rolled sketch of the idea follows.
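This is a simplified, self-contained sketch of reverse Cuthill-McKee (the adjacency-list input and the choice of starting vertex are simplifications; production implementations start from a pseudo-peripheral, low-degree vertex):

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// adj[i] holds the column indices of the nonzeros in row i of a symmetric
// sparsity pattern. Returns order such that order[k] is the old index of
// the row/column placed at position k.
std::vector<int> reverse_cuthill_mckee(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> order;
    order.reserve(n);
    std::vector<char> visited(n, 0);

    for (int seed = 0; seed < n; ++seed) {  // loop handles disconnected parts
        if (visited[seed]) continue;        // (classic CM picks a low-degree seed)
        std::queue<int> q;
        q.push(seed);
        visited[seed] = 1;
        while (!q.empty()) {
            const int v = q.front(); q.pop();
            order.push_back(v);
            // Enqueue unvisited neighbors in order of increasing degree.
            std::vector<int> nbrs;
            for (int w : adj[v])
                if (!visited[w]) { nbrs.push_back(w); visited[w] = 1; }
            std::sort(nbrs.begin(), nbrs.end(), [&](int a, int b) {
                return adj[a].size() < adj[b].size();
            });
            for (int w : nbrs) q.push(w);
        }
    }
    std::reverse(order.begin(), order.end());  // the "reverse" in RCM
    return order;
}
```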