4 votes

I'm trying to get a performance advantage from TensorFlow's built-in sparse matrix multiplication API, and keveman recommended that tf.embedding_lookup_sparse is the right way.

But the performance of embedding_lookup_sparse is somewhat disappointing in my experiments. Even though it performs a fairly small matrix multiplication, <1, 3196> by <3196, 1024>, the sparse matmul with 0.1 sparsity fails to beat the dense matrix multiplication.

If my implementation is correct, I think one of the reasons is that TensorFlow uses the COO format, which stores a full (index, value) pair for every nonzero. I'm not an expert in this domain, but isn't it widely known that the CSR format is more performant for this kind of computation? Is there any obvious reason that TensorFlow internally uses COO rather than CSR for its sparse matrix representation?
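For reference, here is a minimal sketch of the kind of comparison I'm running, written against the tf.sparse API (the shapes and density match my experiment, but the use of tf.sparse.sparse_dense_matmul here is a simplification of my actual setup):

    import numpy as np
    import tensorflow as tf

    # A <1, 3196> operand with ~10% nonzeros, multiplied by a dense
    # <3196, 1024> matrix.
    rows, cols, out_dim, density = 1, 3196, 1024, 0.1

    dense_w = tf.random.normal([cols, out_dim])

    # Build the sparse operand in COO form (TensorFlow's SparseTensor).
    mask = np.random.rand(rows, cols) < density
    indices = np.argwhere(mask).astype(np.int64)   # (nnz, 2), row-major order
    values = np.random.randn(len(indices)).astype(np.float32)
    sp = tf.SparseTensor(indices=indices, values=values, dense_shape=[rows, cols])

    # Sparse path vs. dense path on the same data.
    sparse_out = tf.sparse.sparse_dense_matmul(sp, dense_w)
    dense_out = tf.matmul(tf.sparse.to_dense(sp), dense_w)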


1 Answer

6 votes

Just for the record: you say matrix multiplication, but one of your operands is in fact a vector (1 x 3196), so this is really a matrix-vector multiplication (a different BLAS kernel). I will assume you mean matrix-vector multiplication for my answer.
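To illustrate the kernel distinction (my example, using SciPy's BLAS bindings; not from the original question): a 1-row operand reduces the problem from the level-3 routine gemm to the level-2 routine gemv.

    import numpy as np
    from scipy.linalg import blas

    # <1, 3196> x <3196, 1024> is effectively a vector-matrix product:
    # BLAS serves it with gemv (level 2) rather than gemm (level 3).
    A = np.random.randn(3196, 1024).astype(np.float32)
    x = np.random.randn(3196).astype(np.float32)

    y_gemv = blas.sgemv(1.0, A, x, trans=1)       # computes A^T x -> shape (1024,)
    y_gemm = blas.sgemm(1.0, x[None, :], A)[0]    # the same product via gemm
    assert np.allclose(y_gemv, y_gemm, atol=1e-2)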

Yes, CSR should theoretically be faster than COO for matrix-vector multiplication. This is because the storage size in CSR format is O(2*nnz + n) vs O(3*nnz) for COO (where n is the number of rows and nnz the number of nonzeros), and sparse matrix-vector multiplication is in many cases memory bound.
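A quick way to see the storage difference (illustrative only, using scipy rather than TensorFlow; the matrix size is an arbitrary choice of mine): COO keeps three arrays of length nnz, whereas CSR replaces the row array with an (n + 1)-entry row-pointer array.

    import numpy as np
    from scipy import sparse

    # Count stored entries for a random matrix at 10% density.
    n, m, density = 1024, 3196, 0.1
    a = sparse.random(n, m, density=density, format="coo", dtype=np.float32)

    coo_size = a.row.size + a.col.size + a.data.size               # 3 * nnz
    csr = a.tocsr()
    csr_size = csr.indptr.size + csr.indices.size + csr.data.size  # 2 * nnz + n + 1

    print(coo_size, csr_size)  # CSR saves roughly nnz - n - 1 entries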

The exact performance difference compared to a dense matrix multiplication varies, though, depending on the problem size, sparsity pattern, data type and implementation. It is difficult to say off the bat which should be faster, because the sparse storage format introduces indirection, which potentially leads to reduced locality and poor(er) utilisation of arithmetic units (e.g. no use of vectorisation).

Particularly when the matrix and vector are so small that almost everything fits in cache, I would expect limited performance benefits. Sparse matrix structures are typically most useful for truly large matrices, ranging from tens of thousands x tens of thousands up to billions x billions, which wouldn't even fit in main memory in a dense representation. For small problem sizes, in my experience, the storage advantage compared to dense formats is usually negated by the loss in locality and arithmetic efficiency. To some extent this is addressed by hybrid storage formats (such as Block CSR; see the sketch below), which try to take the best of both worlds and are very useful for some applications (it doesn't look like TensorFlow supports this).
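For concreteness, here is what the Block CSR (BSR) idea looks like in scipy (my illustration; the 8x8 block size is an arbitrary choice): each stored entry becomes a small dense block, so the inner loop of the matvec regains contiguous, vectorisable work.

    import numpy as np
    from scipy import sparse

    # Convert a CSR matrix to Block CSR with 8x8 dense blocks. Each block is
    # stored contiguously, trading some explicit zeros for better locality.
    a = sparse.random(4096, 4096, density=0.01, format="csr", dtype=np.float32)
    bsr = a.tobsr(blocksize=(8, 8))

    x = np.ones(4096, dtype=np.float32)
    y = bsr @ x   # the matvec now runs over dense 8x8 blocks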

In TensorFlow, I would assume the COO format is used because it is more efficient for other operations: for example, it supports O(1) updates, insertions and deletions on the data structure. It seems reasonable to trade ~50% performance in sparse matrix-vector multiplication to improve performance on those operations.
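A sketch of why COO-style updates are cheap (my illustration, not TensorFlow code): an insertion just appends to three parallel arrays, whereas CSR would have to splice into its indices/data arrays and shift every subsequent row pointer.

    # COO: insertion is an O(1) amortised append to three parallel lists.
    rows, cols, vals = [0, 0], [5, 9], [1.0, 2.0]

    def coo_insert(r, c, v):
        rows.append(r)
        cols.append(c)
        vals.append(v)

    coo_insert(2, 7, 3.5)

    # CSR: the same insertion must splice into indices/data and bump every
    # later entry of indptr, which is O(nnz + n) in the worst case.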