Just for the record, you say matrix multiplication, but one of your matrices is in fact a vector (1 x 3196). So this would make it a matrix-vector multiplication (different BLAS kernel). I will assume you mean matrix-vector multiplication for my answer.
Yes, CSR should theoretically be faster than COO for matrix-vector multiplication; this is because the storage size in CSR format is O(2nnz + n)
vs O(3nnzs)
and the sparse matrix vector multiplication is in many cases memory bound.
The exact performance difference compared to a dense matrix multiplication varies though based on the problem size, sparsity pattern, data type and implementation. It is difficult to say off the bat which should be faster, because the sparse storage format introduces indirection, which potentially leads to reduced locality and poor(er) utilisation of arithmetic units (e.g. no use of vectorisation).
Particularly when the matrix and vector size are so small that almost everything fits in cache, I would expect limited performance benefits. Sparse matrix structures are typically more useful for truly large matrices, ranging from 10sK x 10sK to 1B x 1B, which wouldn't even fit in main memory using a dense representation. For small problem sizes, in my experience, the storage advantage compared to dense formats is usually negated by the loss in locality and arithmetic efficiency. To some extent this is addressed by hybrid storage formats (such as Block CSR) which try to take the best of both worlds, and are very useful for some applications (doesn't look like tensorflow
supports this).
In tensorflow
, I would assume the COO format is used because it is more efficient for other operations, for example it supports O(1)
updates, insertions and deletions from the data structure. It seems reasonable to trade ~50% performance in sparse matrix-vector multiply to improve performance on these operations.