I am curious as to why multiplying a sparse matrix by a dense matrix takes a different amount of time than multiplying them in the reverse order. Are the algorithms significantly different?

Here's an example in Matlab R2018a:

M = 4000;                 % matrix size (results below also use M = 10000)
a = sprand(M, M, 0.01);   % M-by-M sparse matrix, 1% density
b = rand(M);              % M-by-M dense matrix
tic; ref1 = a*b; t_axb = toc   % sparse * dense
tic; ref2 = b*a; t_bxa = toc   % dense * sparse

Here's an example with Eigen 3 and C++ using 1 thread:

// prepare acol = MxM ColMajor Eigen sparse matrix with 0.01 density
...
// map the existing dense data PR as an M-by-M column-major matrix
Map<Matrix<double, Dynamic, Dynamic, ColMajor>> bcol(PR, M, M);
MatrixXd result(M, M);   // dense destination
double tic, toc;

tic = getHighResolutionTime();
result = acol * bcol;    // sparse * dense
toc = getHighResolutionTime();
printf("\nacol*bcol time: %f seconds", toc - tic);

tic = getHighResolutionTime();
result = bcol * acol;    // dense * sparse
toc = getHighResolutionTime();
printf("\nbcol*acol time: %f seconds\n", toc - tic);

When M=4000, the results are:

t_axb =
    0.6877
t_bxa =
    0.4803

acol*bcol time: 0.937590 seconds
bcol*acol time: 0.532622 seconds

When M=10000, the results are:

t_axb =
   11.5649
t_bxa =
    9.7872

acol*bcol time: 20.140380 seconds
bcol*acol time: 8.061626 seconds

In both cases, the sparse-dense product is slower than the dense-sparse product, for both Matlab and Eigen. I am curious as to:

  1. Why is this the case? Are the algorithms for the sparse-dense product significantly different from those for the dense-sparse product? The number of FLOPs is the same, right?

  2. Why does Eigen match or exceed Matlab's performance for the dense-sparse product but not for the sparse-dense product? A small difference in performance is normal, but a factor of ~1.4-1.8 seems strange given that both are highly optimized libraries. I am compiling Eigen with all the optimizations recommended in the documentation, i.e. -fPIC -fomit-frame-pointer -O3 -DNDEBUG -fopenmp -march=native.

In Matlab R2017b I get very similar times. For M=10000 I get t_bxa about 10% greater, not smaller. - Luis Mendo
Are you generating the same sparse matrices for Matlab and Eigen? - chtz
Not the same matrices, but the same sparsity and size. I've tried with the same matrices and the result is very similar. - avgn

1 Answer

You can observe the same difference by comparing column-major versus row-major storage in the sparse-matrix times dense-vector product y = A * x. If A is row-major, then each coefficient of y can be computed independently, so the rows of A can be processed in parallel with no overhead at all: no communication, no additional temporaries, no additional operations (see the sketch below). In contrast, if A is column-major, multi-threading does not come for free, and in most cases the overhead is larger than the gain.
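
To make the row-major case concrete, here is a minimal sketch of a CSR-style (row-major) kernel for y = A * x. This is not Eigen's actual implementation, and the names (spmv_csr, rowPtr, colIdx) are illustrative; it only shows why the rows parallelize for free: each iteration of the outer loop writes a distinct y[i], so a plain OpenMP pragma needs no synchronization.

#include <vector>

// Sketch only, not Eigen's kernel: y = A*x with A stored row-major (CSR).
void spmv_csr(int n,
              const std::vector<int>& rowPtr,   // size n+1; row i's nonzeros live in [rowPtr[i], rowPtr[i+1])
              const std::vector<int>& colIdx,   // column index of each nonzero
              const std::vector<double>& val,   // value of each nonzero
              const std::vector<double>& x,
              std::vector<double>& y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {             // rows are fully independent
        double sum = 0.0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            sum += val[k] * x[colIdx[k]];     // random read-only access to x
        y[i] = sum;                           // each y[i] written exactly once
    }
}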

Even without multi-threading, you see that the memory access patterns are very different:

  • Row-major: multiple random read-only accesses to x; each coefficient of y is written only once.
  • Col-major: each coefficient of x is read only once, but we get multiple random read-write accesses to the destination y.

So even without multi-threading, the situation naturally favors row-major storage.
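
For contrast, here is the column-major (CSC-style) traversal of the same product, again a sketch with illustrative names rather than Eigen's actual code. Each column scatters its contributions into y, so two threads handling different columns could collide on the same y[i]; parallelizing this form requires atomics or per-thread copies of y, which is the overhead mentioned above.

#include <algorithm>
#include <vector>

// Sketch only, not Eigen's kernel: y = A*x with A stored column-major (CSC).
void spmv_csc(int n,
              const std::vector<int>& colPtr,   // size n+1; column j's nonzeros live in [colPtr[j], colPtr[j+1])
              const std::vector<int>& rowIdx,   // row index of each nonzero
              const std::vector<double>& val,   // value of each nonzero
              const std::vector<double>& x,
              std::vector<double>& y)
{
    std::fill(y.begin(), y.end(), 0.0);
    for (int j = 0; j < n; ++j) {
        const double xj = x[j];               // each x[j] read exactly once
        for (int k = colPtr[j]; k < colPtr[j + 1]; ++k)
            y[rowIdx[k]] += val[k] * xj;      // random read-write scatter into y
    }
}

Since a sparse-dense matrix product is essentially this kernel applied once per column of the dense factor, the same access-pattern argument carries over to the acol*bcol timings above.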