I am curious as to why multiplying a sparse matrix by a dense matrix takes a different amount of time than multiplying them in the reverse order. Are the algorithms significantly different?

Here's an example in Matlab R2018a:

M = 4000;                 % matrix size (results below also use M = 10000)
a = sprand(M, M, 0.01);   % M-by-M sparse matrix, 1% density
b = rand(M);              % M-by-M dense matrix
tic; ref1 = a*b; t_axb = toc   % sparse * dense
tic; ref2 = b*a; t_bxa = toc   % dense * sparse

Here's an example with Eigen 3 and C++ using 1 thread:

// prepare acol = MxM ColMajor Eigen sparse matrix with 0.01 density
...
// map the existing dense data PR as an M-by-M column-major matrix
Map<Matrix<double, Dynamic, Dynamic, ColMajor>> bcol(PR, M, M);
MatrixXd result(M, M);   // dense destination
double tic, toc;

tic = getHighResolutionTime();
result = acol * bcol;    // sparse * dense
toc = getHighResolutionTime();
printf("\nacol*bcol time: %f seconds", toc - tic);

tic = getHighResolutionTime();
result = bcol * acol;    // dense * sparse
toc = getHighResolutionTime();
printf("\nbcol*acol time: %f seconds\n", toc - tic);

When M=4000, the results are:

t_axb =
    0.6877
t_bxa =
    0.4803

acol*bcol time: 0.937590 seconds
bcol*acol time: 0.532622 seconds

When M=10000, the results are:

t_axb =
   11.5649
t_bxa =
    9.7872

acol*bcol time: 20.140380 seconds
bcol*acol time: 8.061626 seconds

In both cases, the sparse-dense product is slower than the dense-sparse product, for both Matlab and Eigen. I am curious as to:

  1. Why is this the case? Are the algorithms for the sparse-dense product significantly different from those for the dense-sparse product? The number of FLOPs is the same, right?

  2. Why does Eigen match or exceed Matlab's performance for the dense-sparse product but not for the sparse-dense product? A small difference in performance is normal, but a factor of ~1.4-1.8 seems strange given that both are highly optimized libraries. I am compiling Eigen with all the optimizations recommended in the documentation, i.e. -fPIC -fomit-frame-pointer -O3 -DNDEBUG -fopenmp -march=native.

In Matlab R2017b I get very similar times. For M=10000 I get t_bxa about 10% greater, not smaller. - Luis Mendo
Are you generating the same sparse matrices for Matlab and Eigen? - chtz
Not the same matrices, but the same sparsity and size. I've tried with the same matrices and the result is very similar. - avgn

1 Answer

You can observe the same difference by comparing column-major versus row-major storage in the sparse-matrix times dense-vector product y = A * x. If A is row-major, then each coefficient of y can be computed independently, so the rows of A can be processed in parallel with no overhead at all: no communication, no additional temporaries, no additional operations (see the sketch below). In contrast, if A is column-major, multi-threading does not come for free, and in most cases the overhead is larger than the gain.
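
To make the row-major case concrete, here is a minimal sketch of a CSR-style (row-major) kernel for y = A * x. This is not Eigen's actual implementation, and the names (spmv_csr, rowPtr, colIdx) are illustrative; it only shows why the rows parallelize for free: each iteration of the outer loop writes a distinct y[i], so a plain OpenMP pragma needs no synchronization.

#include <vector>

// Sketch only, not Eigen's kernel: y = A*x with A stored row-major (CSR).
void spmv_csr(int n,
              const std::vector<int>& rowPtr,   // size n+1; row i's nonzeros live in [rowPtr[i], rowPtr[i+1])
              const std::vector<int>& colIdx,   // column index of each nonzero
              const std::vector<double>& val,   // value of each nonzero
              const std::vector<double>& x,
              std::vector<double>& y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {             // rows are fully independent
        double sum = 0.0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            sum += val[k] * x[colIdx[k]];     // random read-only access to x
        y[i] = sum;                           // each y[i] written exactly once
    }
}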

Even without multi-threading, you see that the memory access patterns are very different:

  • Row-major: multiple random read-only accesses to x; each coefficient of y is written only once.
  • Col-major: each coefficient of x is read only once, but we get multiple random read-write accesses to the destination y.

So even without multi-threading, the situation naturally favors row-major storage.
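
For contrast, here is the column-major (CSC-style) traversal of the same product, again a sketch with illustrative names rather than Eigen's actual code. Each column scatters its contributions into y, so two threads handling different columns could collide on the same y[i]; parallelizing this form requires atomics or per-thread copies of y, which is the overhead mentioned above.

#include <algorithm>
#include <vector>

// Sketch only, not Eigen's kernel: y = A*x with A stored column-major (CSC).
void spmv_csc(int n,
              const std::vector<int>& colPtr,   // size n+1; column j's nonzeros live in [colPtr[j], colPtr[j+1])
              const std::vector<int>& rowIdx,   // row index of each nonzero
              const std::vector<double>& val,   // value of each nonzero
              const std::vector<double>& x,
              std::vector<double>& y)
{
    std::fill(y.begin(), y.end(), 0.0);
    for (int j = 0; j < n; ++j) {
        const double xj = x[j];               // each x[j] read exactly once
        for (int k = colPtr[j]; k < colPtr[j + 1]; ++k)
            y[rowIdx[k]] += val[k] * xj;      // random read-write scatter into y
    }
}

Since a sparse-dense matrix product is essentially this kernel applied once per column of the dense factor, the same access-pattern argument carries over to the acol*bcol timings above.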