Edit: See this question where I learned how to parallelize sparse matrix-vector multiplication in Python using Numba, and was able to tie with Matlab.
Original question:
I'm observing that sparse matrix-vector multiplication is about 4 or 5 times faster in Matlab than in Python (using scipy sparse matrices). Here are some details from the Matlab command line:
>> whos A
  Name        Size                    Bytes  Class     Attributes

  A       47166x113954            610732376  double    sparse

>> whos ATrans
  Name          Size                    Bytes  Class     Attributes

  ATrans    113954x47166            610198072  double    sparse
>> nnz(A)/numel(A)
ans =
0.0071
>> whos x
  Name        Size              Bytes  Class     Attributes

  x       113954x1             911632  double
>> myFun = @() A*x; timeit(myFun)
ans =
0.0601
>> myFun = @() ATrans'*x; timeit(myFun)
ans =
0.0120
The matrix ATrans is the transpose of A. Notice that computing A times x in Matlab takes about 0.06 seconds, but if I use the weird "transpose trick" and compute ATrans' times x (which yields the same result as A times x), the computation takes 0.012 seconds. (I do not understand why this trick works.)
Here are some timing results from the Python command line:
In[50]: type(A)
Out[50]: scipy.sparse.csc.csc_matrix

In[51]: A.shape
Out[51]: (47166, 113954)

In[52]: type(x)
Out[52]: numpy.ndarray

In[53]: x.shape
Out[53]: (113954L, 1L)

In[54]: timeit.timeit('A.dot(x)',setup="from __main__ import A,x",number=100)/100.0
Out[54]: 0.054835451461831324
So the Python runtime to do A times x is about 4.5 times longer than the Matlab runtime, for the same matrix A. If I store A in csr format, the Python runtime is a little bit worse:
In[63]: A = sp.sparse.csr_matrix(A)

In[64]: timeit.timeit('A.dot(x)',setup="from __main__ import A,x",number=100)/100.0
Out[64]: 0.0722580226496575
Here's some info about the Python and Anaconda versions I'm using:
In[2]: import sys; print('Python %s on %s' % (sys.version, sys.platform))
Python 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)] on win32
Question: Why is this sparse matrix-vector multiplication faster in Matlab than in Python? And how can I make it equally fast in Python?
Edit 1: Here is a clue. In Python, if I set the number of threads to 1, the runtime for dense matrix-vector multiplication increases severely, but the runtime for sparse matrix-vector multiplication is nearly unchanged.
In[48]: M = np.random.rand(1000,1000)

In[49]: y = np.random.rand(1000,1)

In[50]: import mkl

In[51]: mkl.get_max_threads()
Out[51]: 20

In[52]: timeit.timeit('M.dot(y)', setup = "from __main__ import M,y", number=100) / 100.0
Out[52]: 7.232593519574948e-05

In[53]: mkl.set_num_threads(1)

In[54]: timeit.timeit('M.dot(y)', setup = "from __main__ import M,y", number=100) / 100.0
Out[54]: 0.00044465965093536396

In[56]: type(A)
Out[56]: scipy.sparse.csc.csc_matrix

In[57]: timeit.timeit('A.dot(x)', setup = "from __main__ import A,x", number=100) / 100.0
Out[57]: 0.055780856886028685

In[58]: mkl.set_num_threads(20)

In[59]: timeit.timeit('A.dot(x)', setup = "from __main__ import A,x", number=100) / 100.0
Out[59]: 0.05550840215802509
So for dense matrix-vector products, reducing the number of threads from 20 to 1 increased the runtime by a factor of about 6. But for the sparse matrix-vector product, reducing the number of threads to 1 did not change the runtime.
I think this suggests that in Python the sparse matrix-vector multiplication is not being performed in parallel, whereas dense matrix-vector multiplication is making use of all available cores. Do you agree with this conclusion? And if so, is there a way to make use of all available cores for sparse matrix-vector multiplication in Python?
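For reference, here is a minimal sketch of the Numba approach that the edit at the top refers to: a CSR mat-vec kernel with the row loop parallelized via prange. The function name and calling convention here are my own, not scipy's.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def csr_matvec(data, indices, indptr, x, out):
    # Each row's dot product is independent of the others,
    # so the outer loop can be split across all cores.
    for i in prange(out.shape[0]):
        s = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            s += data[k] * x[indices[k]]
        out[i] = s

# Usage with a scipy CSR matrix A and a 1-D vector x:
# out = np.empty(A.shape[0])
# csr_matvec(A.data, A.indices, A.indptr, x.ravel(), out)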
Note that Anaconda is supposed to use Intel MKL optimizations by default: https://www.continuum.io/blog/developer-blog/anaconda-25-release-now-mkl-optimizations
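One quick way to check is to inspect numpy's build configuration; on an MKL-enabled Anaconda install, the listed BLAS/LAPACK entries should mention mkl:

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was linked against.
np.__config__.show()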
Edit 2: I read here that "For sparse matrices, all level 2 [BLAS] operations except for the sparse triangular solvers are threaded" in the Intel MKL. This suggests to me that scipy is not using Intel MKL to perform sparse matrix-vector multiplication. It seems that @hpaulj (in an answer posted below) has confirmed this conclusion by inspecting the code for the function csr_matvec. So, can I just call the Intel MKL sparse matrix-vector multiplication function directly? How would I do that?
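Here is a rough ctypes sketch of what calling MKL's CSR mat-vec routine mkl_cspblas_dcsrgemv directly might look like. The library filename is platform-dependent, and I haven't verified that this path is actually faster; treat it as a starting point, not a tested solution.

import numpy as np
from ctypes import CDLL, POINTER, byref, c_char, c_double, c_int

# The MKL runtime library name varies: 'mkl_rt.dll' on Windows,
# 'libmkl_rt.so' on Linux; it must be on the loader's search path.
mkl = CDLL('mkl_rt.dll')

def mkl_csr_matvec(A, x):
    # y = A*x using MKL's zero-based-index CSR routine. Assumes A is a
    # scipy CSR matrix with float64 data and int32 indices, and x is a
    # contiguous 1-D float64 array.
    y = np.empty(A.shape[0], dtype=np.float64)
    mkl.mkl_cspblas_dcsrgemv(
        byref(c_char(b'N')),                          # 'N' = no transpose
        byref(c_int(A.shape[0])),
        A.data.ctypes.data_as(POINTER(c_double)),
        A.indptr.ctypes.data_as(POINTER(c_int)),
        A.indices.ctypes.data_as(POINTER(c_int)),
        x.ctypes.data_as(POINTER(c_double)),
        y.ctypes.data_as(POINTER(c_double)))
    return y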
Edit 3: Here is an additional piece of evidence. The Matlab sparse matrix-vector multiplication runtime appears to be unchanged when I set the maximum number of threads equal to 1.
>> maxNumCompThreads
ans =
20
>> myFun = @() ATrans'*x; timeit(myFun)
ans =
0.012545604076342
>> maxNumCompThreads(1) % set number of threads to 1 (returns the previous value)
ans =
20
>> maxNumCompThreads % Check that max number of threads is 1
ans =
1
>> myFun = @() ATrans'*x; timeit(myFun)
ans =
0.012164191957568
This makes me question my previous theory that Matlab's advantage is due to multithreading.
Comment from hpaulj: A dense dot delegates the task to BLAS or MKL libraries right away. The sparse calculation uses its own code, probably written in Cython. We'd have to examine that code to see whether it delegates any action to math libraries.