Edit: See this question where I learned how to parallelize sparse matrix-vector multiplication in Python using Numba, and was able to tie with Matlab.
Original question:
I'm observing that sparse matrix-vector multiplication is about 4 or 5 times faster in Matlab than in Python (using scipy sparse matrices). Here are some details from the Matlab command line:
>> whos A
  Name        Size                    Bytes  Class     Attributes

  A       47166x113954            610732376  double    sparse

>> whos ATrans
  Name          Size                    Bytes  Class     Attributes

  ATrans    113954x47166            610198072  double    sparse
>> nnz(A)/numel(A)
ans =
0.0071
>> whos x
  Name        Size              Bytes  Class     Attributes

  x       113954x1             911632  double
>> myFun = @() A*x; timeit(myFun)
ans =
0.0601
>> myFun = @() ATrans'*x; timeit(myFun)
ans =
0.0120
The matrix ATrans is the transpose of A. Notice that computing A times x in Matlab takes about 0.06 seconds, but if I use the weird "transpose trick" and compute ATrans' times x (which yields the same result as A times x), the computation takes 0.012 seconds. (I do not understand why this trick works.)
Here are some timing results from the Python command line:
In[50]: type(A)
Out[50]: scipy.sparse.csc.csc_matrix

In[51]: A.shape
Out[51]: (47166, 113954)

In[52]: type(x)
Out[52]: numpy.ndarray

In[53]: x.shape
Out[53]: (113954L, 1L)

In[54]: timeit.timeit('A.dot(x)',setup="from __main__ import A,x",number=100)/100.0
Out[54]: 0.054835451461831324
So the Python runtime to do A times x is about 4.5 times longer than the Matlab runtime, for the same matrix A. If I store A in csr format, the Python runtime is a little bit worse:
In[63]: A = sp.sparse.csr_matrix(A)

In[64]: timeit.timeit('A.dot(x)',setup="from __main__ import A,x",number=100)/100.0
Out[64]: 0.0722580226496575
Here's some info about the Python and Anaconda versions I'm using:
In[2]: import sys; print('Python %s on %s' % (sys.version, sys.platform))
Python 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)] on win32
Question: Why is this sparse matrix-vector multiplication faster in Matlab than in Python? And how can I make it equally fast in Python?
Edit 1: Here is a clue. In Python, if I set the number of threads to 1, the runtime for dense matrix-vector multiplication increases severely, but the runtime for sparse matrix-vector multiplication is nearly unchanged.
In[48]: M = np.random.rand(1000,1000)

In[49]: y = np.random.rand(1000,1)

In[50]: import mkl

In[51]: mkl.get_max_threads()
Out[51]: 20

In[52]: timeit.timeit('M.dot(y)', setup = "from __main__ import M,y", number=100) / 100.0
Out[52]: 7.232593519574948e-05

In[53]: mkl.set_num_threads(1)

In[54]: timeit.timeit('M.dot(y)', setup = "from __main__ import M,y", number=100) / 100.0
Out[54]: 0.00044465965093536396

In[56]: type(A)
Out[56]: scipy.sparse.csc.csc_matrix

In[57]: timeit.timeit('A.dot(x)', setup = "from __main__ import A,x", number=100) / 100.0
Out[57]: 0.055780856886028685

In[58]: mkl.set_num_threads(20)

In[59]: timeit.timeit('A.dot(x)', setup = "from __main__ import A,x", number=100) / 100.0
Out[59]: 0.05550840215802509
So for dense matrix-vector products, reducing the number of threads from 20 to 1 increased the runtime by a factor of about 6. But for the sparse matrix-vector product, reducing the number of threads to 1 did not change the runtime.
I think this suggests that in Python the sparse matrix-vector multiplication is not being performed in parallel, whereas dense matrix-vector multiplication is making use of all available cores. Do you agree with this conclusion? And if so, is there a way to make use of all available cores for sparse matrix-vector multiplication in Python?
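For reference, here is a minimal sketch of the Numba approach that the edit at the top refers to: a CSR mat-vec kernel with the row loop parallelized via prange. The function name and calling convention here are my own, not scipy's.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def csr_matvec(data, indices, indptr, x, out):
    # Each row's dot product is independent of the others,
    # so the outer loop can be split across all cores.
    for i in prange(out.shape[0]):
        s = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            s += data[k] * x[indices[k]]
        out[i] = s

# Usage with a scipy CSR matrix A and a 1-D vector x:
# out = np.empty(A.shape[0])
# csr_matvec(A.data, A.indices, A.indptr, x.ravel(), out)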
Note that Anaconda is supposed to use Intel MKL optimizations by default: https://www.continuum.io/blog/developer-blog/anaconda-25-release-now-mkl-optimizations
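One quick way to check is to inspect numpy's build configuration; on an MKL-enabled Anaconda install, the listed BLAS/LAPACK entries should mention mkl:

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was linked against.
np.__config__.show()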
Edit 2: I read here that "For sparse matrices, all level 2 [BLAS] operations except for the sparse triangular solvers are threaded" in the Intel MKL. This suggests to me that scipy is not using Intel MKL to perform sparse matrix-vector multiplication. It seems that @hpaulj (in an answer posted below) has confirmed this conclusion by inspecting the code for the function csr_matvec. So, can I just call the Intel MKL sparse matrix-vector multiplication function directly? How would I do that?
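Here is a rough ctypes sketch of what calling MKL's CSR mat-vec routine mkl_cspblas_dcsrgemv directly might look like. The library filename is platform-dependent, and I haven't verified that this path is actually faster; treat it as a starting point, not a tested solution.

import numpy as np
from ctypes import CDLL, POINTER, byref, c_char, c_double, c_int

# The MKL runtime library name varies: 'mkl_rt.dll' on Windows,
# 'libmkl_rt.so' on Linux; it must be on the loader's search path.
mkl = CDLL('mkl_rt.dll')

def mkl_csr_matvec(A, x):
    # y = A*x using MKL's zero-based-index CSR routine. Assumes A is a
    # scipy CSR matrix with float64 data and int32 indices, and x is a
    # contiguous 1-D float64 array.
    y = np.empty(A.shape[0], dtype=np.float64)
    mkl.mkl_cspblas_dcsrgemv(
        byref(c_char(b'N')),                          # 'N' = no transpose
        byref(c_int(A.shape[0])),
        A.data.ctypes.data_as(POINTER(c_double)),
        A.indptr.ctypes.data_as(POINTER(c_int)),
        A.indices.ctypes.data_as(POINTER(c_int)),
        x.ctypes.data_as(POINTER(c_double)),
        y.ctypes.data_as(POINTER(c_double)))
    return y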
Edit 3: Here is an additional piece of evidence. The Matlab sparse matrix-vector multiplication runtime appears to be unchanged when I set the maximum number of threads equal to 1.
>> maxNumCompThreads
ans =
20
>> myFun = @() ATrans'*x; timeit(myFun)
ans =
0.012545604076342
>> maxNumCompThreads(1) % set number of threads to 1 (returns the previous value)
ans =
20
>> maxNumCompThreads % Check that max number of threads is 1
ans =
1
>> myFun = @() ATrans'*x; timeit(myFun)
ans =
0.012164191957568
This makes me question my previous theory that Matlab's advantage is due to multithreading.
Comment from hpaulj: A dense dot delegates the task to BLAS or MKL libraries right away. The sparse calculation uses its own code, probably written in Cython. We'd have to examine that code to see whether it delegates any action to math libraries.