7
votes

I have a number of scipy sparse matrices (currently in CSR format) that I need to multiply with a dense numpy 1D vector. The vector is called G:

print G.shape, G.dtype
(2097152,) complex64

Each sparse matrix has shape (16384,2097152) and is very sparse. The density is approximately 4.0e-6. I have a list of 100 of these sparse matrices called spmats.

I can easily multiply each matrix with G like so:

res = [spmat.dot(G) for spmat in spmats]

This results in a list of dense vectors of shape (16384,) as expected.

My application is fairly performance-critical, so I tried an alternative, which is to first concatenate all the sparse matrices into one large sparse matrix, and then use only one call of dot() like so:

import scipy.sparse as sp
SPMAT = sp.vstack(spmats, format='csr')
RES = SPMAT.dot(G)

This results in one long vector RES which has shape (1638400,), and is a concatenated version of all the result vectors in res above, as expected. I've checked that the results are identical.
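
For reference, the identity check amounts to comparing the concatenation of res against RES, along these lines (one possible way to do it):

import numpy as n
assert n.allclose(n.concatenate(res), RES)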

Maybe I'm completely wrong, but I expected that the second case should be faster than the first, since there are far fewer numpy calls, memory allocations, creation of python objects, python loops, etc. I don't care about the time required to concatenate the sparse matrices, only the time to compute the result. According to %timeit, however:

%timeit res = [spmat.dot(G) for spmat in spmats]
10 loops, best of 3: 91.5 ms per loop
%timeit RES = SPMAT.dot(G)
1 loops, best of 3: 389 ms per loop

I've checked that I'm not running out of memory in either operation, and nothing fishy seems to be going on. Am I crazy, or is this really strange? Does this mean that all sparse matrix-vector products should be done in blocks, a few rows at a time, to make them faster (see the sketch below for what I mean by blocks)? As far as I understand, the time for a sparse matrix-dense vector product should be linear in the number of nonzero elements, which is unchanged between the two cases above. What could be making such a difference?
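
By "in blocks" I mean something along these lines (rough sketch; blocked_dot and block_rows are just illustrative names I'm making up here, not anything from scipy):

import numpy as n

def blocked_dot(A, x, block_rows=16384):
    # Multiply the CSR matrix A by the dense vector x one block of rows at a time.
    # Row slicing a CSR matrix is cheap, so this mimics the list-of-matrices case
    # without keeping 100 separate matrices around.
    out = n.empty(A.shape[0], dtype=n.result_type(A.dtype, x.dtype))
    for start in range(0, A.shape[0], block_rows):
        stop = min(start + block_rows, A.shape[0])
        out[start:stop] = A[start:stop, :].dot(x)
    return out

# blocked_dot(SPMAT, G) should give the same vector as SPMAT.dot(G)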

I'm running on a single-core Linux machine with 4 GB of RAM, using EPD 7.3.

EDIT:

Here is a small example that reproduces the problem for me:

import scipy.sparse as sp
import numpy as n

G = n.random.rand(128**3) + 1.0j*n.random.rand(128**3)

spmats = [sp.rand(128**2, 128**3, density=4e-6, format='csr', dtype=n.float64) for i in range(100)]
SPMAT = sp.vstack(spmats, format='csr')

%timeit res = [spmat.dot(G) for spmat in spmats]
%timeit RES = SPMAT.dot(G)

I get:

1 loops, best of 3: 704 ms per loop
1 loops, best of 3: 1.34 s per loop

The performance difference in this case is not as large as with my own sparse matrices, which have some structure (probably because of caching), but it's still worse to concatenate the matrices.
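
Note that in this example G ends up as complex128, whereas my real G is complex64. To match that, one could cast it, e.g. (not reflected in the timings above):

G = G.astype(n.complex64)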

I have tried this with both scipy 0.10.1 and 0.12.0.

1
I can't reproduce that: the single dot product is 5 times faster for me. Since I carefully tried to do what you describe, can you please post a minimal working example to make sure we're doing the same things? – jorgeca
@jorgeca Thanks for taking the time to try and reproduce the problem. I've just edited my question with a working example of what I'm doing. – ali_p
Thanks. I can't reproduce your results (in scipy 0.12), but for me the list comprehension is 5x (!) slower when G has dtype=np.complex64 as you said, and both approaches are equally fast when G has dtype=np.complex128. – jorgeca
I cannot reproduce this with any dtype: on my machine (w/ Scipy 0.12.0) both loops are roughly the same speed for float32, float64, complex64, complex128. One thing that can cause machine-dependent speed differences like this is processor cache efficiency. But it is not clear to me why it would affect this particular case. The relevant inner loop is here; Python overhead is probably unimportant. Note also that datatypes other than G.dtype==complex128 require extra coercion of the spmat data to complex128. – pv.
In my case, I guess the complex64 slowdown stems from that extra coercion (spmats[0].dot(G) takes 4.65 ms for complex128 and 22.7 ms for complex64). If I change 128 to a smaller value the difference vanishes, then grows (up to 20 times slower for 70) and eventually stabilizes around 5-6x slower. – jorgeca

1 Answer

4
votes

I haven't found a reason for the strange behaviour mentioned in the question. However, I have found a way to speed up my computation significantly, which may be useful to other people.

Because in my particular case I'm computing the product of a float32 sparse matrix and a complex64 dense vector, I can multiply the real and imaginary components separately. This provides me with a 4x speedup.

This takes 2.35s with SPMAT.shape == (16384000, 2097152):

RES = SPMAT.dot(G)

While this takes only 541ms:

RES = n.zeros((SPMAT.shape[0],), dtype=n.complex64)
RES.real = SPMAT.dot(G.real)
RES.imag = SPMAT.dot(G.imag)

And the result is identical. I think the n.zeros preallocation may not be necessary, but I don't know how else to do this.
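
If the preallocation really is unnecessary, one alternative might be something along these lines (untested sketch; it relies on numpy promoting the float32 products to complex64 when combined with the Python complex scalar, which is worth double-checking, and it allocates temporaries that the version above avoids):

RES = SPMAT.dot(G.real) + 1j * SPMAT.dot(G.imag)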