3
votes

I have two (scipy) CSR sparse matrices:

A (12414693, 235470)
B (235470, 48063)

Performing:

A.dot(B)

causes a segmentation fault.

What am I doing wrong?

EDIT

I've submitted a bug to the scipy developer community: https://github.com/scipy/scipy/issues/3212

2
But these are sparse matrices, and so is the result. In fact, the result is supposed to be smaller than A (in terms of dimensions).Omer
As @larsmans suggested in a comment that is now gone, the product of sparse matrices can result in drastic fill-in and give a densely populated matrix. That might not be the problem here. The seg. fault indicates a bug somewhere in the scipy code. It could be the result of a mishandled memory allocation failure, or something else.Warren Weckesser
Matrix A is a row-matrix of "one hot" vectors, so the result shouldn't be too dense. Is this something worth reporting to the scipy developer community?Omer
Yes, definitely; click on the green "New Issue" button here: github.com/scipy/scipy/issues. Be sure to mention the version of scipy that you are using (import scipy; scipy.__version__).Warren Weckesser
Extreme case: in your "one hot" encoding in A, suppose all the 1s are in column k. And suppose in B, all the values in row k are nonzero, and those are the only nonzero values. Then both A and B are very sparse, but the product is fully populated with nonzero values.Warren Weckesser

2 Answers

4
votes

Your problem is very likely being caused by an overflow of an index stored in an int32, caused by the result of your dot product having more than 2^31 non-zero entries. Try the following...

>>> import scipy.sparse
>>> c = np.empty_like(A.indptr)
>>> scipy.sparse.sparsetools.csr_matmat_pass1(A.shape[0], B.shape[1], A.indptr,
                                              A.indices, B.indptr, B.indices, c)
>>> np.all(np.diff(c) >= 0)

With your data, The array c is a vector of 12414693 + 1 items, holding the accumulated number of non-zero entries per row in the product of your two matrices, i.e. it is what C.indptr will be if C = A.dot(B) finishes successfully. It is of type np.int32, even on 64 bit platforms, which is not good. If your sparse matrix is too large, there will be an overflow, that last line will return False, and the arrays to store the result of the matrix product will be instantiated with a wrong size (the last item of c, which is very likely to be a negative number if an overflow did happen). If that's the case, then yes, file a bug report...

2
votes

This link may be helpful: < http://blog.newsle.com/2013/02/01/text-classification-and-feature-hashing-sparse-matrix-vector-multiplication-in-cython/ >. The product of these will be too large. I'm not sure if the advice of the article applies for you, but you may try organizing the second matrix as a CSC type.