How to compute Dot product between very large scipy sparse matrix

Question

I'm trying to find the dot product between a huge matrix and itself.

Shape of the matrix (371744, 36154) Num of NonZero - 577731 [very sparse]

mat1 is scipy.sparse.csr_matrix If i use mat1 * mat1.T I get a value error, this looks like its because there are too many non-zero elements in the resulting matrix and the index pointer overflows according to here

    dp_data = data_m * data_m.T
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 247, in __mul__
    return self._mul_sparse_matrix(other)
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 300, in _mul_sparse_matrix
    return self.tocsr()._mul_sparse_matrix(other)
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 290, in _mul_sparse_matrix
    indices = np.empty(nnz, dtype=np.intc)
ValueError: negative dimensions are not allowed

I also tried np.dot

But the doc says,
"As of NumPy 1.7, np.dot is not aware of sparse matrices, therefore using it will result on unexpected results or errors. The corresponding dense matrix should be obtained first instead"

when I to mat1.toarray() or todense() I get a memory error as the matrix is huge!! I have 16GB of memory! The program seems to work fine for smaller inputs!

    data_array = data_m.toarray()
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 550, in toarray
    return self.tocoo(copy=False).toarray()
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 219, in toarray
    B = np.zeros(self.shape, dtype=self.dtype)
MemoryError

I'm Using Numpy version 1.8.1 Numpy version 0.9.0

How else do I do this multiplication?

This is a known issue (you have more non-zero items that fit in a 32 bit int) that was solved in a recent scipy version. Update to the latest and you'll be on your way. — Jaime
That worked, but still I need the huge dot product! No way I can fit this in memory, I get memory error now! — ramgo
Pardon in advance if my recollection of LinAlg is faulty. Might you break up the problem into smaller chunks that could be recombined later? Wouldn't AB be the same as the sum over i of Abi, where bi is the ith col of B? You could extend that and make matrices from subsets of B (maybe 1/4th of the columns at a time). No idea how efficient this would be, but I think that it would get the job done. Would also be terribly prone to parallel computation. — Dan
have you tried using the dot attribute of the sparse matrix instead of np.dot? — Gabriel

ChrisM ChrisM · Accepted Answer · 2015-05-13T22:35:33

Call the dot product as a method of the sparse matrix:

dp_data = data_m.dot(data_m)

numpy.dot is a Universal Function that is unaware of your matrix's sparsity, whereas scipy.sparse.csc_matrix.dot is a Method that is tailored for your matrix type and, thus, uses a sparse algorithm.

How to compute Dot product between very large scipy sparse matrix

2 Answers