2
votes

I'm trying to find the dot product between a huge matrix and itself.

Shape of the matrix: (371744, 36154). Number of non-zeros: 577731 [very sparse]

mat1 is a scipy.sparse.csr_matrix. If I use mat1 * mat1.T I get a ValueError. This looks like it's because there are too many non-zero elements in the resulting matrix and the index pointer overflows, according to here

    dp_data = data_m * data_m.T
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 247, in __mul__
    return self._mul_sparse_matrix(other)
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 300, in _mul_sparse_matrix
    return self.tocsr()._mul_sparse_matrix(other)
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 290, in _mul_sparse_matrix
    indices = np.empty(nnz, dtype=np.intc)
ValueError: negative dimensions are not allowed

I also tried np.dot

But the doc says,
"As of NumPy 1.7, np.dot is not aware of sparse matrices, therefore using it will result on unexpected results or errors. The corresponding dense matrix should be obtained first instead"

When I try mat1.toarray() or todense() I get a MemoryError, as the matrix is huge! I have 16GB of memory! The program works fine for smaller inputs!

    data_array = data_m.toarray()
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 550, in toarray
    return self.tocoo(copy=False).toarray()
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 219, in toarray
    B = np.zeros(self.shape, dtype=self.dtype)
MemoryError

I'm using NumPy version 1.8.1 and SciPy version 0.9.0.

How else do I do this multiplication?

This is a known issue (you have more non-zero items than fit in a 32-bit int) that was solved in a recent SciPy version. Update to the latest and you'll be on your way. – Jaime
That worked, but I still need the huge dot product! There's no way I can fit this in memory; I get a MemoryError now! – ramgo
How about buying more RAM? – Jaime
Pardon in advance if my recollection of LinAlg is faulty. Might you break the problem into smaller chunks that could be recombined later? Wouldn't AB be the same as the sum over i of A·bi, where bi is the ith column of B? You could extend that and build matrices from subsets of B (maybe a quarter of the columns at a time). No idea how efficient this would be, but I think it would get the job done. It would also lend itself well to parallel computation. – Dan
Have you tried using the dot method of the sparse matrix instead of np.dot? – Gabriel

2 Answers

3
votes

Call the dot product as a method of the sparse matrix:

dp_data = data_m.dot(data_m.T)

numpy.dot is a plain NumPy function that is unaware of your matrix's sparsity, whereas scipy.sparse.csr_matrix.dot is a method tailored to your matrix type and thus uses a sparse algorithm.
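A minimal sketch of the difference on a tiny CSR matrix (the values here are invented for illustration; the real matrix in the question is of course far larger):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small stand-in for the huge matrix in the question.
a = csr_matrix(np.array([[1, 0, 2],
                         [0, 0, 3]]))

# Sparse-aware product: the computation and the result stay sparse.
gram = a.dot(a.T)
print(gram.toarray())

# On this tiny example we can afford to check against the dense result.
dense = a.toarray().dot(a.toarray().T)
assert (gram.toarray() == dense).all()
```

The same `a.dot(a.T)` call scales to the full-size matrix because no dense intermediate is ever formed.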

1
votes

First of all, the size of the sparse output matrix and the amount of CPU work the calculation requires depend on the structure of your sparse matrix. If there are a lot of empty rows, things become easier. If, on the other hand, your data is evenly distributed, there is a lot to calculate.

The first thing to realize is that in this specific case (a * a.T) you are essentially calculating the dot product of each row (a 36154-element vector) with every row, including itself. Because the result is symmetric, you only need the upper triangle, which cuts the work in half. (That still leaves roughly 69 billion, i.e. 371744²/2, vector dot products.)

Now, much depends on whether you are in a hurry or not. If you are (performance is important), there may be ways to make the calculation faster depending on how the non-zeros are distributed in your data.

A rather straightforward algorithm is as follows:

    # rows[r] is a dict {column_index: value} of the non-zeros in row r of a
    def sparse_dot(u, v):
        # dot product via the intersection of the non-zero column sets
        return sum(val * v[k] for k, val in u.items() if k in v)

    for r in range(n_row):
        if not rows[r]:                  # all elements in the row are zero
            continue
        for c in range(r, n_row):
            d = sparse_dot(rows[r], rows[c])
            if d != 0:
                result[r, c] = d         # the result is symmetric,
                result[c, r] = d         # so fill both halves

The dot product is cheap to compute by intersecting the sets of non-zero column indices of the two rows. Most of the time the intersection is empty, and real work is only needed when it is not.

Depending on the number of empty rows in your matrix this may be relatively fast. On the other hand, if there are no empty rows or columns, the time goes into calculating roughly 69 billion set intersections. (In that case most intersections are between single-element sets, so they reduce to simple comparisons.)

This method requires little memory but still a lot of time, from hours to days, unless there are a lot of empty rows.
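If the result itself is too large to hold in memory at once, a compromise is to let SciPy compute a * a.T one row slice at a time, saving or reducing each slice before moving on. This is my own sketch, not something from the thread; `gram_blocks` and `block_rows` are names I made up:

```python
import numpy as np
from scipy.sparse import csr_matrix

def gram_blocks(a, block_rows=1000):
    """Yield (start_row, slice) pairs of a * a.T, one row block at a time.

    Each yielded slice is sparse, so the full product never has to be
    held in memory; write each slice to disk or reduce it immediately.
    """
    at = a.T.tocsr()
    for start in range(0, a.shape[0], block_rows):
        yield start, a[start:start + block_rows].dot(at)

# Tiny demonstration matrix standing in for the huge one.
a = csr_matrix(np.array([[1, 0, 2],
                         [0, 0, 3],
                         [4, 0, 0]]))
for start, block in gram_blocks(a, block_rows=2):
    print(start, block.toarray())
```

Stacking all the slices back together reproduces the full product, so the block size only trades peak memory against the number of sparse multiplications.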