Optimize Scipy Sparse Matrix

Question

I have a sparse matrix where I'm currently enumerating over each row and performing some calculations based on the information from each row. Each row is completely independent of the others. However, for large matrices, this code is extremely slow (takes about 2 hours) and I can't convert the matrix to a dense one either (limited to 8GB RAM).

import scipy.sparse
import numpy as np

def process_row(a, b):
    """
    a - contains the row indices for a sparse matrix
    b - contains the column indices for a sparse matrix

    Returns a new vector of length(a)
    """

    return

def assess(mat):
    """
    """
    mat_csr = mat.tocsr()
    nrows, ncols = mat_csr.shape
    a = np.arange(ncols, dtype=np.int32)
    b = np.empty(ncols, dtype=np.int32)
    result = []

    for i, row in enumerate(mat_csr):
        # Process one row at a time
        b.fill(i)
        result.append(process_row(b, a))

    return result

if __name__ == '__main__':
    row  = np.array([8,2,7,4])
    col  = np.array([1,3,2,1])
    data = np.array([1,1,1,1])

    mat = scipy.sparse.coo_matrix((data, (row, col)))
    print assess(mat)

I am looking to see if there's any way to design this better so that it performs much faster. Essentially, the process_row function takes (row, col) index pairs (from a, b) and does some math using another sparse matrix and returns a result. I don't have the option to change this function but it can actually process different row/col pairs and is not restricted to processing everything from the same row.

hpaulj hpaulj · Accepted Answer · 2016-04-12T19:20:58

Your problem looks similar to this other recent SO question:

Calculate the euclidean distance in scipy csr matrix

In my answer I sketched a way of iterating over the rows of a sparse matrix. I think it is faster to convert the array to lil, and construct the dense rows directly from its sublists. This avoids the overhead of creating a new sparse matrix for each row. But I haven't done time tests.

https://stackoverflow.com/a/36559702/901925

Maybe this applies to your case.

Optimize Scipy Sparse Matrix

1 Answers