I have a sparse matrix where I'm currently enumerating over each row and performing some calculations based on the information from each row. Each row is completely independent of the others. However, for large matrices, this code is extremely slow (takes about 2 hours) and I can't convert the matrix to a dense one either (limited to 8GB RAM).
import scipy.sparse
import numpy as np
def process_row(a, b):
"""
a - contains the row indices for a sparse matrix
b - contains the column indices for a sparse matrix
Returns a new vector of length(a)
"""
return
def assess(mat):
"""
"""
mat_csr = mat.tocsr()
nrows, ncols = mat_csr.shape
a = np.arange(ncols, dtype=np.int32)
b = np.empty(ncols, dtype=np.int32)
result = []
for i, row in enumerate(mat_csr):
# Process one row at a time
b.fill(i)
result.append(process_row(b, a))
return result
if __name__ == '__main__':
row = np.array([8,2,7,4])
col = np.array([1,3,2,1])
data = np.array([1,1,1,1])
mat = scipy.sparse.coo_matrix((data, (row, col)))
print assess(mat)
I am looking to see if there's any way to design this better so that it performs much faster. Essentially, the process_row
function takes (row, col) index pairs (from a, b) and does some math using another sparse matrix and returns a result. I don't have the option to change this function but it can actually process different row/col pairs and is not restricted to processing everything from the same row.