I have a bunch of data in SciPy compressed sparse row (CSR) format. Of course the majority of elements is zero, and I further know that all non-zero elements have a value of 1. I want to compute sums over different subsets of rows of my matrix. At the moment I am doing the following:
import numpy as np
import scipy as sp
import scipy.sparse
# create some data with sparsely distributed ones
data = np.random.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05))
data = sp.sparse.csr_matrix(data, dtype='int8')
# generate column-wise sums over random subsets of rows
nrand = 1000
for k in range(nrand):
inds = np.random.choice(data.shape[0], size=100, replace=False)
# 60% of time is spent here
extracted_rows = data[inds]
# 20% of time is spent here
row_sum = extracted_rows.sum(axis=0)
The last few lines there are the bottleneck in a larger computational pipeline. As I annotated in the code, 60% of time is spent slicing the data from the random indices, and 20% is spent computing the actual sum.
It seems to me I should be able to use my knowledge about the data in the array (i.e., any non-zero value in the sparse matrix will be 1; no other values present) to compute these sums more efficiently. Unfortunately, I cannot figure out how. Dealing with just data.indices
perhaps? I have tried other sparsity structures (e.g. CSC matrix), as well as converting to dense array first, but these approaches were all slower than this CSR matrix approach.
rows, cols = extracted_rows.nonzero()
which gives you the indices of the nonzero components and possiblynp.count_nonzero()
which counts nonzero entries in a numb array – benbo