Summarize non-zero values in a scipy matrix by axis

Question

I have this code to summarize each row of a scipy sparse csr matrix:

count_list = dtm.toarray().sum(axis=0)

How can I instead summarize each row as if each non-zero value were 1? I could replace all values >0 with 1 and then use the same code as above. I could also iterate over each row in the matrix and use NumPy's count_nonzero.

count_list = [np.count_nonzero(v) for v in row.toarray() for row in dtm]

Is there an easier or more straightforward way (similar to the method in the first example)?

Do you have any explicit zeros? Also, note that your last example won't run (since the fors are swapped). — fuglede

fuglede · Accepted Answer · 2019-12-14T19:48:45

Assuming that you have no explicit zeros, the number of non-zero values in each row is the difference between consecutive entries of indptr:

count_list = dtm.indptr[1:] - dtm.indptr[:-1]

For example:

In [34]: dtm = scipy.sparse.random(1000, 1000, format='csr')

In [35]: count_list_np = [np.count_nonzero(v) for row in dtm for v in row.toarray()]

In [36]: count_list = dtm.indptr[1:] - dtm.indptr[:-1]

In [37]: np.array_equal(count_list, count_list_np)
Out[37]: True
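
To see why this works, here is a minimal sketch on a tiny matrix. In CSR format, the stored entries of row i occupy positions indptr[i] to indptr[i+1] of the data array, so consecutive differences of indptr are the per-row counts. The matrix values and the names dtm_small, row_counts and col_counts are made up for illustration. The sketch also uses getnnz(axis=...), which counts stored entries along an axis; note that the question's first snippet uses axis=0, which works column-wise, so the column variant is shown as well in case that is what you need.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical 3x4 matrix; the values are chosen only for illustration.
dense = np.array([
    [5, 0, 0, 2],
    [0, 0, 0, 0],
    [1, 3, 7, 0],
])
dtm_small = sp.csr_matrix(dense)

# indptr[i]:indptr[i+1] is the slice of data/indices belonging to row i.
print(dtm_small.indptr)                 # [0 2 2 5]

# Same as dtm_small.indptr[1:] - dtm_small.indptr[:-1]
row_counts = np.diff(dtm_small.indptr)
print(row_counts)                       # [2 0 3]

# getnnz counts stored entries along an axis (explicit zeros included).
row_counts_nnz = dtm_small.getnnz(axis=1)   # per row
col_counts = dtm_small.getnnz(axis=0)       # per column
print(col_counts)                       # [2 1 1 1]
```

Neither approach converts the matrix to a dense array, which matters for large document-term matrices.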

If you do have explicit zeros, simply remove them first, using eliminate_zeros:

dtm.eliminate_zeros()
count_list = dtm.indptr[1:] - dtm.indptr[:-1]
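
As a small illustration of what an explicit zero does to the count, the sketch below builds a CSR matrix directly from (data, indices, indptr) with one stored zero. The arrays and variable names are hypothetical. Since eliminate_zeros modifies the matrix in place, the sketch works on a copy; that is an optional choice for when the original matrix must stay untouched.

```python
import numpy as np
import scipy.sparse as sp

# Build a 2x3 CSR matrix that stores a zero explicitly.
data = np.array([4, 0, 9])       # the middle entry is a stored zero
indices = np.array([0, 2, 1])    # column index of each stored entry
indptr = np.array([0, 2, 3])     # row 0 owns entries 0..1, row 1 owns entry 2
m = sp.csr_matrix((data, indices, indptr), shape=(2, 3))

# The indptr difference counts stored entries, so the explicit zero is included.
counts_before = np.diff(m.indptr)
print(counts_before)             # [2 1]

# eliminate_zeros works in place, so operate on a copy to leave m unchanged.
m_clean = m.copy()
m_clean.eliminate_zeros()
counts_after = np.diff(m_clean.indptr)
print(counts_after)              # [1 1]
```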
