
I have this code to sum each row of a scipy sparse csr matrix:

count_list = dtm.toarray().sum(axis=0)

How can I instead sum each row as if every non-zero value were 1? I could replace all values > 0 with 1 and then use the same code as above. I could also iterate over each row in the matrix and use NumPy's count_nonzero.

count_list = [np.count_nonzero(v) for v in row.toarray() for row in dtm]

Is there any easier, or more straightforward way (similar to the method in the first example)?

Do you have any explicit zeros? Also, note that your last example won't run (since the fors are swapped). - fuglede

2 Answers


Assuming that you have no explicit zeros, this is

count_list = dtm.indptr[1:] - dtm.indptr[:-1]

For example:

In [34]: dtm = scipy.sparse.random(1000, 1000, format='csr')                                    

In [35]: count_list_np = [np.count_nonzero(v) for row in dtm for v in row.toarray()]            

In [36]: count_list = dtm.indptr[1:] - dtm.indptr[:-1]                                          

In [37]: np.array_equal(count_list, count_list_np)                                              
Out[37]: True
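
To see why this works: in CSR format, indptr[i] and indptr[i+1] delimit the slice of stored entries that belong to row i, so consecutive differences give the number of stored entries per row. A minimal sketch on a small hand-built matrix (the values are made up for illustration):

```python
import numpy as np
from scipy import sparse

# Small hand-built example; the values are arbitrary.
dense = np.array([[0, 5, 0, 2],
                  [0, 0, 0, 0],
                  [7, 0, 0, 0]])
dtm = sparse.csr_matrix(dense)

# indptr[i]:indptr[i+1] is the slice of stored entries belonging to row i.
print(dtm.indptr)                  # [0 2 2 3]

count_list = np.diff(dtm.indptr)   # same as indptr[1:] - indptr[:-1]
print(count_list)                  # [2 0 1]
```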

If you do have explicit zeros, simply remove them first, using eliminate_zeros:

dtm.eliminate_zeros()  # in place; drops stored entries whose value is 0
count_list = dtm.indptr[1:] - dtm.indptr[:-1]

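
As a rough illustration of the explicit-zero case, the sketch below builds a CSR matrix directly from (data, indices, indptr) so that one stored entry is a zero. The numbers and the names stored_counts and dtm_clean are invented for this example. It works on a copy because eliminate_zeros modifies the matrix in place.

```python
import numpy as np
from scipy import sparse

# Row 0 stores 3.0 at column 0 and an explicit 0.0 at column 2;
# row 1 stores 4.0 at column 1. Values are illustrative only.
data = np.array([3.0, 0.0, 4.0])
indices = np.array([0, 2, 1])
indptr = np.array([0, 2, 3])
dtm = sparse.csr_matrix((data, indices, indptr), shape=(2, 3))

stored_counts = np.diff(dtm.indptr)   # counts the explicit zero too

dtm_clean = dtm.copy()                # keep the original untouched
dtm_clean.eliminate_zeros()           # in place on the copy
count_list = np.diff(dtm_clean.indptr)

print(stored_counts)   # [2 1]
print(count_list)      # [1 1]
```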
In [1]: from scipy import sparse                                                
In [2]: M = sparse.random(10,10,.2, 'csr')                                      
In [3]: M                                                                       
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>
In [4]: M.astype(bool)                                                          
<10x10 sparse matrix of type '<class 'numpy.bool_'>'
    with 20 stored elements in Compressed Sparse Row format>

In [6]: M.astype(bool).sum(axis=0)                                              
Out[6]: matrix([[0, 3, 4, 3, 1, 3, 1, 0, 2, 3]], dtype=int64)

Compare that with the dense array, converted to 0/1 integers:

In [7]: M.astype(bool).astype(int).A                                            
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]])

Check the total against the matrix nnz:

In [8]: M.astype(bool).sum(axis=0).sum()                                        
Out[8]: 20

With axis=0, the sum runs down the rows, giving one value per column. To sum across the columns (one value per row), use axis=1:

In [13]: M.astype(bool).sum(axis=1)                                             

The result is an (n,1) dense matrix. You can use A1 to make a 1d array: M.astype(bool).sum(axis=1).A1

The distinction is easier to see when the matrix isn't square.
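
A small sketch with a non-square matrix makes the two axes easy to tell apart. The 2x4 values and the names per_column and per_row are made up for illustration. np.asarray(...).ravel() is used here in place of A1 only so the snippet does not depend on the result being a numpy matrix.

```python
import numpy as np
from scipy import sparse

# A 2x4 matrix with arbitrary values, so rows and columns differ in length.
M = sparse.csr_matrix(np.array([[1.5, 0.0, 2.0, 0.0],
                                [0.0, 0.0, 3.0, 0.0]]))

B = M.astype(bool)   # every stored non-zero becomes True

per_column = np.asarray(B.sum(axis=0)).ravel()  # 4 values, one per column
per_row = np.asarray(B.sum(axis=1)).ravel()     # 2 values, one per row

print(per_column)   # [1 0 2 0]
print(per_row)      # [2 1]
```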

count_nonzero can do the same with the dense array (but not the sparse one):

In [15]: np.count_nonzero(M.A,axis=1)                                           
Out[15]: array([0, 4, 2, 2, 3, 1, 4, 1, 1, 2])

With @fuglede's indptr approach:

In [18]: np.diff(M.indptr)                                                      
Out[18]: array([0, 4, 2, 2, 3, 1, 4, 1, 1, 2], dtype=int32)
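
As a possible alternative, CSR matrices also have a getnnz method that accepts an axis argument. As far as I know, getnnz(axis=1) returns the same per-row count of stored entries as the indptr difference, so explicit zeros are counted too. The sketch below only checks that the approaches agree on a random matrix. The shape, density and seed are arbitrary choices for the example.

```python
import numpy as np
from scipy import sparse

# Arbitrary shape, density and seed, chosen only for illustration.
M = sparse.random(6, 4, density=0.3, format='csr', random_state=0)

from_indptr = np.diff(M.indptr)                         # stored entries per row
from_getnnz = M.getnnz(axis=1)                          # stored entries per row
from_dense = np.count_nonzero(M.toarray(), axis=1)      # true non-zeros per row

print(np.array_equal(from_indptr, from_getnnz))
print(np.array_equal(from_indptr, from_dense))
```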