0
votes

I have this code to summarize each row of a scipy sparse csr matrix:

count_list = dtm.toarray().sum(axis=0)

How can I instead sum each row as if every non-zero value were 1? I could replace all values > 0 with 1 and then use the same code as above. I could also iterate over the rows of the matrix and use NumPy's count_nonzero:

count_list = [np.count_nonzero(v) for v in row.toarray() for row in dtm]

Is there an easier or more straightforward way (similar to the first example)?

2
Do you have any explicit zeros? Also, note that your last example won't run (since the fors are swapped). – fuglede

2 Answers

1
votes

Assuming that you have no explicit zeros, the number of non-zero entries in each row is the difference between consecutive entries of indptr:

count_list = dtm.indptr[1:] - dtm.indptr[:-1]

For example:

In [34]: dtm = scipy.sparse.random(1000, 1000, format='csr')                                    

In [35]: count_list_np = [np.count_nonzero(v) for row in dtm for v in row.toarray()]            

In [36]: count_list = dtm.indptr[1:] - dtm.indptr[:-1]                                          

In [37]: np.array_equal(count_list, count_list_np)                                              
Out[37]: True

If you do have explicit zeros, simply remove them first, using eliminate_zeros:

dtm.eliminate_zeros()
count_list = dtm.indptr[1:] - dtm.indptr[:-1]
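
To see why explicit zeros matter, here is a minimal sketch on a small hand-built matrix (the values and shape are made up for illustration). The indptr differences count stored entries, so an explicitly stored 0.0 is included until eliminate_zeros removes it:

```python
import numpy as np
from scipy import sparse

# Hand-built 3x4 CSR matrix (illustrative values only).
# Row 1 stores two entries, one of which is an explicit zero.
data = np.array([5.0, 0.0, 2.0, 7.0])
indices = np.array([0, 1, 3, 2])
indptr = np.array([0, 1, 3, 4])
dtm = sparse.csr_matrix((data, indices, indptr), shape=(3, 4))

# indptr differences count *stored* entries, so the explicit zero is included.
stored_per_row = np.diff(dtm.indptr)

# Drop explicit zeros in place, then count again.
dtm.eliminate_zeros()
nonzero_per_row = np.diff(dtm.indptr)

print(stored_per_row)
print(nonzero_per_row)
```

Note that eliminate_zeros modifies the matrix in place, so work on a copy (dtm.copy()) if the original storage should stay untouched.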

2
votes

In [1]: from scipy import sparse                                                
In [2]: M = sparse.random(10,10,.2, 'csr')                                      
In [3]: M                                                                       
Out[3]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>
In [4]: M.astype(bool)                                                          
Out[4]: 
<10x10 sparse matrix of type '<class 'numpy.bool_'>'
    with 20 stored elements in Compressed Sparse Row format>

In [6]: M.astype(bool).sum(axis=0)                                              
Out[6]: matrix([[0, 3, 4, 3, 1, 3, 1, 0, 2, 3]], dtype=int64)

Compare that with the dense array, converted to 0/1 integers:

In [7]: M.astype(bool).astype(int).A                                            
Out[7]: 
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]])

Check the total against the matrix nnz:

In [8]: M.astype(bool).sum(axis=0).sum()                                        
Out[8]: 20

With axis=0, the sum runs down the rows, giving one value per column. To sum across the columns (one value per row), use axis=1:

In [13]: M.astype(bool).sum(axis=1)                                             
Out[13]: 
matrix([[0],
        [4],
        [2],
        [2],
        [3],
        [1],
        [4],
        [1],
        [1],
        [2]])

This is an (n, 1) dense matrix. You can use A1 to make a 1d array: M.astype(bool).sum(axis=1).A1

The distinction is easier to see when the matrix isn't square.
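
As a small sketch of that point, assuming a made-up 2x3 matrix, the two axes give results of different lengths, so they can't be confused:

```python
import numpy as np
from scipy import sparse

# Illustrative 2x3 matrix: 2 rows, 3 columns.
M = sparse.csr_matrix(np.array([[0.5, 0.0, 2.0],
                                [0.0, 0.0, 3.0]]))

B = M.astype(bool)
per_column = B.sum(axis=0).A1  # one count per column, length 3
per_row = B.sum(axis=1).A1     # one count per row, length 2

print(per_column.shape, per_row.shape)
```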

count_nonzero can do the same with the dense array (but not the sparse one):

In [15]: np.count_nonzero(M.A,axis=1)                                           
Out[15]: array([0, 4, 2, 2, 3, 1, 4, 1, 1, 2])

With @fuglede's indptr approach:

In [18]: np.diff(M.indptr)                                                      
Out[18]: array([0, 4, 2, 2, 3, 1, 4, 1, 1, 2], dtype=int32)
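
Putting the three approaches side by side, here is a rough check that they agree. The 6x8 test matrix, the seed, and the 0.7 threshold are arbitrary choices for illustration, and the matrix is built from a dense array so it contains no explicit zeros:

```python
import numpy as np
from scipy import sparse

# Build a reproducible 6x8 matrix that is mostly zeros (illustrative data).
rng = np.random.default_rng(0)
dense = rng.random((6, 8))
dense[dense < 0.7] = 0.0
M = sparse.csr_matrix(dense)

# 1. Boolean sum over axis=1, flattened to a 1d array.
by_bool_sum = M.astype(bool).sum(axis=1).A1

# 2. count_nonzero on the dense copy (needs the full dense array in memory).
by_count_nonzero = np.count_nonzero(M.toarray(), axis=1)

# 3. indptr differences (counts stored entries; no dense copy needed).
by_indptr = np.diff(M.indptr)

print(np.array_equal(by_bool_sum, by_count_nonzero))
print(np.array_equal(by_bool_sum, by_indptr))
```

Only the indptr version avoids creating a new matrix or dense array, which may matter for a large document-term matrix.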