5
votes

I'm trying to calculate the mean of the non-zero values in each row of a sparse row matrix. The matrix's mean method doesn't do this, because it divides by the full row length, zeros included:

>>> from scipy.sparse import csr_matrix
>>> a = csr_matrix([[0, 0, 2], [1, 3, 8]])
>>> a.mean(axis=1)
matrix([[ 0.66666667],
        [ 4.        ]])

The following works but is slow for large matrices:

>>> import numpy as np
>>> b = np.zeros(a.shape[0])
>>> for i in range(a.shape[0]):
...    b[i] = a.getrow(i).data.mean()
... 
>>> b
array([ 2.,  4.])

Could anyone please tell me if there is a faster method?


4 Answers

6
votes

This looks like a typical use case for numpy.bincount. I used three function calls:

(x,y,z)=scipy.sparse.find(a)

returns the row indices (x), column indices (y), and values (z) of the nonzero entries of the sparse matrix. For instance, x is array([0, 1, 1, 1]).

numpy.bincount(x) returns, for each row number, how many nonzero elements that row has.

numpy.bincount(x, weights=z) returns, for each row, the sum of its nonzero elements.

The complete working code:

from scipy.sparse import csr_matrix
a = csr_matrix([[0, 0, 2], [1, 3, 8]])

import numpy
import scipy.sparse
(x,y,z)=scipy.sparse.find(a)
countings=numpy.bincount(x)
sums=numpy.bincount(x,weights=z)
averages=sums/countings

print(averages)

returns:

[ 2.  4.]
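
One caveat with the bincount approach: bincount only returns as many bins as the largest row index it sees, and a row with no nonzero entries gets a count of 0, so the plain division can come out too short or divide by zero. The sketch below is a hedged variant that passes minlength so every row gets a bin and reports empty rows as NaN. The helper name row_means_bincount and the NaN convention are my own assumptions, not part of the code above.

```python
import numpy as np
from scipy.sparse import csr_matrix, find


def row_means_bincount(a):
    """Mean of the nonzero values in each row (hypothetical helper name)."""
    n_rows = a.shape[0]
    rows, _, vals = find(a)  # row indices and values of the nonzero entries
    counts = np.bincount(rows, minlength=n_rows)
    sums = np.bincount(rows, weights=vals, minlength=n_rows)
    means = np.full(n_rows, np.nan)  # assumption: empty rows are reported as NaN
    np.divide(sums, counts, out=means, where=counts > 0)
    return means


# The question's matrix with an extra all-zero row appended.
a = csr_matrix([[0, 0, 2], [1, 3, 8], [0, 0, 0]])
means = row_means_bincount(a)
print(means)
```

The first two entries match the averages above; the third row has no nonzero values, so its mean stays NaN.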

7
votes

With a CSR format matrix, you can do this even more easily:

import numpy as np

sums = a.sum(axis=1).A1      # row sums as a flat 1-D array
counts = np.diff(a.indptr)   # number of stored entries in each row
averages = sums / counts

Row sums are directly supported, and the structure of the CSR format means that the difference between successive values in the indptr array corresponds exactly to the number of stored (normally nonzero) elements in each row.
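
To make the indptr reasoning concrete, here is a small self-contained sketch on the question's matrix. It uses np.asarray(...).ravel() in place of .A1, on the assumption that you may want the same code to work when the row sums come back as a plain ndarray; the variable names are only illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

a = csr_matrix([[0, 0, 2], [1, 3, 8]])

# a.data[indptr[i]:indptr[i + 1]] holds the stored values of row i,
# so consecutive differences give the number of stored entries per row.
print(a.indptr)              # row boundaries into a.data
counts = np.diff(a.indptr)   # stored entries per row

sums = np.asarray(a.sum(axis=1)).ravel()  # flatten the (n_rows, 1) result
averages = sums / counts
print(averages)
```

A row without any stored entries has a count of 0, so this division yields NaN for it (with a runtime warning); mask those rows first if that matters for your data.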

3
votes

I prefer to sum the values along the axis of interest and divide by the number of nonzero elements in the respective row or column.

Like so:

sp_arr = csr_matrix([[0, 0, 2], [1, 3, 8]])
col_avg = sp_arr.sum(0) / (sp_arr != 0).sum(0)
row_avg = sp_arr.sum(1) / (sp_arr != 0).sum(1)
print(col_avg)
matrix([[ 1.,  3.,  5.]])
print(row_avg)
matrix([[ 2.],
        [ 4.]])

Basically, you sum all entries along the given axis and divide by the number of True entries in the matrix != 0 mask along the same axis, which is the number of nonzero entries.

I find this approach less complicated and easier than the other options.

1
votes

A simple way to get the averages along an axis (here, per column):

a.sum(axis=0) / a.getnnz(axis=0)

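This assumes the matrix stores no explicit zeros, since getnnz counts stored entries. For per-row averages, use axis=1 and flatten the sum first (for example with .A1), because a.sum(axis=1) returns a column matrix while a.getnnz(axis=1) returns a 1-D array.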
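
Because getnnz counts stored entries, an explicitly stored zero would inflate the denominator. Below is a minimal sketch of per-row averages that guards against this, assuming it is acceptable to modify the matrix in place with eliminate_zeros (copy it first otherwise). The sample data are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same values as [[0, 0, 2], [1, 3, 8]], but the zero at (0, 0) is stored explicitly.
data = np.array([0, 2, 1, 3, 8])
indices = np.array([0, 2, 0, 1, 2])
indptr = np.array([0, 2, 5])
a = csr_matrix((data, indices, indptr), shape=(2, 3))

stored_before = a.getnnz(axis=1)  # includes the explicit zero in row 0

a.eliminate_zeros()               # drops explicitly stored zeros in place
counts = a.getnnz(axis=1)

# a.sum(axis=1) is 2-D, so flatten it before dividing by the 1-D counts.
sums = np.asarray(a.sum(axis=1)).ravel()
row_avg = sums / counts
print(stored_before, counts, row_avg)
```

Without the eliminate_zeros call, row 0 would be divided by 2 instead of 1 and its average would come out as 1 rather than 2.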