Remove nan rows in a scipy sparse matrix

Question

I am given a (normalized) sparse adjacency matrix and a list of labels for the respective matrix rows. Because some nodes have been removed by another sanitization function, there are some rows containing NaNs in the matrix. I want to find these rows and remove them as well as their respective labels. Here is the function I wrote:

def sanitize_nans(adj, labels):
    # convert to numpy array and keep dimension
    adj = np.array(adj, ndmin=2)

    for i, row in enumerate(adj):
        # check if row all nans
        if np.all(np.isnan(row)):
            # print("Removing nan row label in %s" % i)
            # remove row index from labels
            del labels[i]
    # remove all nan rows
    adj = adj[~np.all(np.isnan(adj), axis=1)]
    # return sanitized adj and labels_clean
    return adj, labels

labels is a simple Python list and adj has the type <class 'scipy.sparse.lil.lil_matrix'> (containing elements of type <class 'numpy.float64'>), which are both the result of

adj, labels = nx.attr_sparse_matrix(infected, normalized=True)

On execution I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-503-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)

<ipython-input-502-ead99efec677> in sanitize_nans(adj, labels)
      6     for i, row in enumerate(adj):
      7         # check if row all nans
----> 8         if np.all(np.isnan(row)):
      9             print("Removing nan row label in %s" % i)
     10             # remove row index from labels

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

So I thought that SciPy NaNs were different from numpy NaNs. After that I tried to convert the sparse matrix into a numpy array (taking the risk of flooding my RAM, because the matrix has about 40k rows and columns). When running it, the error stays the same however. It seems that the np.array() call just wrapped the sparse matrix and didn't convert it, as type(row) inside the for loop still outputs <class 'scipy.sparse.lil.lil_matrix'>

So my question is how to resolve this issue and whether there is a better approach that gets the job done. I am fairly new to numpy and scipy (as used in networkx), so I'd appreciate an explanation. Thank you!

EDIT: After changing the conversion to what hpaulj proposed, I'm getting a MemoryError:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-519-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)

<ipython-input-518-44201f4ff35c> in sanitize_nans(adj, labels)
      1 def sanitize_nans(adj, labels):
----> 2     adj = adj.toarray()
      3 
      4     for i, row in enumerate(adj):
      5         # check if row all nans

/usr/lib/python3/dist-packages/scipy/sparse/lil.py in toarray(self, order, out)
    348     def toarray(self, order=None, out=None):
    349         """See the docstring for `spmatrix.toarray`."""
--> 350         d = self._process_toarray_args(order, out)
    351         for i, row in enumerate(self.rows):
    352             for pos, j in enumerate(row):

/usr/lib/python3/dist-packages/scipy/sparse/base.py in _process_toarray_args(self, order, out)
    697             return out
    698         else:
--> 699             return np.zeros(self.shape, dtype=self.dtype, order=order)
    700 
    701 

MemoryError:

So apparently I'll have to stick with the sparse matrix to save RAM.
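For a sense of scale, here is a rough estimate of the dense array that toarray() tries to allocate. It assumes exactly 40,000 rows and columns (the question only says "about 40k") and the float64 element type mentioned above:

```python
import numpy as np

# Assumption: exactly 40,000 rows and columns; the question says "about 40k".
n = 40_000

# float64 takes 8 bytes per element.
itemsize = np.dtype(np.float64).itemsize

# A dense array stores every element, including all the zeros.
dense_bytes = n * n * itemsize
dense_gib = dense_bytes / 2**30

print(f"{dense_bytes:,} bytes (~{dense_gib:.1f} GiB)")
# prints: 12,800,000,000 bytes (~11.9 GiB)
```

The sparse matrix only stores the nonzero entries, which is why it fits in memory while the dense version does not.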

A sparse matrix is not a dense array. Look at adj.data and adj.rows. For an lil matrix these are object arrays of lists, one pair of sublists per row of the matrix. — hpaulj
Thanks for your quick reply! I edited the question according to your proposed changes and my results. (I just changed the conversion line to adj = adj.toarray()) — dmuhs
Is the whole row of adj NaN or just the nonzero values of the row? A large sparse matrix may have thousands of columns, but only hundreds of non-zero entries per row. Most values in a row will be 0 (and absent from the sparse data base). — hpaulj

hpaulj · Accepted Answer · 2016-09-07T22:11:18

If I make a sample array:

In [328]: A=np.array([[1,0,0,np.nan],[0,np.nan,np.nan,0],[1,0,1,0]])
In [329]: A
Out[329]: 
array([[  1.,   0.,   0.,  nan],
       [  0.,  nan,  nan,   0.],
       [  1.,   0.,   1.,   0.]])

In [331]: M=sparse.lil_matrix(A)

This lil sparse matrix is stored in 2 arrays:

In [332]: M.data
Out[332]: array([[1.0, nan], [nan, nan], [1.0, 1.0]], dtype=object)
In [333]: M.rows
Out[333]: array([[0, 3], [1, 2], [0, 2]], dtype=object)

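With your function, no rows will be removed even if the conversion to a dense array succeeds. The middle row's only nonzero entries are nan, but its zeros are not nan, so np.all(np.isnan(...)) is False for every row: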

In [334]: A[~np.all(np.isnan(A), axis=1)]
Out[334]: 
array([[  1.,   0.,   0.,  nan],
       [  0.,  nan,  nan,   0.],
       [  1.,   0.,   1.,   0.]])

I could test the rows of M for nan, and identify the ones that only contain nan (besides 0s). But it's probably easier to collect the ones that we want to keep.

In [346]: ll = [i for i,row in enumerate(M.data) if not np.all(np.isnan(row))]
In [347]: ll
Out[347]: [0, 2]
In [348]: M[ll,:]
Out[348]: 
<2x4 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in LInked List format>
In [349]: _.A
Out[349]: 
array([[  1.,   0.,   0.,  nan],
       [  1.,   0.,   1.,   0.]])

Each row of M.data is a list, but np.isnan(row) converts it to an array and then does its array test.
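Putting this together with the labels from the question, a minimal sketch of a replacement function could look like the following. The name drop_nan_rows is hypothetical, and it makes one choice that differs from the one-liner above: a row with no stored entries (all zeros) is kept, whereas np.all on an empty list is True and would drop it. It also returns a new label list instead of deleting from labels in place, which avoids shifting indices while looping.

```python
import numpy as np
from scipy import sparse


def drop_nan_rows(adj, labels):
    """Drop rows whose stored (nonzero) entries are all NaN, plus their labels.

    Hypothetical helper: works on the sparse structure only, so no dense
    copy of the matrix is created.
    """
    # Ensure LIL format so that .data holds one list of stored values per row.
    lil = sparse.lil_matrix(adj)

    # Keep a row unless it has stored values and every one of them is NaN.
    # Assumption: rows with no stored values (all zeros) are kept.
    keep = [
        i for i, row in enumerate(lil.data)
        if not (len(row) > 0 and np.all(np.isnan(row)))
    ]

    # Build a new label list rather than deleting in place.
    kept_labels = [labels[i] for i in keep]
    return lil[keep, :], kept_labels


# Usage with the sample array from above and made-up labels.
A = np.array([[1, 0, 0, np.nan], [0, np.nan, np.nan, 0], [1, 0, 1, 0]])
M = sparse.lil_matrix(A)
clean, clean_labels = drop_nan_rows(M, ["a", "b", "c"])
print(clean.shape, clean_labels)  # (2, 4) ['a', 'c']
```

Note that the first row still contains a nan next to a real value. Whether such partially-nan rows should also go depends on the data, and the test inside the list comprehension would change from np.all to np.any in that case.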
