I am given a (normalized) sparse adjacency matrix and a list of labels for the respective matrix rows. Because some nodes have been removed by another sanitization function, there are some rows containing NaNs in the matrix. I want to find these rows and remove them as well as their respective labels. Here is the function I wrote:
```python
def sanitize_nan_rows(adj, labels):
    # convert to numpy array and keep dimension
    adj = np.array(adj, ndmin=2)

    for i, row in enumerate(adj):
        # check if row is all NaNs
        if np.all(np.isnan(row)):
            # print("Removing NaN row label in %s" % i)
            # remove row index from labels
            del labels[i]

    # remove all-NaN rows
    adj = adj[~np.all(np.isnan(adj), axis=1)]

    # return sanitized adj and labels
    return adj, labels
```
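For comparison, here is the same row-mask idea written without the `del labels[i]` inside the loop (deleting by index while iterating shifts all later indices, so the wrong labels can get removed). This is only a sketch that assumes `adj` were already a dense ndarray; the function name is mine:

```python
import numpy as np

def sanitize_nan_rows_dense(adj, labels):
    # one boolean per row: True if the row should be kept
    keep = ~np.all(np.isnan(adj), axis=1)
    labels = [lab for lab, k in zip(labels, keep) if k]
    return adj[keep], labels
```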
`labels` is a plain Python list and `adj` has the type `<class 'scipy.sparse.lil.lil_matrix'>` (containing elements of type `<class 'numpy.float64'>`); both are the result of

```python
adj, labels = nx.attr_sparse_matrix(infected, normalized=True)
```
On execution I get the following error:
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-503-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)

<ipython-input-502-ead99efec677> in sanitize_nans(adj, labels)
      6     for i, row in enumerate(adj):
      7         # check if row all nans
----> 8         if np.all(np.isnan(row)):
      9             print("Removing nan row label in %s" % i)
     10             # remove row index from labels

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
```
So I thought that SciPy NaNs were different from NumPy NaNs. After that I tried to convert the sparse matrix into a NumPy array (taking the risk of flooding my RAM, because the matrix has about 40k rows and columns). However, when running it the error stays the same. It seems that the `np.array()` call just wrapped the sparse matrix and didn't convert it, as `type(row)` inside the for loop still outputs `<class 'scipy.sparse.lil.lil_matrix'>`.

So my question is how to resolve this issue, and whether there is a better approach that gets the job done. I am fairly new to NumPy and SciPy (as used in networkx), so I'd appreciate an explanation. Thank you!
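To illustrate what seems to be happening, here is a minimal toy example (a small matrix of my own, not the real data): iterating over a `lil_matrix` yields sparse row matrices, which `np.isnan` cannot handle; densifying one row at a time works.

```python
import numpy as np
from scipy.sparse import lil_matrix

toy = lil_matrix(np.array([[np.nan, np.nan], [1.0, 0.0]]))
for row in toy:
    print(type(row))                 # each row is itself a sparse matrix
    # np.isnan(row)                  # raises the TypeError above
    print(np.isnan(row.toarray()))   # densifying a single row works
```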
EDIT: After changing the conversion to what hpaulj proposed, I'm getting a MemoryError:
```
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-519-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)

<ipython-input-518-44201f4ff35c> in sanitize_nans(adj, labels)
      1 def sanitize_nans(adj, labels):
----> 2     adj = adj.toarray()
      3
      4     for i, row in enumerate(adj):
      5         # check if row all nans

/usr/lib/python3/dist-packages/scipy/sparse/lil.py in toarray(self, order, out)
    348     def toarray(self, order=None, out=None):
    349         """See the docstring for `spmatrix.toarray`."""
--> 350         d = self._process_toarray_args(order, out)
    351         for i, row in enumerate(self.rows):
    352             for pos, j in enumerate(row):

/usr/lib/python3/dist-packages/scipy/sparse/base.py in _process_toarray_args(self, order, out)
    697             return out
    698         else:
--> 699             return np.zeros(self.shape, dtype=self.dtype, order=order)
    700
    701

MemoryError:
```
So apparently I'll have to stick with the sparse matrix to save RAM.
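Going by hpaulj's hints in the comments below, a sparse-aware version might look like the following sketch. It stays with the `lil_matrix` structure instead of densifying, and it assumes a row counts as all-NaN when every *stored* value of that row is NaN (empty rows are kept); the function name is mine and this is untested on the real 40k × 40k data:

```python
import numpy as np

def sanitize_nan_rows_sparse(adj, labels):
    # adj.data holds one list of stored (non-zero) values per row
    keep = np.array([not (len(vals) and np.all(np.isnan(vals)))
                     for vals in adj.data])
    # integer row indexing is cheap on CSR; convert back to LIL afterwards
    adj = adj.tocsr()[np.flatnonzero(keep)].tolil()
    labels = [lab for lab, k in zip(labels, keep) if k]
    return adj, labels
```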
Comments:

- `adj.data` and `adj.rows`: for a `lil` matrix these are object arrays of lists, one pair of sublists per row of the array. – hpaulj
- `adj.A` or `adj.toarray()` produces an array. – hpaulj
- `adj = adj.toarray()` – dmuhs
- Is a whole row of `adj` NaN, or just the nonzero values of the row? A large sparse matrix may have thousands of columns, but only hundreds of non-zero entries per row. Most values in a row will be 0 (and absent from the sparse data base). – hpaulj
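To see the structure hpaulj describes, a tiny self-contained example (toy matrix of my own):

```python
import numpy as np
from scipy.sparse import lil_matrix

m = lil_matrix((3, 4))
m[0, 1] = np.nan
m[2, 0] = 5.0

print(m.rows)  # column indices of the stored entries, one list per row
print(m.data)  # the stored values themselves, one list per row
```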