One of the best ways to build a scipy sparse matrix is with the coo_matrix constructor, i.e.:

coo_matrix((data, (i, j)), [shape=(M, N)])

where:
data[:] are the entries of the matrix, in any order
i[:] are the row indices of the matrix entries
j[:] are the column indices of the matrix entries
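
As a small illustration of that constructor, here is a minimal sketch. The values and the 3x4 shape are made up for the example and are not from any real data set:

```python
import numpy as np
from scipy import sparse

# Hypothetical example values: three non-zero entries of a 3x4 matrix.
data = np.array([10, 20, 30])   # the entries, in any order
i = np.array([0, 1, 2])         # row index of each entry
j = np.array([1, 3, 0])         # column index of each entry

M = sparse.coo_matrix((data, (i, j)), shape=(3, 4))
dense = M.toarray()             # dense view, only sensible for tiny matrices
```

Each position k contributes data[k] at row i[k], column j[k]; everything else is zero.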

But if the matrix is very large, it is not practical to load the entire i, j and data vectors into memory.

How do you build a coo_matrix such that (data, (i, j)) is fed (with an iterator or generator) from disk and the array/vector objects on disk are either in .npy or pickle formats?

Pickle is the better option, as numpy.save/load are not optimized for scipy sparse. Maybe there is another, faster format.

Both numpy.genfromtxt() and numpy.loadtxt() are cumbersome, slow, and memory-hungry.

1 Answer

I don't quite understand. If the i, j, data arrays are too large to create or load into memory, then they are too large to create the sparse matrix.

If those three arrays are valid, the resulting sparse matrix will use them, without copying or alteration, as the corresponding attributes. A csr matrix constructed from the coo might be a little more compact, since its indptr array has one value per row (plus one) instead of one row index per non-zero. The data and indices arrays will be the same size as their coo counterparts, apart from any reduction from summing duplicates (and the entries may be reordered by sorting).
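
To make the size comparison concrete, here is a minimal sketch with made-up triplets, including one duplicated position so the effect of duplicate summing is visible:

```python
import numpy as np
from scipy import sparse

# Made-up triplets; position (0, 0) appears twice on purpose.
data = np.array([1.0, 2.0, 3.0, 4.0])
rows = np.array([0, 0, 2, 3])
cols = np.array([0, 0, 1, 2])

coo = sparse.coo_matrix((data, (rows, cols)), shape=(4, 4))
csr = coo.tocsr()  # duplicate entries are summed during the conversion

n_rows = coo.shape[0]
# coo stores one row index and one column index per stored entry.
coo_lengths = (len(coo.data), len(coo.row), len(coo.col))
# csr stores one column index per entry, plus n_rows + 1 row pointers.
csr_lengths = (len(csr.data), len(csr.indices), len(csr.indptr))
```

The saving comes only from replacing the per-entry row array with indptr, so it matters when there are many more non-zeros than rows.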

dok and lil formats can be used for incremental matrix creation, but they won't save memory in the long run. Both still have to have an entry for each non-zero data point. In the lil case you'll have a bunch of lists, while the dok is an actual dictionary.
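
A minimal sketch of that incremental style, assuming the entries arrive one at a time as (row, col, value) tuples. The `entries` list is hypothetical and stands in for whatever source produces them:

```python
from scipy import sparse

# Hypothetical stream of (row, col, value) entries.
entries = [(0, 1, 5.0), (2, 2, 7.0), (3, 0, 1.5)]

dok = sparse.dok_matrix((4, 4), dtype=float)  # dictionary keyed by (row, col)
lil = sparse.lil_matrix((4, 4), dtype=float)  # one list of columns/values per row
for r, c, v in entries:
    dok[r, c] = v
    lil[r, c] = v

# Convert once filling is done; every non-zero is still stored in memory.
csr_from_dok = dok.tocsr()
csr_from_lil = lil.tocsr()
```

Either route ends with the same csr matrix, and neither avoids holding all the non-zero entries at once.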

None of the sparse formats is 'virtual', creating elements 'on-the-fly' as needed.

I don't see how the various methods of loading the 3 defining arrays help if their total size is too large.

In [782]: data=np.ones((10,),int)
In [783]: rows=np.arange(10)
In [784]: cols=np.arange(10)
In [785]: M=sparse.coo_matrix((data,(rows,cols)))
In [786]: M.data
Out[786]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
In [787]: M.data is data
Out[787]: True
In [789]: M.col is cols
Out[789]: True

Basically the coo format is a way of storing these 3 arrays. The real work, all the math, summation, even indexing, is performed with the csr format.
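
As a rough sketch of that workflow, reusing the same 10x10 diagonal example (the explicit shape argument and the variable names are my additions):

```python
import numpy as np
from scipy import sparse

data = np.ones(10, dtype=int)
rows = np.arange(10)
cols = np.arange(10)

# coo holds the three arrays; convert to csr before doing real work.
coo = sparse.coo_matrix((data, (rows, cols)), shape=(10, 10))
csr = coo.tocsr()

element = csr[3, 3]            # scalar indexing on the csr form
row_sums = csr.sum(axis=1)     # per-row sums, returned as a dense column
product = csr @ np.arange(10)  # sparse matrix times a dense vector
```

Since this matrix is the identity, the product simply returns the input vector.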