I have a sparse matrix stored on disk in coordinate (triplet) format. I would like to read chunks of it into memory using scipy.sparse. However, scipy always treats the row and column indices as absolute, so the matrix it builds starts at (0, 0) regardless of which chunk I read. For the last chunk, for example, scipy builds a huge matrix whose only values sit in the bottom-right corner.

How can I handle the chunks so that calling toarray() produces a dense matrix covering only the subset that corresponds to that chunk?

The reason for doing this is that the matrix is too large for memory even in sparse form (approximately 600 million 32-bit floating-point values). The matrix represents a geospatial raster, so to display it on screen I need to convert it to a dense matrix and store it in a geospatial format (e.g. GeoTIFF).

1 Answer

You should be able to tweak the row and col values when building the subset. For example:

In [82]: import numpy as np
In [83]: from scipy import sparse
In [84]: row=np.arange(10)
In [85]: col=np.random.randint(0,6,row.shape)
In [86]: data=np.ones(row.shape,dtype=int)*2

In [87]: M=sparse.coo_matrix((data,(row,col)),shape=(10,6))

In [88]: M.toarray()
Out[88]: 
array([[0, 0, 2, 0, 0, 0],
       [0, 0, 0, 0, 0, 2],
       [0, 0, 0, 2, 0, 0],
       [0, 0, 2, 0, 0, 0],
       [0, 0, 2, 0, 0, 0],
       [0, 2, 0, 0, 0, 0],
       [2, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 2, 0],
       [0, 0, 0, 2, 0, 0],
       [0, 0, 0, 0, 0, 2]])

To build a matrix from a subset of the rows (here rows 5 to 9), slice the three arrays and subtract the starting row from the row indices:

In [89]: M1=sparse.coo_matrix((data[5:],(row[5:]-5,col[5:])),shape=(5,6))

In [90]: M1.toarray()
Out[90]: 
array([[0, 2, 0, 0, 0, 0],
       [2, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 2, 0],
       [0, 0, 0, 2, 0, 0],
       [0, 0, 0, 0, 0, 2]])

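You'll have to decide whether to specify the shape for M1 explicitly or let coo_matrix deduce it from the largest values in row and col. A deduced shape will be too small if the chunk's last rows or columns happen to be empty.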
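To illustrate the difference between an explicit and a deduced shape, here is a minimal sketch. The small arrays are made up for the example: they stand for a chunk that should be 5 rows by 6 columns but has entries only in its first 3 rows and 4 columns.

```python
import numpy as np
from scipy import sparse

# Hypothetical chunk-local coordinates (already shifted to start at 0).
row = np.array([0, 1, 2])
col = np.array([3, 0, 1])
data = np.array([2.0, 2.0, 2.0], dtype=np.float32)

# Explicit shape: the dense result always has the full chunk size.
explicit = sparse.coo_matrix((data, (row, col)), shape=(5, 6))

# Deduced shape: taken from the largest row and column index plus one,
# so trailing empty rows and columns are silently dropped.
inferred = sparse.coo_matrix((data, (row, col)))

explicit_shape = explicit.shape
inferred_shape = inferred.shape
```

For a raster that has to line up with its neighbouring chunks, passing the shape explicitly is probably the safer choice.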

If the coordinates are not sorted, or you also want to take a subrange of col, things get more complicated, because a plain slice such as data[5:] no longer selects the right entries. But I think this captures the basic idea.
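One possible way to handle unsorted coordinates and a column subrange is to select entries with a boolean mask instead of a slice. This is only a sketch: the helper name coo_window and the sample arrays are invented for illustration, and it assumes the triplets for the part of the file being processed are already loaded as NumPy arrays.

```python
import numpy as np
from scipy import sparse


def coo_window(row, col, data, r0, r1, c0, c1):
    """Return a dense array for rows [r0, r1) and columns [c0, c1)."""
    row = np.asarray(row)
    col = np.asarray(col)
    data = np.asarray(data)
    # Keep only the entries that fall inside the window; order does not matter.
    keep = (row >= r0) & (row < r1) & (col >= c0) & (col < c1)
    # Shift the indices so the window's top-left corner becomes (0, 0).
    sub = sparse.coo_matrix(
        (data[keep], (row[keep] - r0, col[keep] - c0)),
        shape=(r1 - r0, c1 - c0),
    )
    return sub.toarray()


# Unsorted sample triplets (made up for the example).
row = np.array([9, 0, 5, 7, 6])
col = np.array([5, 2, 1, 4, 0])
data = np.array([1, 2, 3, 4, 5], dtype=np.float32)

# Rows 5..9 and columns 0..4 of the full matrix.
window = coo_window(row, col, data, 5, 10, 0, 5)
```

Note that toarray() sums entries that share the same coordinates, so duplicate triplets in the file would be added together in the dense window.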