2
votes

I have many lists that represent a sparse matrix (i.e., the columns that have nonzero entries) that I need to represent as a SciPy sparse csc_matrix. However, note that there is only one row in my sparse matrix and so the list simply points to the columns within this row that has nonzero entries. For example:

sparse_input = [4, 10, 21]  # My lists are much, much longer but very sparse

This list tells me which columns within my single row sparse matrix where there are nonzero values. This is what the dense matrix would look like.

x = np.array([[0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1]])

I could use the (data, (row, col)) syntax but since my lists are super long the csc_matrix takes a lot of time and memory to build. So, I was thinking about using the indptr interface but I'm having trouble figuring out how to quickly and automatically build the indptr directly from a given sparse list of nonzero column entries. I tried looking at csr_matrix(x).indptr and I see that the indptr looks like:

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       3], dtype=int32)

I've read the SciPy docs and the Sparse Matrix Wikipedia page but I can't seem to come up with an efficient method to construct the indptr directly from a list of nonzero columns. It just feels like indptr shouldn't be this long in length considering that there are only three nonzero entries in the sparse matrix.

1

1 Answers

0
votes

How about just making the matrices, and exploring their attributs?

In [144]: from scipy import sparse                                                             
In [145]: x = np.array([[0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1]])                        

In [146]: M = sparse.coo_matrix(x)                                                             
In [147]: M                                                                                    
Out[147]: 
<1x22 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in COOrdinate format>
In [148]: M.row                                                                                
Out[148]: array([0, 0, 0], dtype=int32)
In [149]: M.col                                                                                
Out[149]: array([ 4, 10, 21], dtype=int32)
In [150]: M.data                                                                               
Out[150]: array([1, 1, 1])

csr:

In [152]: Mr = M.tocsr()                                                                       
In [153]: Mr.indptr                                                                            
Out[153]: array([0, 3], dtype=int32)
In [155]: Mr.indices                                                                           
Out[155]: array([ 4, 10, 21], dtype=int32)
In [156]: Mr.data                                                                              
Out[156]: array([1, 1, 1], dtype=int64)

csc:

In [157]: Mc = M.tocsc()                                                                       
In [158]: Mc.indptr                                                                            
Out[158]: 
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       3], dtype=int32)
In [159]: Mc.indices                                                                           
Out[159]: array([0, 0, 0], dtype=int32)
In [160]: Mc.data                                                                              
Out[160]: array([1, 1, 1], dtype=int64)

And the direct nonzero on x:

In [161]: np.nonzero(x)                                                                        
Out[161]: (array([0, 0, 0]), array([ 4, 10, 21]))

For a 1 row matrix like this, I doubt if you'll save much time by creating the csr indptr directly. Most of the work will be in the nonzero step. But feel free to experiement.

===

Some timings

In [162]: timeit sparse.coo_matrix(x)                                                          
95.8 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [163]: timeit sparse.csr_matrix(x)                                                          
335 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [164]: timeit M.tocsr()                                                                     
115 µs ± 948 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [165]: timeit M.tocsc()                                                                     
117 µs ± 90.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [166]: sparse.csr_matrix?                                                                   
In [167]: timeit M.tocsc()                                                                     
117 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [168]: timeit sparse.csc_matrix(x)                                                          
335 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [169]: timeit sparse.coo_matrix(x).tocsr()                                                  
219 µs ± 3.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I'm a little surprised that the csr_matrix is slower than coo followed by conversion.

Now let's try to make the matrix with indptr etc.

In [170]: timeit np.nonzero(x)                                                                 
2.52 µs ± 65.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [173]: timeit sparse.csr_matrix((Mr.data, Mr.indices, Mr.indptr))                           
92.5 µs ± 79.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [174]: %%timeit 
     ...: indices = np.nonzero(x)[1] 
     ...: data = np.ones_like(indices) 
     ...: indptr = np.array([0,len(indices)]) 
     ...: sparse.csr_matrix((data, indices, indptr)) 
     ...:  
     ...:                                                                                      
161 µs ± 605 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)