3 votes

To manipulate SciPy sparse matrices you would typically use the built-in methods, but sometimes you need to read the matrix data out into a non-sparse data type. For the sake of demonstration I created a random LIL sparse matrix and converted it to a NumPy array (pure Python data types would have made better sense here!) using different methods.

from __future__ import print_function
from scipy.sparse import rand
import numpy as np

dim = 1000
lil = rand(dim, dim, density=0.01, format='lil', dtype=np.float32, random_state=0)
print('number of nonzero elements:', lil.nnz)
arr = np.zeros(shape=(dim,dim), dtype=float)

number of nonzero elements: 10000

Reading by indexing

%%timeit -n3
for i in range(dim):
    for j in range(dim):
        arr[i,j] = lil[i,j]

3 loops, best of 3: 6.42 s per loop

Using the nonzero() method

%%timeit -n3
nnz = lil.nonzero() # indices of nonzero values
for i, j in zip(nnz[0], nnz[1]):
    arr[i,j] = lil[i,j]

3 loops, best of 3: 75.8 ms per loop

Using the built-in method to convert directly to array

This one does not generalize to reading the matrix data into other containers, so it does not really count as a solution.

%timeit -n3 arr = lil.toarray()

3 loops, best of 3: 7.85 ms per loop

Reading SciPy sparse matrices with these methods is not efficient at all. Is there a faster way to read data out of these matrices?

What's the time for iterating through all the values of the dense array, arr? That's not fast either. In general, indexing a sparse matrix is slower than indexing a dense array. If you really need speed, work directly with the data attributes of the matrix, at the expense of generality. – hpaulj
56.4 ms. I actually learned about the raw data a couple of months ago and used it in the code at github.com/aahoo/dbscan/blob/master/dbscan.py. I just wanted others to know about it as well. Thanks for your great answer, by the way. While there were similar questions on SO, none had a general title and an organized answer; that's why I couldn't find them back then and had to learn it the hard way. – Thoran

2 Answers

2 votes

A similar question, but dealing with setting sparse values rather than just reading them:

Efficient incremental sparse matrix in python / scipy / numpy

More on accessing values using the underlying representation

Efficiently select random non-zero column from each row of sparse matrix in scipy

Also

why is row indexing of scipy csr matrices slower compared to numpy arrays

Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?

Take a look at what M.nonzero does:

    A = self.tocoo()
    nz_mask = A.data != 0
    return (A.row[nz_mask],A.col[nz_mask])

It converts the matrix to COO format and returns the .row and .col attributes, after filtering out any 'stray' 0s in the .data attribute.
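That equivalence can be sanity-checked on a small matrix (a sketch with a made-up matrix; rand produces strictly positive values here, so the mask filters nothing out):

```python
import numpy as np
from scipy.sparse import rand

# verify that nonzero() returns exactly the (masked) COO row/col attributes
M = rand(6, 6, density=0.3, format='lil', random_state=0)
rows, cols = M.nonzero()

A = M.tocoo()
mask = A.data != 0  # drops explicit zeros, if any were stored
assert np.array_equal(rows, A.row[mask])
assert np.array_equal(cols, A.col[mask])
```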

So you could skip the middle man and use those attributes directly:

A = lil.tocoo()
for i, j, d in zip(A.row, A.col, A.data):
    arr[i,j] = d

This is almost as fast as toarray():

In [595]: %%timeit
   .....: aa = M.tocoo()
   .....: for i,j,d in zip(aa.row,aa.col,aa.data):
   .....:   A[i,j]=d
   .....: 
100 loops, best of 3: 14.3 ms per loop

In [596]: timeit  arr=M.toarray()
100 loops, best of 3: 12.3 ms per loop

But if your target really is an array, you don't need to iterate at all (use np.zeros rather than np.empty if the entries missing from the sparse matrix must come out as 0):

In [603]: %%timeit
   .....: A=np.empty(M.shape,M.dtype)
   .....: aa=M.tocoo()
   .....: A[aa.row,aa.col]=aa.data
   .....: 
100 loops, best of 3: 8.22 ms per loop

My times for @Thoran's two methods are:

100 loops, best of 3: 5.81 ms per loop
100 loops, best of 3: 17.9 ms per loop

Same ballpark of times.

1 vote

Try reading the raw data. SciPy sparse matrices are stored in NumPy ndarrays, with a different layout for each format.
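As a quick illustration of those per-format layouts, here is a tiny made-up LIL matrix (a sketch): lil.rows[i] is a Python list of the column indices present in row i, and lil.data[i] holds the matching values.

```python
import numpy as np
from scipy.sparse import lil_matrix

# tiny LIL matrix: rows[i] = column indices of row i, data[i] = their values
m = lil_matrix((3, 4), dtype=np.float32)
m[0, 1] = 5.0
m[2, 0] = 7.0
m[2, 3] = 9.0

assert list(m.rows[0]) == [1]        # row 0 has one entry, in column 1
assert list(m.rows[2]) == [0, 3]     # row 2 has entries in columns 0 and 3
assert list(m.data[2]) == [7.0, 9.0]
```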

Reading the raw data of LIL sparse matrix

%%timeit -n3
for i, (row, data) in enumerate(zip(lil.rows, lil.data)):
    for j, val in zip(row, data):
        arr[i,j] = val

3 loops, best of 3: 4.61 ms per loop

Reading the raw data of CSR sparse matrix

For a CSR matrix it is a bit less pythonic to read from the raw data, but it is worth it for the speed.
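To make the CSR bookkeeping concrete, a tiny made-up example (a sketch): row i's column indices live in csr.indices[csr.indptr[i]:csr.indptr[i+1]], and its values in the same slice of csr.data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# row i occupies indices[indptr[i]:indptr[i+1]] and the same slice of data
m = csr_matrix(np.array([[0., 5., 0.],
                         [0., 0., 0.],
                         [7., 0., 9.]], dtype=np.float32))

assert m.indptr.tolist() == [0, 1, 1, 3]  # row 1 is empty: equal bounds
assert m.indices.tolist() == [1, 0, 2]
assert m.data.tolist() == [5.0, 7.0, 9.0]
```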

csr = lil.tocsr()

%%timeit -n3
start = 0
for i, end in enumerate(csr.indptr[1:]):
    for j, val in zip(csr.indices[start:end], csr.data[start:end]):
        arr[i,j] = val
    start = end

3 loops, best of 3: 8.14 ms per loop

A similar approach is used in this DBSCAN implementation.

Reading the raw data of COO sparse matrix

coo = lil.tocoo()

%%timeit -n3
for i,j,d in zip(coo.row, coo.col, coo.data):
    arr[i,j] = d

3 loops, best of 3: 5.97 ms per loop

Based on these limited tests:

  • COO matrix: cleanest
  • LIL matrix: fastest
  • CSR matrix: slowest and ugliest. The only good side is that conversion to/from CSR is extremely fast.
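As a cross-check (a sketch, using a smaller matrix than in the question), the raw-data loops should reproduce exactly what toarray() gives:

```python
import numpy as np
from scipy.sparse import rand

dim = 100
lil = rand(dim, dim, density=0.05, format='lil', dtype=np.float32, random_state=0)
ref = lil.toarray()

# LIL raw data: paired lists of column indices and values per row
arr = np.zeros((dim, dim), dtype=np.float32)
for i, (row, data) in enumerate(zip(lil.rows, lil.data)):
    for j, val in zip(row, data):
        arr[i, j] = val
assert np.array_equal(arr, ref)

# COO raw data, assigned in one vectorized step
coo = lil.tocoo()
arr2 = np.zeros((dim, dim), dtype=np.float32)
arr2[coo.row, coo.col] = coo.data
assert np.array_equal(arr2, ref)
```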

Edit: following @hpaulj's answer, I added the COO matrix method so that all the methods are in one place.