3 votes

To manipulate SciPy sparse matrices you would typically use the built-in methods, but sometimes you need to read the matrix data out into a non-sparse data type. For the sake of demonstration I created a random LIL sparse matrix and converted it to a NumPy array (pure Python data types would have made better sense here!) using different methods.

from __future__ import print_function
from scipy.sparse import rand
import numpy as np

dim = 1000
lil = rand(dim, dim, density=0.01, format='lil', dtype=np.float32, random_state=0)
print('number of nonzero elements:', lil.nnz)
arr = np.zeros(shape=(dim,dim), dtype=float)

number of nonzero elements: 10000

Reading by indexing

%%timeit -n3
for i in range(dim):
    for j in range(dim):
        arr[i,j] = lil[i,j]

3 loops, best of 3: 6.42 s per loop

Using the nonzero() method

%%timeit -n3
nnz = lil.nonzero() # indices of nonzero values
for i, j in zip(nnz[0], nnz[1]):
    arr[i,j] = lil[i,j]

3 loops, best of 3: 75.8 ms per loop

Using the built-in method to convert directly to array

This one does not generalize to reading the matrix data into other containers, so it does not really count as a solution.

%timeit -n3 arr = lil.toarray()

3 loops, best of 3: 7.85 ms per loop

Reading SciPy sparse matrices with these methods is not efficient at all. Is there a faster way to read data out of these matrices?

What's the time for iterating through all the values of the dense array, arr? That's not fast either. In general, indexing a sparse matrix is slower than indexing a dense array. If you really need speed, work directly with the data attributes of the matrix, at the expense of generality. – hpaulj
56.4 ms. I actually learned about the raw data a couple of months ago and used it in the code at github.com/aahoo/dbscan/blob/master/dbscan.py. I just wanted others to know about it as well. Thanks for your great answer, by the way. While there were similar questions on SO, none had a general title and an organized answer; that's why I couldn't find them back then and had to learn it the hard way. – Thoran

2 Answers

2 votes

A similar question, but dealing with setting sparse values rather than just reading them:

Efficient incremental sparse matrix in python / scipy / numpy

More on accessing values using the underlying representation

Efficiently select random non-zero column from each row of sparse matrix in scipy

Also

why is row indexing of scipy csr matrices slower compared to numpy arrays

Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?

Take a look at what M.nonzero does:

    A = self.tocoo()
    nz_mask = A.data != 0
    return (A.row[nz_mask],A.col[nz_mask])

It converts the matrix to COO format and returns the .row and .col attributes, after filtering out any 'stray' 0s in the .data attribute.
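That equivalence can be sanity-checked on a small matrix (a sketch with a made-up matrix; rand produces strictly positive values here, so the mask filters nothing out):

```python
import numpy as np
from scipy.sparse import rand

# verify that nonzero() returns exactly the (masked) COO row/col attributes
M = rand(6, 6, density=0.3, format='lil', random_state=0)
rows, cols = M.nonzero()

A = M.tocoo()
mask = A.data != 0  # drops explicit zeros, if any were stored
assert np.array_equal(rows, A.row[mask])
assert np.array_equal(cols, A.col[mask])
```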

So you could skip the middle man and use those attributes directly:

A = lil.tocoo()
for i, j, d in zip(A.row, A.col, A.data):
    arr[i,j] = d

This is almost as fast as toarray():

In [595]: %%timeit
   .....: aa = M.tocoo()
   .....: for i,j,d in zip(aa.row,aa.col,aa.data):
   .....:   A[i,j]=d
   .....: 
100 loops, best of 3: 14.3 ms per loop

In [596]: timeit  arr=M.toarray()
100 loops, best of 3: 12.3 ms per loop

But if your target really is an array, you don't need to iterate at all (use np.zeros rather than np.empty if the entries missing from the sparse matrix must come out as 0):

In [603]: %%timeit
   .....: A=np.empty(M.shape,M.dtype)
   .....: aa=M.tocoo()
   .....: A[aa.row,aa.col]=aa.data
   .....: 
100 loops, best of 3: 8.22 ms per loop

My times for @Thoran's two methods are:

100 loops, best of 3: 5.81 ms per loop
100 loops, best of 3: 17.9 ms per loop

Same ballpark of times.

1 vote

Try reading the raw data. SciPy sparse matrices are stored in NumPy ndarrays, with a different layout for each format.
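As a quick illustration of those per-format layouts, here is a tiny made-up LIL matrix (a sketch): lil.rows[i] is a Python list of the column indices present in row i, and lil.data[i] holds the matching values.

```python
import numpy as np
from scipy.sparse import lil_matrix

# tiny LIL matrix: rows[i] = column indices of row i, data[i] = their values
m = lil_matrix((3, 4), dtype=np.float32)
m[0, 1] = 5.0
m[2, 0] = 7.0
m[2, 3] = 9.0

assert list(m.rows[0]) == [1]        # row 0 has one entry, in column 1
assert list(m.rows[2]) == [0, 3]     # row 2 has entries in columns 0 and 3
assert list(m.data[2]) == [7.0, 9.0]
```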

Reading the raw data of LIL sparse matrix

%%timeit -n3
for i, (row, data) in enumerate(zip(lil.rows, lil.data)):
    for j, val in zip(row, data):
        arr[i,j] = val

3 loops, best of 3: 4.61 ms per loop

Reading the raw data of CSR sparse matrix

For a CSR matrix it is a bit less pythonic to read from the raw data, but it is worth it for the speed.
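To make the CSR bookkeeping concrete, a tiny made-up example (a sketch): row i's column indices live in csr.indices[csr.indptr[i]:csr.indptr[i+1]], and its values in the same slice of csr.data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# row i occupies indices[indptr[i]:indptr[i+1]] and the same slice of data
m = csr_matrix(np.array([[0., 5., 0.],
                         [0., 0., 0.],
                         [7., 0., 9.]], dtype=np.float32))

assert m.indptr.tolist() == [0, 1, 1, 3]  # row 1 is empty: equal bounds
assert m.indices.tolist() == [1, 0, 2]
assert m.data.tolist() == [5.0, 7.0, 9.0]
```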

csr = lil.tocsr()

%%timeit -n3
start = 0
for i, end in enumerate(csr.indptr[1:]):
    for j, val in zip(csr.indices[start:end], csr.data[start:end]):
        arr[i,j] = val
    start = end

3 loops, best of 3: 8.14 ms per loop

A similar approach is used in this DBSCAN implementation.

Reading the raw data of COO sparse matrix

coo = lil.tocoo()

%%timeit -n3
for i,j,d in zip(coo.row, coo.col, coo.data):
    arr[i,j] = d

3 loops, best of 3: 5.97 ms per loop

Based on these limited tests:

  • COO matrix: cleanest
  • LIL matrix: fastest
  • CSR matrix: slowest and ugliest. The only good side is that conversion to/from CSR is extremely fast.
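As a cross-check (a sketch, using a smaller matrix than in the question), the raw-data loops should reproduce exactly what toarray() gives:

```python
import numpy as np
from scipy.sparse import rand

dim = 100
lil = rand(dim, dim, density=0.05, format='lil', dtype=np.float32, random_state=0)
ref = lil.toarray()

# LIL raw data: paired lists of column indices and values per row
arr = np.zeros((dim, dim), dtype=np.float32)
for i, (row, data) in enumerate(zip(lil.rows, lil.data)):
    for j, val in zip(row, data):
        arr[i, j] = val
assert np.array_equal(arr, ref)

# COO raw data, assigned in one vectorized step
coo = lil.tocoo()
arr2 = np.zeros((dim, dim), dtype=np.float32)
arr2[coo.row, coo.col] = coo.data
assert np.array_equal(arr2, ref)
```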

Edit: following @hpaulj's answer, I added the COO matrix method so that all the methods are in one place.