1
votes

I have two arrays A and B. In NumPy you can use A as an index to B e.g.

A = np.array([[1,2,3,1,7,3,1,2,3],[4,5,6,4,5,6,4,5,6],[7,8,9,7,8,9,7,8,9]])
B= np.array([1,2,3,4,5,6,7,8,9,0])
c = B[A]

Which produces:

[[2 3 4 2 8 4 2 3 4] [5 6 7 5 6 7 5 6 7] [8 9 0 8 9 0 8 9 0]]

However, in my case the arrays A and B are SciPy CSR sparse arrays and they don't seem to support indexing.

A_sparse = sparse.csr_matrix(A)
B_sparse = sparse.csr_matrix(B)
c = B_sparse[A_sparse]

This results in:

IndexError: Indexing with sparse matrices is not supported except boolean indexing where matrix and index are equal shapes.

I've come up with the function below to replicate NumPy's behavior with the sparse arrays:

 def index_sparse(A,B):       
        A_sparse = scipy.sparse.coo_matrix(A)
        B_sparse = sparse.csr_matrix(B)
        res = sparse.csr_matrix(A_sparse)
        for i,j,v in zip(A_sparse.row, A_sparse.col, A_sparse.data):
            res[i,j] = B_sparse[0, v]
        return res

res = index_sparse(A, B)
print res.todense()

Looping over the array and having to create a new array in Python isn't ideal. Is there a better way of doing this using built-in functions from SciPy/ NumPy?

1

1 Answers

1
votes

Sparse indexing is less developed. coo format for example doesn't implement it at all.

I haven't tried to implement this problem, though I have answered others that involve working with the sparse format attributes. So I'll just make some general observations.

B_sparse is a matrix, so its shape is (1,10). So the equivalent to B[A] is

In [294]: B_sparse[0,A]
Out[294]: 
<3x9 sparse matrix of type '<class 'numpy.int32'>'
    with 24 stored elements in Compressed Sparse Row format>
In [295]: _.A
Out[295]: 
array([[2, 3, 4, 2, 8, 4, 2, 3, 4],
       [5, 6, 7, 5, 6, 7, 5, 6, 7],
       [8, 9, 0, 8, 9, 0, 8, 9, 0]], dtype=int32)

B_sparse[A,:] or B_sparse[:,A] gives a 3d warning, since it would be trying to create a matrix version of:

In [298]: B[None,:][:,A]
Out[298]: 
array([[[2, 3, 4, 2, 8, 4, 2, 3, 4],
        [5, 6, 7, 5, 6, 7, 5, 6, 7],
        [8, 9, 0, 8, 9, 0, 8, 9, 0]]])

As to your function:

A_sparse.nonzero() does A_sparse.tocoo() and returns its row and col. Effectively the same as what you do.

Here's something that should be faster, though I haven't tested it enough to be sure it is robust:

In [342]: Ac=A_sparse.tocoo()
In [343]: res=Ac.copy()
In [344]: res.data[:]=B_sparse[0, Ac.data].A[0]
In [345]: res
Out[345]: 
<3x9 sparse matrix of type '<class 'numpy.int32'>'
    with 27 stored elements in COOrdinate format>
In [346]: res.A
Out[346]: 
array([[2, 3, 4, 2, 8, 4, 2, 3, 4],
       [5, 6, 7, 5, 6, 7, 5, 6, 7],
       [8, 9, 0, 8, 9, 0, 8, 9, 0]], dtype=int32)

In this example there are 2 zeros that could cleaned up as well (look at res.nonzero()).

Since you are setting each res[i,j] with values from Ac.row and Ac.col, res has the same row,col values as Ac, so I initialize it as a copy. Then it's just a matter of updating the res.data attribute. It would be faster to index Bc.data directly, but that doesn't account for its sparsity.