12
votes

I have two scipy_sparse_csr_matrix 'a' and scipy_sparse_csr_matrix(boolean) 'mask', and I want to set elements of 'a' to zero where element of mask is True.

for example

>>>a
<3x3 sparse matrix of type '<type 'numpy.int32'>'
    with 4 stored elements in Compressed Sparse Row format>
>>>a.todense()
matrix([[0, 0, 3],
        [0, 1, 5],
        [7, 0, 0]])

>>>mask
<3x3 sparse matrix of type '<type 'numpy.bool_'>'
    with 4 stored elements in Compressed Sparse Row format>
>>>mask.todense()
matrix([[ True, False,  True],
        [False, False,  True],
        [False,  True, False]], dtype=bool)

Then I want to obtain the following result.

>>>result
<3x3 sparse matrix of type '<type 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>
>>>result.todense()
matrix([[0, 0, 0],
        [0, 1, 0],
        [7, 0, 0]])

I can do it by operation like

result = a - a.multiply(mask)

or

a -= a.multiply(mask) #I don't care either in-place or copy.

But I think above operations are inefficient. Since actual shape of 'a' and 'mask' are 67,108,864 × 2,000,000, these operations take several seconds on high spec server(64 core Xeon cpu, 512GB memory). For example, 'a' has about 30,000,000 non-zero elements, and 'mask' has about 1,800,000 non-zero(True) elements, then above operation take about 2 seconds.

Is there more efficient way to do this?

Conditions are below.

  1. a.getnnz() != mask.getnnz()
  2. a.shape = mask.shape

Thanks!

Other way(tried)

a.data*=~np.array(mask[a.astype(np.bool)]).flatten();a.eliminate_zeros() #This takes twice the time longer than above method.
1
How do the nnz of a and mask compare? Besides not being the same. Are both equally sparse?hpaulj
Thank you for your answer, @hpaulj! I'm sorry to confuse you. The condition 1 only means 'The number of non-zero(false) elements of mask is different from the number of elements of a.'.hiroto1228

1 Answers

1
votes

My initial impression is that this multiply and subtract approach is a reasonable one. Quite often sparse code implements operations as some sort of multiplication, even if the dense equivalents use more direct methods. The sparse sum over rows or columns uses a matrix multiplication with the appropriate row or column matrix of 1s. Even row or column indexing uses matrix multiplication (at least on the csr format).

Sometimes we can improve on operations by working directly with the matrix attributes (data, indices, indptr). But that requires a lot more thought and experimentation.

For the dense arrays my first try would be

In [611]: a.A*~(mask.A)
Out[611]: 
array([[0, 0, 0],
       [0, 1, 0],
       [7, 0, 0]], dtype=int32)

But there isn't a direct way of doing not to a sparse matrix. If mask was indeed sparse, ~mask would not be. In your example mask has 4 True terms, and 5 False, so a dense version would work just as well:

In [612]: nmask=sparse.csr_matrix(~(mask.A))
In [615]: a.multiply(nmask)
Out[615]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>

CSR scipy matrix does not update after updating its values explores setting the diagonal of a sparse matrix to 0. It is possible to set values of the data attribute to 0, and then eliminate_zeros once at the end.

The other dense method is

In [618]: a1=a.A
In [619]: a1[mask.A]=0

This also works in sparse - sort of

In [622]: a2=a.copy()
In [624]: a2[mask]
Out[624]: matrix([[0, 3, 5, 0]], dtype=int32)
In [625]: a2[mask]=0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
In [626]: a2
Out[626]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 6 stored elements in Compressed Sparse Row format>

As noted in the previous question, we need to eliminate the zeros:

In [628]: a2.eliminate_zeros()
In [629]: a2
Out[629]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>

Taking a hint from the sparsity warning let's try the lil format

In [638]: al=a.tolil()
In [639]: al[mask]
Out[639]: 
<1x4 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in LInked List format>
In [640]: al[mask]=0
In [641]: al
Out[641]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in LInked List format>

It's interesting that al[mask] is still sparse, where as a[mask] is dense. The 2 formats use different indexing methods.

At some low level of sparsity, it might be worth iterating over the True (nonzero) elements of mask, setting the corresponding terms of a to zero directly.

I'm not going to guess as to the relative speeds of these methods. That needs to be tested on realistic data.