
I have a bunch of data in SciPy compressed sparse row (CSR) format. Of course the majority of elements is zero, and I further know that all non-zero elements have a value of 1. I want to compute sums over different subsets of rows of my matrix. At the moment I am doing the following:

import numpy as np
import scipy as sp
import scipy.sparse

# create some data with sparsely distributed ones
data = np.random.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05))
data = sp.sparse.csr_matrix(data, dtype='int8')

# generate column-wise sums over random subsets of rows
nrand = 1000
for k in range(nrand):
    inds = np.random.choice(data.shape[0], size=100, replace=False)

    # 60% of time is spent here
    extracted_rows = data[inds]

    # 20% of time is spent here
    row_sum = extracted_rows.sum(axis=0)

The last few lines there are the bottleneck in a larger computational pipeline. As I annotated in the code, 60% of time is spent slicing the data from the random indices, and 20% is spent computing the actual sum.

It seems to me I should be able to use my knowledge about the data in the array (i.e., any non-zero value in the sparse matrix will be 1; no other values present) to compute these sums more efficiently. Unfortunately, I cannot figure out how. Dealing with just data.indices perhaps? I have tried other sparsity structures (e.g. CSC matrix), as well as converting to dense array first, but these approaches were all slower than this CSR matrix approach.
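For what it's worth, the "dealing with just data.indices" idea can be sketched directly: since every stored value is 1, the column-wise sum over a subset of rows is just a histogram of the column indices stored for those rows (a sketch, using the CSR `indptr`/`indices` attributes; variable names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

# reproduce the setup from the question
rng = np.random.default_rng(0)
dense = rng.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05))
data = sp.csr_matrix(dense, dtype='int8')

inds = rng.choice(data.shape[0], size=100, replace=False)

# row i's nonzero column indices live in data.indices[data.indptr[i]:data.indptr[i+1]];
# since all stored values are 1, counting those indices gives the column sums
cols = np.concatenate([data.indices[data.indptr[i]:data.indptr[i + 1]]
                       for i in inds])
row_sum = np.bincount(cols, minlength=data.shape[1])

# agrees with the straightforward slice-and-sum
assert np.array_equal(row_sum, np.asarray(data[inds].sum(axis=0)).ravel())
```

Whether this beats the plain `data[inds].sum(axis=0)` depends on the density and the size of the row subset, so it is worth timing on real data.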

Two functions that may be useful to you here are rows, cols = extracted_rows.nonzero(), which gives you the indices of the nonzero components, and possibly np.count_nonzero(), which counts the nonzero entries in a NumPy array. – benbo

2 Answers


It is well known that indexing of sparse matrices is relatively slow, and there have been SO questions about getting around that by accessing the data attributes directly.

But first some timings. Using data and inds as you show, I get:

In [23]: datad=data.A  # the dense conversion itself takes ~3.76 ms per loop

In [24]: timeit row_sumd=datad[inds].sum(axis=0)
1000 loops, best of 3: 529 µs per loop

In [25]: timeit row_sum=data[inds].sum(axis=0)
1000 loops, best of 3: 890 µs per loop

In [26]: timeit d=datad[inds]
10000 loops, best of 3: 55.9 µs per loop

In [27]: timeit d=data[inds]
1000 loops, best of 3: 617 µs per loop

The sparse version is slower than the dense one, but not by a lot. The sparse indexing is much slower, but its sum is somewhat faster.

The sparse sum is done with a matrix product:

def sparse.spmatrix.sum
     ....
    return np.asmatrix(np.ones((1, m), dtype=res_dtype)) * self

That suggests a faster way: turn inds into an appropriate array of 1s and multiply.

In [49]: %%timeit
   ....: b=np.zeros((1,data.shape[0]),'int8')
   ....: b[:,inds]=1
   ....: rowmul=b*data
   ....: 
1000 loops, best of 3: 587 µs per loop

That makes the sparse operation about as fast as the equivalent dense one (while converting to dense first remains much slower).
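A self-contained version of the indicator-vector trick, verifying it against the plain slice-and-sum (the setup mirrors the question; names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
data = sp.csr_matrix(rng.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05)),
                     dtype='int8')
inds = rng.choice(data.shape[0], size=100, replace=False)

# indicator row vector: 1 at the selected rows, 0 elsewhere
# (int32 rather than int8 just to rule out overflow for larger subsets)
b = np.zeros((1, data.shape[0]), dtype=np.int32)
b[:, inds] = 1

# b * data is a (1, nrows) x (nrows, ncols) matrix product: it sums
# exactly the selected rows of data, delegating to the sparse product
rowmul = b * data

assert np.array_equal(np.asarray(rowmul),
                      np.asarray(data[inds].sum(axis=0)))
```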

==================

The last time test is missing the np.asmatrix that is present in the sparse sum. But the times are similar, and the results are the same:

In [232]: timeit b=np.zeros((1,data.shape[0]),'int8'); b[:,inds]=1; x1=np.asmatrix(b)*data
1000 loops, best of 3: 661 µs per loop

In [233]: timeit b=np.zeros((1,data.shape[0]),'int8'); b[:,inds]=1; x2=b*data
1000 loops, best of 3: 605 µs per loop

One produces a matrix, the other an array. But both are doing a matrix product, the 2nd dim of b against the 1st of data. Even though b is an array, the task is actually delegated to data and its matrix product - in a not so transparent way.

In [234]: x1
Out[234]: matrix([[9, 9, 5, ..., 9, 5, 3]], dtype=int8)

In [235]: x2
Out[235]: array([[9, 9, 5, ..., 9, 5, 3]], dtype=int8)

b*data.A is element multiplication and raises an error; np.dot(b,data.A) works but is slower.

Newer numpy/python has a matmul operator. I see the same time pattern:

In [280]: timeit b@datad           # dense product
100 loops, best of 3: 2.64 ms per loop

In [281]: timeit b@data.A          # slower due to `.A` conversion
100 loops, best of 3: 6.44 ms per loop

In [282]: timeit b@data             # sparse product
1000 loops, best of 3: 571 µs per loop

np.dot may also delegate action to sparse, though you have to be careful. I just hung my machine with np.dot(csr_matrix(b),data.A).


Here's a vectorized approach after converting data to a dense array, also getting all those inds in a vectorized manner with an argpartition-based method -

# Number of selections as a parameter
n = 100

# Get inds across all iterations in a vectorized manner as a 2D array.
inds2D = np.random.rand(nrand,data.shape[0]).argpartition(n)[:,:n]

# Index into data with those 2D array indices. Then, convert to dense NumPy array, 
# reshape and sum reduce to get the final output 
out = np.array(data.todense())[inds2D.ravel()].reshape(nrand,n,-1).sum(1)
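A quick sanity check, on toy sizes, that the argpartition trick really does draw n distinct row indices per iteration, i.e. it samples without replacement just like np.random.choice(..., replace=False):

```python
import numpy as np

nrand, nrows, n = 5, 10, 4

# each row of random keys is argpartitioned so that its first n positions
# hold the indices of the n smallest keys: n distinct, uniformly chosen indices
inds2D = np.random.rand(nrand, nrows).argpartition(n)[:, :n]

assert inds2D.shape == (nrand, n)
assert all(len(set(row)) == n for row in inds2D)  # no repeats within a row
```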

Runtime test -

1) Function definitions :

def org_app(nrand,n):
    out = np.zeros((nrand,data.shape[1]),dtype=int)
    for k in range(nrand):
        inds = np.random.choice(data.shape[0], size=n, replace=False)        
        extracted_rows = data[inds]    
        out[k] = extracted_rows.sum(axis=0)    
    return out


def vectorized_app(nrand,n):
    inds2D = np.random.rand(nrand,data.shape[0]).argpartition(n)[:,:n]
    return np.array(data.todense())[inds2D.ravel()].reshape(nrand,n,-1).sum(1)

Timings :

In [205]: # create some data with sparsely distributed ones
     ...: data = np.random.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05))
     ...: data = sp.sparse.csr_matrix(data, dtype='int8')
     ...: 
     ...: # generate column-wise sums over random subsets of rows
     ...: nrand = 1000
     ...: n = 100
     ...: 

In [206]: %timeit org_app(nrand,n)
1 loops, best of 3: 1.38 s per loop

In [207]: %timeit vectorized_app(nrand,n)
1 loops, best of 3: 826 ms per loop