0 votes

I'm trying to efficiently select a random non-zero column index for each row of a large sparse SciPy matrix. I can't seem to figure out a vectorized way of doing it, so I'm resorting to a very slow Python loop:

import numpy as np

random_columns = np.zeros(sparse_matrix.shape[0], dtype=int)
for i, row in enumerate(sparse_matrix):
    random_columns[i] = np.random.choice(row.nonzero()[1])

My matrix is an approximately (4000000, 800) csr_matrix with almost every row having only one non-zero value, so the Python loop is killing performance. There has to be a better way!

EDIT: I can make it about 2x faster by directly accessing the underlying data of the csr_matrix:

random_columns[i] = row.indices[np.random.choice(len(row.data))]
Your code throws "tuple index out of range" at np.random.choice(row.nonzero()[1]). – farhawa
Um, well, sparse_matrix needs to be 2-dimensional. This code works for me… – wxs
Yeah, but you are using the first dimension to define np.zeros((sparse_matrix.shape[0])), and that will give a 1d array? – farhawa
row is a row from sparse_matrix, not from random_columns. – wxs
What is the shape of sparse_matrix? – farhawa

1 Answer

2 votes

Have you looked at the underlying data representation for this and other sparse formats?

For example, for a small matrix:

In [257]: M = sparse.rand(10,10,.1,format='csr')

In [258]: M
Out[258]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

In [259]: M.data
Out[259]: 
array([ 0.86390256,  0.85244302,  0.88549326,  0.78737361,  0.99918561,
        0.89862529,  0.86842524,  0.25714778,  0.4174032 ,  0.33137501])

In [260]: M.indices
Out[260]: array([1, 5, 8, 8, 9, 0, 3, 9, 4, 5], dtype=int32)

In [261]: M.indptr
Out[261]: array([ 0,  1,  1,  3,  4,  4,  5,  5,  7,  8, 10], dtype=int32)

For csr the row structure is a bit obscure. The column index of each nonzero value is present in M.indices, but it takes a bit of calculation with M.indptr to determine which ones belong to which row.
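To make that calculation concrete: M.indptr[i]:M.indptr[i+1] delimits row i's slice of M.indices. A minimal sketch (the variable names are mine, not from the post):

```python
import numpy as np
from scipy import sparse

M = sparse.rand(10, 10, 0.1, format='csr', random_state=0)

# for each row i, its nonzero column indices are this slice of M.indices
row_cols = [M.indices[M.indptr[i]:M.indptr[i + 1]] for i in range(M.shape[0])]
```

Each entry of row_cols matches M.getrow(i).nonzero()[1], so np.random.choice could be applied per slice without calling nonzero() on every row.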

For other formats the connection is more obvious.

For lil we have two lists of lists:

In [262]: Ml=M.tolil()

In [263]: Ml.data
Out[263]: 
array([[0.863902562935336], [], [0.8524430195076207, 0.8854932609233054],
       [0.7873736126927198], [], [0.9991856090158101], [],
       [0.8986252926235274, 0.8684252408594123], [0.2571477751356357],
       [0.4174032029993796, 0.3313750148434619]], dtype=object)

In [264]: Ml.rows
Out[264]: array([[1], [], [5, 8], [8], [], [9], [], [0, 3], [9], [4, 5]], dtype=object)

In [265]: [np.random.choice(x) for x in Ml.rows if x]  # skip rows with no nonzeros
Out[265]: [1, 5, 8, 9, 3, 9, 5]

In [268]: [np.random.choice(x.nonzero()[1]) for x in M if len(x.nonzero()[1])]
Out[268]: [1, 5, 8, 9, 0, 9, 4]

You can also take nonzero for the whole matrix:

In [274]: M.nonzero()
Out[274]: 
(array([0, 2, 2, 3, 5, 7, 7, 8, 9, 9], dtype=int32),
 array([1, 5, 8, 8, 9, 0, 3, 9, 4, 5], dtype=int32))

These are the same arrays that you would get with M.tocoo() and looking at the row and col attributes. In theory you could use groupby to get sublists of columns, and pick from those. But again you have lists or generators and iteration.
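Since M.nonzero() returns those pairs in row-major order for csr, the groupby idea can be sketched with itertools (a sketch of the idea, not code from the answer):

```python
import numpy as np
from itertools import groupby
from scipy import sparse

M = sparse.rand(10, 10, 0.1, format='csr', random_state=0)
rows, cols = M.nonzero()          # row-major order for csr

# group the column indices by row, then pick one at random per row
picks = {r: np.random.choice([c for _, c in grp])
         for r, grp in groupby(zip(rows, cols), key=lambda rc: rc[0])}
```

As noted, though, this still iterates over Python-level groups, so it is unlikely to beat the lil loop below.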

I don't know which, if any, of these representations is fastest.

There may be some limits to vectorizing the problem. The number of nonzeros (the input to choice) differs by row: some rows have none, others one or more. Any time you encounter arrays or lists of different lengths, it is difficult to vectorize the operation. If you can't arrange the values into a regular 2d array, you can't operate on them as a whole with array operations.
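That said, csr's indptr offers one vectorized escape hatch for this particular task: draw a random offset into each row's slice of indices in a single array operation. A sketch (my addition, not from the answer), assuming a sentinel of -1 is acceptable for all-zero rows:

```python
import numpy as np
from scipy import sparse

M = sparse.rand(10, 10, 0.3, format='csr', random_state=1)

counts = np.diff(M.indptr)                  # nonzeros per row
has_nz = counts > 0
# a random offset in [0, counts[i]) for every row, all at once
offsets = (np.random.rand(M.shape[0]) * counts).astype(int)
random_columns = np.full(M.shape[0], -1)    # -1 marks all-zero rows
random_columns[has_nz] = M.indices[(M.indptr[:-1] + offsets)[has_nz]]
```

The out-of-range indices that empty trailing rows would produce are masked out by has_nz before M.indices is ever indexed.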


The lil format is worth looking at:

In [276]: timeit [np.random.choice(x.nonzero()[1]) for x in M if len(x.nonzero()[1])]
100 loops, best of 3: 4.24 ms per loop

In [289]: timeit [np.random.choice(row.indices) for row in M if len(row.indices)]
1000 loops, best of 3: 1.52 ms per loop
# 3x speedup using row.indices

In [277]: %%timeit
   .....: Ml=M.tolil()
   .....: [np.random.choice(x) for x in Ml.rows if x]
   .....: 
10000 loops, best of 3: 181 µs per loop