Have you looked at the underlying data representation for this, and other sparse formats?
For example, for small matrix
In [257]: M = sparse.rand(10,10,.1,format='csr')
In [258]: M
Out[258]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
In [259]: M.data
Out[259]:
array([ 0.86390256, 0.85244302, 0.88549326, 0.78737361, 0.99918561,
0.89862529, 0.86842524, 0.25714778, 0.4174032 , 0.33137501])
In [260]: M.indices
Out[260]: array([1, 5, 8, 8, 9, 0, 3, 9, 4, 5], dtype=int32)
In [261]: M.indptr
Out[261]: array([ 0, 1, 1, 3, 4, 4, 5, 5, 7, 8, 10], dtype=int32)
For csr
the indices are a bit obscure. Or rather, the column index for each nonzero value is present in M.indices
, but it takes a bit of calculation to determine which ones belong to which row.
For other formats the connection is more obvious.
For lil
we have 2 lists of lists
In [262]: Ml=M.tolil()
In [263]: Ml.data
Out[263]:
array([[0.863902562935336], [], [0.8524430195076207, 0.8854932609233054],
[0.7873736126927198], [], [0.9991856090158101], [],
[0.8986252926235274, 0.8684252408594123], [0.2571477751356357],
[0.4174032029993796, 0.3313750148434619]], dtype=object)
In [264]: Ml.rows
Out[264]: array([[1], [], [5, 8], [8], [], [9], [], [0, 3], [9], [4, 5]], dtype=object)
In [265]: [np.random.choice(x) for x in Ml.rows if x]
# some rows might not have any nonzero
Out[265]: [1, 5, 8, 9, 3, 9, 5]
In [268]: [np.random.choice(x.nonzero()[1]) for x in M if len(x.nonzero()[1])]
Out[268]: [1, 5, 8, 9, 0, 9, 4]
You can also take nonzero
for the whole matrix
In [274]: M.nonzero()
Out[274]:
(array([0, 2, 2, 3, 5, 7, 7, 8, 9, 9], dtype=int32),
array([1, 5, 8, 8, 9, 0, 3, 9, 4, 5], dtype=int32))
These are the same arrays that you would get with M.tocoo()
and looking at the row
and col
attributes. In theory you could use groupby
to get sublists of columns, and pick from those. But again you have lists or generators and iteration.
I don't know if anyone of these representations is faster or not.
There may be some limits to vectorizing the problem. The number of nonzeros (the inputs to choices
) will differ by row. Some rows have non, others 1 or more. Anytime you encounter arrays or lists of different lengths, it is difficult to vectorize the operation. If you can't arrange the values into a regular 2d array you can't operate on them as a whole with array operations.
The lil
format is worth looking at:
In [276]: timeit [np.random.choice(x.nonzero()[1]) for x in M if len(x.nonzero()[1])]
100 loops, best of 3: 4.24 ms per loop
In [289]: timeit [np.random.choice(row.indices) for row in M if len(row.indices)]
1000 loops, best of 3: 1.52 ms per loop
# 3x speedup using row.indices
In [277]: %%timeit
.....: Ml=M.tolil()
.....: [np.random.choice(x) for x in Ml.rows if x]
.....:
10000 loops, best of 3: 181 µs per loop
i,np.random.choice(row.nonzero()[1]: tuple index out of range
– farhawanp.zeros((sparse_matrix.shape[0]))
and that will give a 1d array? – farhawarow
is a row fromsparse_matrix
not fromrandom_columns
. – wxssparse_matrix
? – farhawa