Finding the top n values in a row of a scipy sparse matrix

Question

I have a scipy sparse matrix in CSR format. It is 72665x72665, so converting it to a dense matrix to operate on is impractical (the dense representation would take roughly 40 GB). The matrix is symmetric and has about 82 million non-zero entries (~1.5%).

For each row, I would like to get the indices of the n largest values. If this were a numpy array, I would use np.argpartition like so:

    for row in matrix:
        top_n_idx = np.argpartition(row,-n)[-n:]

Is there something similar to this I can do for a sparse matrix?

Similar questions from the past: Get top-n items of every row in a scipy sparse matrix, and Scipy.sparse.csr_matrix: How to get top ten values and indices? — hpaulj

Louis Yang · Accepted Answer · 2018-09-12T23:19:37

This improves on @Paul Panzer's solution: it now handles the case where a row has fewer than n stored values.

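    import numpy as np

    def top_n_idx_sparse(matrix, n):
        '''Return index of top n values in each row of a sparse matrix'''
        top_n_idx = []
        # indptr[i]:indptr[i+1] delimits row i's entries in matrix.data and matrix.indices
        for le, ri in zip(matrix.indptr[:-1], matrix.indptr[1:]):
            # a row may store fewer than n values, so cap the number picked
            n_row_pick = min(n, ri - le)
            top_n_idx.append(matrix.indices[le + np.argpartition(matrix.data[le:ri], -n_row_pick)[-n_row_pick:]])
        return top_n_idx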
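
As a quick sanity check, here is a minimal sketch that runs the same approach on a toy matrix. The 3x4 matrix and the explicit empty-row guard are my own additions for illustration (the guard is purely defensive), not part of the original answer.

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_n_idx_sparse(matrix, n):
    """Return, per row of a CSR matrix, the column indices of its n largest stored values."""
    top_n_idx = []
    for le, ri in zip(matrix.indptr[:-1], matrix.indptr[1:]):
        n_row_pick = min(n, ri - le)
        if n_row_pick == 0:
            # Defensive guard (an addition for this sketch): nothing stored in this row
            top_n_idx.append(np.array([], dtype=matrix.indices.dtype))
            continue
        picked = np.argpartition(matrix.data[le:ri], -n_row_pick)[-n_row_pick:]
        top_n_idx.append(matrix.indices[le + picked])
    return top_n_idx


# Toy matrix: row 0 stores three values, row 1 stores one, row 2 stores none.
dense = np.array([
    [0.0, 5.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
m = csr_matrix(dense)

result = top_n_idx_sparse(m, 2)
# argpartition does not order the selected entries, so sort them for display
print([sorted(r.tolist()) for r in result])  # [[1, 3], [2], []]
```

Two things to keep in mind: the indices within each row come back in no particular order, and only explicitly stored values are considered, so implicit zeros are never picked even if a row's stored values are negative.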
