2
votes

I have two scipy sparse csr matrices with the exact same shape but potentially different data values and nnz value. I now want to get the top 10 elements of one matrix and increase the value on the same indices on the other matrix. My current approach is as follows:

idx = a.data.argpartition(-10)[-10:]
i, j = matrix.nonzero()

i_idx = i[idx]
j_idx = j[idx]

b[i_idx, j_idx] += 1

The reason I have to go this way is that a.data and b.data do not necessarily have the same number of elements and hence the indices would differ.

My question now is whether I can improve this in some way. As far as I know the nonzero procedure is not elegant as I have to allocate two new arrays and I am very tough on memory already. I can get the j_indices via csr_matrix.indices but what about the i_indices? Can I use the indptr in a nice way for that?

Happy for any hints.

1
indptr has one value per row (plus 1). It indicates where each row starts in the data and indices arrays. You can do the math, or you can convert the array tocoo(). Then row and col have values you want. But beware, there are some warnings about indices may not be sorted.hpaulj
Look at the code for nonzero. If converts the matrix to coo and returns the row and col.hpaulj
Does top 10 elements mean first 10 nonzeros in CSR format?paul-g

1 Answers

0
votes

I'm not sure what the "top 10 elements" means. I assume that if you have matrices A and B you want to set B[i, j] += 1 if A[i, j] is within the first 10 nonzero entries of A (in CSR format). I also assume that B[i, j] could be zero, which is the worst case performance wise, since you need to modify the sparsity structure of your matrix.

CSR is not an ideal format to use for changing sparsity structure. This is because every new insertion/deletion is O(nnzs) complexity (assuming the CSR storage is backed by an array - and it usually is).

You could use the DOK format for your second matrix (the one you are modifying), which provides O(1) access to elements.

I haven't benchmarked this but I suppose your option is 10 * O(nnzs) at worst, when you are adding 10 new nonzero values, whereas the DOK version should need O(nnzs) to build the matrix, then O(1) for each insertion and, finally, O(nnzs) to convert it back to CSR (assuming this is needed).