0
votes

The question is very simple: say I have a row r taken from a SciPy sparse matrix M (100,000 x 500,000), and I want to find its location/index in M. How can I accomplish this efficiently?

Currently I am trying the following way, but it is horribly slow.

import numpy as np

offset = 500  # number of rows to densify per chunk
row = row.toarray()  # convert the sparse row to a dense (1, n_cols) array
matches = []  # indices of the rows of M that equal row
for begin in range(0, M.shape[0], offset):
    end = min(begin + offset, M.shape[0])
    sub_M = M[begin:end, :].toarray()  # M is too big for its dense form to fit in memory
    labels = np.all(row == sub_M, axis=1)  # compare row against this chunk of M, in a dense representation
    matches.extend(begin + np.flatnonzero(labels))

2 Answers

1
votes

Unless you want to dig into the internals of one or more sparse matrix types, you should use CSR format for your matrix and:

  • Calculate the length (L2 norm) of each matrix row; in other words: sqrt(M.multiply(M).sum(axis=1)), the square root of the row-wise sum of squares
  • Normalize r to (L2) length 1
  • Matrix multiply M*r (where r is treated as a column vector)

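If an entry of M*r matches the length of the corresponding row, then that row points in the same direction as r (by the Cauchy-Schwarz inequality, equality holds only for a non-negative multiple of r). To confirm an exact match, also check that the row's length equals the original, unnormalized length of r, and compare floating-point values with a tolerance rather than exact equality.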

Note that the default ord for numpy.linalg.norm is L2 norm.
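
The steps above could be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the function name find_row_indices and the tol parameter are made up for this example, and the final exact comparison of the candidate rows is an extra safeguard added here because the norm test is only tolerance-based.

```python
import numpy as np
from scipy import sparse

def find_row_indices(M, r, tol=1e-9):
    """Return the indices of the rows of M that equal the sparse row r (shape 1 x n).

    Hypothetical helper for illustration; the name and tol are assumptions.
    """
    M = sparse.csr_matrix(M)
    r = sparse.csr_matrix(r)

    # L2 length of every row of M and of r, computed without densifying M.
    row_norms = np.sqrt(np.asarray(M.multiply(M).sum(axis=1)).ravel())
    r_norm = np.sqrt(r.multiply(r).sum())

    if r_norm == 0:
        # An all-zero r can only match all-zero rows.
        return np.flatnonzero(row_norms == 0)

    # Normalize r to length 1 and project every row of M onto it.
    r_unit = r.multiply(1.0 / r_norm)
    proj = (M @ r_unit.T).toarray().ravel()

    # A row is a candidate if it points the same way as r and has the same length.
    same_direction = np.isclose(proj, row_norms, atol=tol)
    same_length = np.isclose(row_norms, r_norm, atol=tol)
    candidates = np.flatnonzero(same_direction & same_length)

    # Confirm each candidate exactly, since the tests above use a tolerance.
    exact = [i for i in candidates if (M[i] != r).nnz == 0]
    return np.array(exact, dtype=int)
```

The sparse product M @ r_unit.T only touches the stored non-zeros, so no dense copy of M is ever created; the exact check runs only on the few rows that survive the norm test.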

0
votes

In the end, I came up with a very simple but much more time-efficient solution. Each row of the sparse matrix is converted to a string and stored as a key in a dictionary, with its index/location as the value. The row to be found is then used as the key, and dic[str(row)] gives me its index.
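
A rough sketch of this dictionary idea is below. Note that it swaps in a different key than the str(row) used above: the printed form of a large sparse row can be truncated, so this version builds the key from the raw column indices and values of each CSR row instead. The helper names row_key, build_row_index and lookup_row are made up for this example.

```python
import numpy as np
from scipy import sparse

def row_key(indices, data):
    # Hashable key built from a row's column indices and values.
    # Casting to fixed dtypes keeps keys comparable between M and r.
    return (indices.astype(np.int64).tobytes(), data.astype(np.float64).tobytes())

def canonical_csr(A):
    # Copy into CSR with merged duplicates, sorted indices and no stored zeros,
    # so that equal rows always produce identical keys.
    A = sparse.csr_matrix(A, copy=True)
    A.sum_duplicates()
    A.eliminate_zeros()
    return A

def build_row_index(M):
    # Map each distinct row of M to the list of row indices where it occurs.
    M = canonical_csr(M)
    index = {}
    for i in range(M.shape[0]):
        start, stop = M.indptr[i], M.indptr[i + 1]
        key = row_key(M.indices[start:stop], M.data[start:stop])
        index.setdefault(key, []).append(i)
    return index

def lookup_row(index, r):
    # Return all row indices whose content equals the sparse row r.
    r = canonical_csr(r)
    return index.get(row_key(r.indices, r.data), [])
```

Building the dictionary takes one pass over M; after that, each lookup is a single hash of the query row, which pays off when many rows have to be located in the same matrix.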