I have a term-document matrix as a sparse matrix (either a CSR or COO matrix) and a feature vector against which I want to do similarity comparisons. I have the following methods I want to try:
1.) With the doc matrix as a CSR matrix, turn it into an ndarray, then iterate over the rows and compute the cosine similarity between ndarrays using scikit-learn's cosine_similarity.
2.) With the doc matrix as a CSR matrix, turn it into an ndarray, take the matrix product of the matrix with the vector, and divide by the magnitudes to get the cosine similarity scores directly.
3.) With the doc matrix as a COO matrix, use the zip function to quickly iterate over the stored entries (while keeping track of which row you're in) and compute the cosine similarities without taking advantage of the vectorization that ndarrays offer.
The first method has poor memory performance for large matrices (since you have to convert to dense form), although it takes advantage of fast vectorization and the built-in cosine similarity method.
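To make the comparison concrete, here's a rough sketch of what I mean by method 1 (the matrix and query here are just toy data):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# toy term-document matrix and query vector, purely for illustration
docs = csr_matrix(np.array([[1, 0, 2],
                            [0, 3, 1],
                            [4, 0, 0]], dtype=float))
query = np.array([[1, 0, 1]], dtype=float)

# Method 1: densify the whole matrix, then compare row by row
dense = docs.toarray()
sims = [cosine_similarity(row.reshape(1, -1), query)[0, 0] for row in dense]
```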
The second method also has poor memory performance but takes advantage of vectorization even more than the first. It requires more operations (although these would also be vectorized) and can't use the built-in cosine similarity method.
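Method 2 would look roughly like this (same toy data as above, assuming I densify first and normalize by the magnitudes myself):

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy term-document matrix and query vector, purely for illustration
docs = csr_matrix(np.array([[1, 0, 2],
                            [0, 3, 1],
                            [4, 0, 0]], dtype=float))
query = np.array([1, 0, 1], dtype=float)

# Method 2: one matrix-vector product, then divide by the magnitudes
dense = docs.toarray()
dots = dense @ query
norms = np.linalg.norm(dense, axis=1) * np.linalg.norm(query)
sims = dots / norms
```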
The third method keeps the matrix sparse at the expense of vectorization speed, but iterating over a COO matrix with zip is fast even for large matrices. This implementation would be dirtier code and have no vectorization.
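By method 3 I mean something like this sketch, accumulating per-row dot products and squared norms while walking the stored entries (again with toy data):

```python
import numpy as np
from scipy.sparse import coo_matrix

# toy term-document matrix and query vector, purely for illustration
docs = coo_matrix(np.array([[1, 0, 2],
                            [0, 3, 1],
                            [4, 0, 0]], dtype=float))
query = np.array([1, 0, 1], dtype=float)
q_norm = np.linalg.norm(query)

n_rows = docs.shape[0]
dots = np.zeros(n_rows)
sq_norms = np.zeros(n_rows)
# iterate over the stored entries, tracking which row each belongs to
for r, c, v in zip(docs.row, docs.col, docs.data):
    dots[r] += v * query[c]
    sq_norms[r] += v * v
sims = dots / (np.sqrt(sq_norms) * q_norm)
```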
Which would be the best method?
Lastly, I was wondering whether there is a way to iterate over the rows of a CSR matrix (as ndarrays) and then do the vectorized cosine similarity. This would only turn individual rows into dense form while still allowing cosine similarity via the built-in functions, making it an intermediate approach that preserves sparsity to an extent and also allows vectorized operations. Is there an easy way to do this?
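Roughly what I'm imagining is something like the following, where only one row is densified at a time (I'm not sure this is the idiomatic way to pull rows out of a CSR matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# toy term-document matrix and query vector, purely for illustration
docs = csr_matrix(np.array([[1, 0, 2],
                            [0, 3, 1],
                            [4, 0, 0]], dtype=float))
query = np.array([[1, 0, 1]], dtype=float)

# densify one row at a time; getrow(i) returns a 1 x n_cols sparse row
sims = [cosine_similarity(docs.getrow(i).toarray(), query)[0, 0]
        for i in range(docs.shape[0])]
```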