
I use Dask to handle a big array of vectors that can't fit in memory, and I use scikit-learn's cosine_similarity to compute the cosine similarity between those vectors, i.e.:

import dask.array as da
from sklearn.metrics.pairwise import cosine_similarity

# wrap the on-disk/in-memory vectors as a dask array in chunks of 10000 rows
vectors = da.from_array(vectors, chunks=10000)
sims_mat = cosine_similarity(vectors)

This works fine, but I am not sure whether I actually get any of Dask's benefits this way, or whether I should look for a cosine similarity function built for dask arrays.


1 Answer


In my opinion this should be fine, because if you check the documentation of both dask and sklearn, you will find that both are built on top of numpy, whose underlying routines can already run in parallel.

If you really want to stay entirely in dask, you could check out this package: https://pypi.python.org/pypi/dask-distance

It includes a cosine similarity function.
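Alternatively, cosine similarity is straightforward to express directly with dask.array operations, which keeps the whole computation lazy: L2-normalize the rows, then take the matrix product of the normalized array with its transpose. A minimal sketch (the helper name and the small random data are illustrative, not from the question):

```python
import dask.array as da
import numpy as np

def dask_cosine_similarity(x):
    """Pairwise cosine similarity of the rows of a 2-D dask array.

    Rows are L2-normalized, then the similarity matrix is the matrix
    product of the normalized rows with their transpose. The result
    stays a lazy dask array until .compute() is called.
    """
    norms = da.sqrt((x * x).sum(axis=1, keepdims=True))
    x_normed = x / norms
    return x_normed @ x_normed.T

# small made-up example, chunked by rows as in the question
rng = np.random.default_rng(0)
data = rng.normal(size=(8, 5))
vectors = da.from_array(data, chunks=(4, 5))

sims = dask_cosine_similarity(vectors).compute()
print(sims.shape)  # (8, 8)
```

Because every step is a dask operation, the similarity matrix is built chunk by chunk, so intermediate results need not fit in memory at once (though note the full n-by-n output can itself be large).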