
I use Dask to handle a big array of vectors that can't fit in memory, and I use scikit-learn's cosine_similarity to compute the cosine similarity between those vectors, i.e.:

import dask.array as da
from sklearn.metrics.pairwise import cosine_similarity

# wrap the on-disk/in-memory vectors as a dask array in chunks of 10000 rows
vectors = da.from_array(vectors, chunks=10000)
sims_mat = cosine_similarity(vectors)

This works fine, but I am not sure whether I actually get any of Dask's benefits this way, or whether I should look for a cosine similarity function built for dask arrays.


1 Answer


In my opinion this should be fine, because if you check the documentation of both dask and sklearn, you will find that both are built on top of numpy, whose underlying routines can already run in parallel.

If you really want to stay entirely in dask, you could check out this package: https://pypi.python.org/pypi/dask-distance

It includes a cosine similarity function.
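Alternatively, cosine similarity is straightforward to express directly with dask.array operations, which keeps the whole computation lazy: L2-normalize the rows, then take the matrix product of the normalized array with its transpose. A minimal sketch (the helper name and the small random data are illustrative, not from the question):

```python
import dask.array as da
import numpy as np

def dask_cosine_similarity(x):
    """Pairwise cosine similarity of the rows of a 2-D dask array.

    Rows are L2-normalized, then the similarity matrix is the matrix
    product of the normalized rows with their transpose. The result
    stays a lazy dask array until .compute() is called.
    """
    norms = da.sqrt((x * x).sum(axis=1, keepdims=True))
    x_normed = x / norms
    return x_normed @ x_normed.T

# small made-up example, chunked by rows as in the question
rng = np.random.default_rng(0)
data = rng.normal(size=(8, 5))
vectors = da.from_array(data, chunks=(4, 5))

sims = dask_cosine_similarity(vectors).compute()
print(sims.shape)  # (8, 8)
```

Because every step is a dask operation, the similarity matrix is built chunk by chunk, so intermediate results need not fit in memory at once (though note the full n-by-n output can itself be large).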