1 vote

I'm trying to estimate the cosine similarity between each document i in a Corpus A and all documents in a Corpus B.

Any idea how I can do this efficiently? I'm working with pretty large datasets.

Essentially, for each document in Corpus A, I want to find the document(s) in Corpus B that are most similar to it.

Is this literally word processing? Does the bag-of-words algorithm help condense the size of the problem? – gnodab
It's certainly not a bag-of-words problem. – Vash
What makes you say it isn't a bag-of-words problem? (That will reduce each doc to a vector, and then you can do the pairwise cosine-similarity calculations you've mentioned.) Did you try it and it didn't work? What made its results unsatisfactory? – gojomo

2 Answers

1 vote

Take a look at the Vector Space Model. That article covers representing each document as a tf-idf (term frequency–inverse document frequency) vector. That representation embeds the documents in a shared vector space where cosine similarity can be computed efficiently.
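For example, a minimal sketch of that approach with scikit-learn; the library choice and the `corpus_a` / `corpus_b` names are assumptions, not from the article:

```python
# A minimal sketch, assuming both corpora are plain lists of strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus_a = ["first document in A", "second document in A"]
corpus_b = ["first document in B", "second document in B", "third document in B"]

# Fit one vocabulary over both corpora so all vectors share a feature space.
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus_a + corpus_b)

tfidf_a = vectorizer.transform(corpus_a)  # shape: (len(corpus_a), vocab_size)
tfidf_b = vectorizer.transform(corpus_b)  # shape: (len(corpus_b), vocab_size)

# Row i holds the similarity of document i in A to every document in B.
sim = cosine_similarity(tfidf_a, tfidf_b)

# Index of the most similar document in B for each document in A.
best_match = sim.argmax(axis=1)
```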

I would construct a (dis)similarity matrix in which entry (i, j) holds the distance between document i in Corpus A and document j in Corpus B. Each row can be computed independently, so the computation parallelizes well.
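Since each row of the matrix depends on only one document from Corpus A, it can be computed in independent row blocks; a rough sketch, reusing the tf-idf matrices from above (the block size and helper name are assumptions):

```python
# A rough sketch: compute the similarity matrix in row blocks so each block
# fits in memory and can be dispatched to a separate worker if desired.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def best_matches(tfidf_a, tfidf_b, block_size=1000):
    """For each row of tfidf_a, return the index of the most similar row of tfidf_b."""
    best = np.empty(tfidf_a.shape[0], dtype=int)
    for start in range(0, tfidf_a.shape[0], block_size):
        stop = start + block_size
        # Only a (block_size, len(corpus_b)) slice is ever held in memory.
        sim = cosine_similarity(tfidf_a[start:stop], tfidf_b)
        best[start:stop] = sim.argmax(axis=1)
    return best
```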

0 votes

  1. Compute a document embedding for each document in Corpus A and Corpus B using sentence-transformers.
  2. Compute the cosine similarity between each document embedding in A and every embedding in B.
  3. Sort each document's similarity scores.
  4. Extract the top N documents in B for each document in A (see the sketch after this list).
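A minimal sketch of these steps with the sentence-transformers package; the model name and `top_k` value are illustrative choices, not specified above:

```python
# A minimal sketch; "all-MiniLM-L6-v2" and top_k=5 are assumed choices.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus_a = ["a document in Corpus A", "another document in Corpus A"]
corpus_b = ["a document in Corpus B", "another document in Corpus B"]

# Step 1: embed every document in both corpora.
emb_a = model.encode(corpus_a, convert_to_tensor=True)
emb_b = model.encode(corpus_b, convert_to_tensor=True)

# Steps 2-4: semantic_search scores each embedding in A against all of B
# by cosine similarity and returns the top_k matches, already sorted.
hits = util.semantic_search(emb_a, emb_b, top_k=5)
for i, doc_hits in enumerate(hits):
    for hit in doc_hits:
        print(f"A[{i}] -> B[{hit['corpus_id']}] (score {hit['score']:.3f})")
```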