1 vote

I'm trying to estimate the cosine similarity between each document i in a Corpus A and all documents in a Corpus B.

Any idea how I can do this efficiently? I'm working with pretty large datasets.

Essentially, for each document in Corpus A, I want to find the document(s) in Corpus B that are most similar to it.

Is this literally word processing? Does the bag-of-words algorithm help condense the size of the problem? – gnodab
It's certainly not a bag-of-words problem. – Vash
What makes you say it isn't a bag-of-words problem? (That will reduce each doc to a vector, and then you can do the pairwise cosine-similarity calculations you've mentioned.) Did you try it and it didn't work? What made its results unsatisfactory? – gojomo

2 Answers

1 vote

Take a look at the Vector Space Model. That article covers representing each document as a tf-idf (term frequency–inverse document frequency) vector. That representation embeds the documents in a shared vector space where cosine similarity can be computed efficiently.
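For example, a minimal sketch of that approach with scikit-learn; the library choice and the `corpus_a` / `corpus_b` names are assumptions, not from the article:

```python
# A minimal sketch, assuming both corpora are plain lists of strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus_a = ["first document in A", "second document in A"]
corpus_b = ["first document in B", "second document in B", "third document in B"]

# Fit one vocabulary over both corpora so all vectors share a feature space.
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus_a + corpus_b)

tfidf_a = vectorizer.transform(corpus_a)  # shape: (len(corpus_a), vocab_size)
tfidf_b = vectorizer.transform(corpus_b)  # shape: (len(corpus_b), vocab_size)

# Row i holds the similarity of document i in A to every document in B.
sim = cosine_similarity(tfidf_a, tfidf_b)

# Index of the most similar document in B for each document in A.
best_match = sim.argmax(axis=1)
```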

I would construct a (dis)similarity matrix in which entry (i, j) holds the distance between document i in Corpus A and document j in Corpus B. Each row can be computed independently, so the computation parallelizes well.
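Since each row of the matrix depends on only one document from Corpus A, it can be computed in independent row blocks; a rough sketch, reusing the tf-idf matrices from above (the block size and helper name are assumptions):

```python
# A rough sketch: compute the similarity matrix in row blocks so each block
# fits in memory and can be dispatched to a separate worker if desired.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def best_matches(tfidf_a, tfidf_b, block_size=1000):
    """For each row of tfidf_a, return the index of the most similar row of tfidf_b."""
    best = np.empty(tfidf_a.shape[0], dtype=int)
    for start in range(0, tfidf_a.shape[0], block_size):
        stop = start + block_size
        # Only a (block_size, len(corpus_b)) slice is ever held in memory.
        sim = cosine_similarity(tfidf_a[start:stop], tfidf_b)
        best[start:stop] = sim.argmax(axis=1)
    return best
```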

0 votes

  1. Compute a document embedding for each document in Corpus A and Corpus B using sentence-transformers.
  2. Compute the cosine similarity between each document embedding in A and every embedding in B.
  3. Sort each document's similarity scores.
  4. Extract the top N documents in B for each document in A (see the sketch after this list).
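A minimal sketch of these steps with the sentence-transformers package; the model name and `top_k` value are illustrative choices, not specified above:

```python
# A minimal sketch; "all-MiniLM-L6-v2" and top_k=5 are assumed choices.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus_a = ["a document in Corpus A", "another document in Corpus A"]
corpus_b = ["a document in Corpus B", "another document in Corpus B"]

# Step 1: embed every document in both corpora.
emb_a = model.encode(corpus_a, convert_to_tensor=True)
emb_b = model.encode(corpus_b, convert_to_tensor=True)

# Steps 2-4: semantic_search scores each embedding in A against all of B
# by cosine similarity and returns the top_k matches, already sorted.
hits = util.semantic_search(emb_a, emb_b, top_k=5)
for i, doc_hits in enumerate(hits):
    for hit in doc_hits:
        print(f"A[{i}] -> B[{hit['corpus_id']}] (score {hit['score']:.3f})")
```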