
I am trying to calculate the similarity between two documents, each of which comprises more than a thousand sentences.

The baseline would be to calculate cosine similarity over BOW (bag-of-words) vectors.

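For reference, a minimal sketch of this baseline on toy data (assuming scikit-learn; any BOW vectorizer would do):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# two toy documents standing in for the real ones
docs = ["the cat sat on the mat", "a feline rested on a rug"]

bow = CountVectorizer().fit_transform(docs)     # sparse term-count matrix
print(cosine_similarity(bow[0], bow[1])[0, 0])  # BOW cosine similarity
```
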
However, I want to capture more of the semantic differences between the documents.

Hence, I trained word embeddings and computed document similarity by generating a document vector for each document (simply averaging all the word vectors in that document) and measuring the cosine similarity between these document vectors.

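Concretely, this is what I mean (a toy sketch assuming the gensim 4.x API; the real corpus and parameters differ):

```python
import numpy as np
from gensim.models import Word2Vec

# toy tokenized documents standing in for the real corpus
docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["a", "feline", "rested", "on", "a", "rug"]]

model = Word2Vec(docs, vector_size=100, min_count=1)

def doc_vector(tokens):
    """Average the vectors of all in-vocabulary tokens."""
    return np.mean([model.wv[t] for t in tokens if t in model.wv], axis=0)

v1, v2 = doc_vector(docs[0]), doc_vector(docs[1])
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```
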
However, since each input document is rather large, the results I get from this method are very similar to the simple BOW cosine similarity.

I have two questions:

Q1. I found that gensim offers a soft cosine similarity. But I am having a hard time understanding how it differs from the method I used above, and whether it is a feasible mechanism for calculating similarity between millions of document pairs.

Q2. I found that gensim's Doc2Vec would be more appropriate for my purpose. But I realized that training Doc2Vec requires more RAM than I have (32 GB), since my entire corpus is about 100 GB. Is there any way to train the model on a small part of the corpus (say 20 GB of it) and use that model to calculate pairwise similarities across the entire corpus? If so, what would be a desirable training set size, and is there a tutorial I can follow?


1 Answer


Ad Q1: If the similarity matrix contains the cosine similarities of the word embeddings (which it more or less does, see Equation 4 in SimBow at SemEval-2017 Task 3) and if the word embeddings are L2-normalized, then the SCM (Soft Cosine Measure) is equivalent to averaging the word embeddings (i.e., your baseline). For a proof, see Lemma 3.3 in the Implementation Notes for the SCM. My Gensim implementation of the SCM (1, 2) additionally sparsifies the similarity matrix to keep the memory footprint small and to regularize the embeddings, so you will get slightly different results compared to the vanilla SCM. If averaging the embeddings gives you results similar to the simple BOW cosine similarity, I would question the quality of your embeddings.
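
For illustration, here is a minimal sketch of the SCM in Gensim (assuming the gensim 4.x API; the toy documents and parameters are placeholders):

```python
from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex

# toy tokenized documents standing in for the real corpus
docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["a", "feline", "rested", "on", "a", "rug"]]

model = Word2Vec(docs, vector_size=100, min_count=1)  # your embeddings go here
dictionary = Dictionary(docs)
bow1, bow2 = dictionary.doc2bow(docs[0]), dictionary.doc2bow(docs[1])

# sparsified term similarity matrix built from embedding cosine similarities
termsim_index = WordEmbeddingSimilarityIndex(model.wv)
termsim_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

# soft cosine similarity between the two BOW vectors
print(termsim_matrix.inner_product(bow1, bow2, normalized=(True, True)))
```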

Ad Q2: Training a Doc2Vec model on the entire dataset for one epoch is equivalent to training it on smaller segments of the dataset, one epoch per segment. Just be aware that Doc2Vec uses document ids as part of the training process, so you must ensure that the ids remain unique after segmentation (i.e., the first document of the first segment must have a different id than the first document of the second segment).
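
As a sketch of how you might stream the corpus in segments with globally unique ids (file names and parameters are hypothetical; gensim 4.x API assumed):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class TaggedCorpus:
    """Streams pre-tokenized documents (one per line) from disk,
    so the corpus never has to fit into RAM at once."""
    def __init__(self, paths):
        self.paths = paths

    def __iter__(self):
        doc_id = 0  # a single global counter keeps tags unique across segments
        for path in self.paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    yield TaggedDocument(words=line.split(), tags=[doc_id])
                    doc_id += 1

corpus = TaggedCorpus(["segment_00.txt", "segment_01.txt"])  # hypothetical segment files
model = Doc2Vec(vector_size=300, min_count=5, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```

Keep in mind that streaming only removes the need to hold the raw text in RAM; the model still allocates one vector per unique document tag, which can dominate memory when you have millions of documents.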