0 votes

I am trying to train a doc2vec model on training data, then use the trained model to find the similarity of every document in the test data to a specific document in the test data. However, I am unable to determine how to do this.

I am currently using model.docvecs.most_similar(...). However, this function only finds the similarity of every document in the training data to a specific document in the test data.

I have tried manually comparing the inferred vector of a specific document in the test data with the inferred vectors of every other document in the test data, using model.docvecs.n_similarity(inferred_vector.tolist(), testvectors[i].tolist()), but this returns KeyError: "tag '-0.3502606451511383' not seen in training corpus/invalid", since the raw vector values are looked up as tags that are not in the model's vocabulary.


2 Answers

1 vote

The act of training-up a Doc2Vec model leaves it with a record of the doc-vectors learned from the training data, and yes, most_similar() just looks among those vectors.

Generally, doing any operations on new documents that weren't part of training will require the use of infer_vector() (see the sketch after this list). Note that such inference:

  • ignores any unknown words in the new document
  • may benefit from parameter tuning, especially for short documents
  • is currently done just one document at a time in a single thread – so, acquiring inferred-vectors for a large batch of N-thousand docs can actually be slower than training a fresh model on the same N-thousand docs
  • isn't necessarily deterministic, unless you take extra steps, because the underlying algorithms use random initialization and randomized selection processes during training/inference
  • just gives you the vector, without loading it into any convenient storage-object for performing further most_similar()-like comparisons
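
As a minimal sketch of such inference (assuming a trained Doc2Vec model named model, and gensim 4.x, where the parameter is epochs rather than the older steps):

    from gensim.utils import simple_preprocess

    # Tokenize the new document the same way the training corpus was
    # tokenized; simple_preprocess() is just a stand-in for whatever
    # preprocessing you actually used.
    tokens = simple_preprocess("a new test document not seen during training")

    # epochs may need tuning, especially for short documents
    inferred_vector = model.infer_vector(tokens, epochs=50)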

On the other hand, such inference from a "frozen" model can be parallelized across processes or machines.

The n_similarity() method you mention isn't really appropriate for your needs: it's expecting lists of lookup-keys ('tags') for existing doc-vectors, not raw vectors like you're supplying.
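
For contrast, a call that wouldn't raise that KeyError passes tags that were used during training (the tag names here are hypothetical):

    # Both arguments are lists of tags known from training, not raw vectors
    model.docvecs.n_similarity(["train_doc_0"], ["train_doc_1"])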

The similarity_unseen_docs() method you mention in your answer is somewhat appropriate, but just takes a pair of docs, re-calculating their vectors each time – somewhat wasteful if a single new document's doc-vector needs to be compared against many other new documents' doc-vectors.

You may just want to train an all-new model, with both your "training documents" and your "test documents". Then all the "test documents" get their doc-vectors calculated, and stored inside the model, as part of the bulk training. This is an appropriate choice for many possible applications, and indeed could learn interesting relationships based on words that only appear in the "test docs", in a totally unsupervised way. And nothing in your question so far gives a reason to rule that out here.
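
A rough sketch of that combined-training approach, where train_tokens and test_tokens are hypothetical lists of already-tokenized documents:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Tag every document, train and test alike, so each gets a stored vector
    all_docs = [TaggedDocument(words=tokens, tags=["doc_%d" % i])
                for i, tokens in enumerate(train_tokens + test_tokens)]

    model = Doc2Vec(all_docs, vector_size=100, epochs=20, min_count=2)

    # Every tagged document, "test docs" included, is now searchable
    model.docvecs.most_similar("doc_0")  # model.dv in gensim 4.x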

Alternatively, you could infer_vector() all the new "test docs", and put them into a structure like one of gensim's KeyedVectors utility classes – keeping all the vectors in one array, remembering the mapping from doc-key to vector-index, and providing an efficient bulk most_similar() over the set of vectors.
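
A sketch of that alternative, again with hypothetical test_tokens, using KeyedVectors as the storage object (add_vectors() is the gensim 4.x name; older versions used add()):

    from gensim.models import KeyedVectors

    # Infer a vector for each tokenized test document
    keys = ["test_%d" % i for i in range(len(test_tokens))]
    vectors = [model.infer_vector(tokens) for tokens in test_tokens]

    # Store them under their keys for efficient bulk comparisons
    test_kv = KeyedVectors(vector_size=model.vector_size)
    test_kv.add_vectors(keys, vectors)

    # most_similar() now searches only among the test-doc vectors
    test_kv.most_similar("test_0")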

0 votes

It turns out there is a function called similarity_unseen_docs(...) which can be used to find the similarity of two documents in the test data.
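
For example, with two hypothetical tokenized test documents (in gensim 4.x this is a method on the Doc2Vec model itself):

    # Re-infers a vector for each document, then returns their cosine similarity
    score = model.similarity_unseen_docs(doc1_tokens, doc2_tokens)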

However, I will leave the question open for now, as this approach is not very optimal: I would need to manually compare the specific document with every other document in the test data. It also works from the documents' words rather than stored vectors, re-inferring a vector on each call, which could affect accuracy.