0 votes

I am trying to train a doc2vec model on training data, then use the trained model to find the similarity of every document in the test data to a specific document in the test data. However, I am unable to determine how to do this.

I am currently using model.docvecs.most_similar(...). However, this function only finds the similarity of every document in the training data to a specific document in the test data.

I have tried manually comparing the inferred vector of a specific document in the test data with the inferred vectors of every other document in the test data, using model.docvecs.n_similarity(inferred_vector.tolist(), testvectors[i].tolist()), but this returns KeyError: "tag '-0.3502606451511383' not seen in training corpus/invalid", since the raw vector values are looked up as tags that are not in the model's vocabulary.


2 Answers

1 vote

The act of training-up a Doc2Vec model leaves it with a record of the doc-vectors learned from the training data, and yes, most_similar() just looks among those vectors.

Generally, doing any operations on new documents that weren't part of training will require the use of infer_vector() (see the sketch after this list). Note that such inference:

  • ignores any unknown words in the new document
  • may benefit from parameter tuning, especially for short documents
  • is currently done just one document at a time in a single thread – so, acquiring inferred-vectors for a large batch of N-thousand docs can actually be slower than training a fresh model on the same N-thousand docs
  • isn't necessarily deterministic, unless you take extra steps, because the underlying algorithms use random initialization and randomized selection processes during training/inference
  • just gives you the vector, without loading it into any convenient storage-object for performing further most_similar()-like comparisons
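
As a minimal sketch of such inference (assuming a trained Doc2Vec model named model, and gensim 4.x, where the parameter is epochs rather than the older steps):

    from gensim.utils import simple_preprocess

    # Tokenize the new document the same way the training corpus was
    # tokenized; simple_preprocess() is just a stand-in for whatever
    # preprocessing you actually used.
    tokens = simple_preprocess("a new test document not seen during training")

    # epochs may need tuning, especially for short documents
    inferred_vector = model.infer_vector(tokens, epochs=50)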

On the other hand, such inference from a "frozen" model can be parallelized across processes or machines.

The n_similarity() method you mention isn't really appropriate for your needs: it's expecting lists of lookup-keys ('tags') for existing doc-vectors, not raw vectors like you're supplying.
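
For contrast, a call that wouldn't raise that KeyError passes tags that were used during training (the tag names here are hypothetical):

    # Both arguments are lists of tags known from training, not raw vectors
    model.docvecs.n_similarity(["train_doc_0"], ["train_doc_1"])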

The similarity_unseen_docs() method you mention in your answer is somewhat appropriate, but just takes a pair of docs, re-calculating their vectors each time – somewhat wasteful if a single new document's doc-vector needs to be compared against many other new documents' doc-vectors.

You may just want to train an all-new model, with both your "training documents" and your "test documents". Then all the "test documents" get their doc-vectors calculated, and stored inside the model, as part of the bulk training. This is an appropriate choice for many possible applications, and indeed could learn interesting relationships based on words that only appear in the "test docs", in a totally unsupervised way. And nothing in your question so far gives a reason to rule that out here.
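
A rough sketch of that combined-training approach, where train_tokens and test_tokens are hypothetical lists of already-tokenized documents:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Tag every document, train and test alike, so each gets a stored vector
    all_docs = [TaggedDocument(words=tokens, tags=["doc_%d" % i])
                for i, tokens in enumerate(train_tokens + test_tokens)]

    model = Doc2Vec(all_docs, vector_size=100, epochs=20, min_count=2)

    # Every tagged document, "test docs" included, is now searchable
    model.docvecs.most_similar("doc_0")  # model.dv in gensim 4.x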

Alternatively, you could infer_vector() all the new "test docs", and put them into a structure like one of gensim's KeyedVectors utility classes – keeping all the vectors in one array, remembering the mapping from doc-key to vector-index, and providing an efficient bulk most_similar() over the set of vectors.
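
A sketch of that alternative, again with hypothetical test_tokens, using KeyedVectors as the storage object (add_vectors() is the gensim 4.x name; older versions used add()):

    from gensim.models import KeyedVectors

    # Infer a vector for each tokenized test document
    keys = ["test_%d" % i for i in range(len(test_tokens))]
    vectors = [model.infer_vector(tokens) for tokens in test_tokens]

    # Store them under their keys for efficient bulk comparisons
    test_kv = KeyedVectors(vector_size=model.vector_size)
    test_kv.add_vectors(keys, vectors)

    # most_similar() now searches only among the test-doc vectors
    test_kv.most_similar("test_0")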

0 votes

It turns out there is a function called similarity_unseen_docs(...) which can be used to find the similarity of two documents in the test data.
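
For example, with two hypothetical tokenized test documents (in gensim 4.x this is a method on the Doc2Vec model itself):

    # Re-infers a vector for each document, then returns their cosine similarity
    score = model.similarity_unseen_docs(doc1_tokens, doc2_tokens)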

However, I will leave the question open for now, as this approach is not very optimal: I would need to manually compare the specific document with every other document in the test data. It also works from the documents' words rather than stored vectors, re-inferring a vector on each call, which could affect accuracy.