I'm trying to find the documents most similar to a new document. I trained a doc2vec model first and then inferred a vector for the new document, but I don't know the ins and outs of doc2vec well. If the new document contains many words (in a row) that the model never encountered during training, how will they be handled?

1 Answer

During inference, a Doc2Vec model can only consider words that it learned from the training texts. Unknown words are simply ignored.

One implication: a document made up entirely of new words, passed to infer_vector(), will return an essentially random vector. All inference begins from a low-magnitude random vector, which is then adjusted, in a training-like process, to better predict the words that are present. But with no known words, there is nothing to predict, so the model can make no incremental improvements, and inference is a no-op after initialization.
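As a rough pre-check before calling infer_vector(), you could measure how much of the new document the model actually knows. The sketch below is plain Python and does not call gensim itself; `split_known_unknown`, `known_fraction`, and the toy `vocabulary` are hypothetical names made up for illustration. With gensim 4.x the model's word vocabulary should be available as `model.wv.key_to_index`, but treat that as an assumption about your installed version.

```python
def split_known_unknown(tokens, vocabulary):
    """Partition tokens by whether they appear in the given vocabulary."""
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown


def known_fraction(tokens, vocabulary):
    """Return the share of tokens found in the vocabulary (0.0 for an empty document)."""
    if not tokens:
        return 0.0
    known, _ = split_known_unknown(tokens, vocabulary)
    return len(known) / len(tokens)


# Toy vocabulary standing in for the words the trained model learned.
# Assumption: with gensim 4.x you would pass model.wv.key_to_index here instead.
vocabulary = {"the", "cat", "sat", "on", "mat"}

# Hypothetical new document; "zorble" and "quux" were never seen in training.
new_doc = "the cat sat on the zorble quux".split()

known, unknown = split_known_unknown(new_doc, vocabulary)
coverage = known_fraction(new_doc, vocabulary)

print("known tokens:  ", known)
print("unknown tokens:", unknown)
print("coverage:      ", round(coverage, 2))

if coverage == 0.0:
    # No known words: the inferred vector would stay at its random initialization.
    print("No known words; an inferred vector would be essentially random.")
```

If the coverage is zero, the inferred vector carries no information about the document, so any "most similar" ranking built from it is meaningless. A low but nonzero coverage still yields a vector, but one based only on the few known words.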