
I'm a very new student of doc2vec and have some questions about document vectors. What I'm trying to get is a vector for a phrase like 'cat-like mammal'. So, using a pre-trained doc2vec model, I tried the code below:

import gensim.models as g

model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)  # note: Doc2Vec, not Doc2vec

oneword = 'cat'
phrase = 'cat like mammal'

oneword_vec = m[oneword]  # works: 'cat' is a single vocabulary word
phrase_vec = m[phrase]    # raises KeyError: the whole phrase is not in the vocabulary

When I tried this code, I could get a vector for the single word 'cat', but not for 'cat-like mammal'. Is that because word2vec only provides vectors for single words like 'cat'? (If I'm wrong, please correct me.) So I searched around, found infer_vector(), and tried the code below:

# tokenize the phrase, then infer a new vector for the token list
phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)

When I tried this code, I could get a vector, but I get a different value every time I call phrase_vec = m.infer_vector(phrase), because infer_vector() has a 'steps' parameter.

When I set steps=0, I always get the same vector: phrase_vec = m.infer_vector(phrase, steps=0)
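To illustrate, this sketch (assuming m is the model loaded above; in gensim 4.0+ the parameter is named epochs rather than steps) shows the behaviour I see:

import numpy as np

tokens = 'cat like mammal'.lower().split()

# with the default steps, I get slightly different vectors on each call
v1 = m.infer_vector(tokens)
v2 = m.infer_vector(tokens)
print(np.allclose(v1, v2))  # usually False

# with steps=0, I get the same vector back every time
v3 = m.infer_vector(tokens, steps=0)
v4 = m.infer_vector(tokens, steps=0)
print(np.allclose(v3, v4))  # True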

However, I also found that a document vector can be obtained by averaging the word-vectors in the document: if the document is composed of the three words 'cat-like mammal', add the three vectors for 'cat', 'like', and 'mammal', and then average them, and that average would be the document vector. (If I'm wrong, please correct me.)
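In other words, something like this sketch (assuming every word is in the model's vocabulary; in gensim 4.0+ the word-vectors live under m.wv instead of directly on the model):

import numpy as np

words = ['cat', 'like', 'mammal']

# average the individual word-vectors into one candidate doc-vector
word_vecs = [m[w] for w in words]
avg_vec = np.mean(word_vecs, axis=0)
print(avg_vec.shape)  # same dimensionality as a single word-vector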

So here are some questions.

  1. Is using infer_vector() with steps=0 the right way to get a vector for a phrase?
  2. If averaging word-vectors is the right way to get a document vector, is there no need for infer_vector()?
  3. What is model.docvecs for?

1 Answer


Using 0 steps means no inference at all happens: the vector never moves from its starting position. (That starting position is seeded from the text itself, which is why it comes out identical every time, but it reflects nothing the model learned.) So you definitely don't want that. That the vectors for the same text vary a little each time you run infer_vector() is normal: the algorithm uses randomness. The important thing is that they're similar to each other, within a small tolerance. You can make them more similar (but still not identical) with a larger steps value.
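For example (a sketch assuming m is your loaded model, and the pre-4.0 gensim steps parameter; it's epochs in gensim 4.0+), repeated inferences should land closer together as steps grows:

import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

tokens = 'cat like mammal'.split()

few_a = m.infer_vector(tokens, steps=5)
few_b = m.infer_vector(tokens, steps=5)
many_a = m.infer_vector(tokens, steps=100)
many_b = m.infer_vector(tokens, steps=100)

# more steps -> repeated inferences agree more closely
print(cos_sim(few_a, few_b))
print(cos_sim(many_a, many_b))  # typically closer to 1.0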

See also the gensim FAQ entry about this non-determinism in Doc2Vec training and inference.

Averaging word-vectors together to get a doc-vector is one useful technique that can serve as a simple baseline for many purposes. But it's not the same as what Doc2Vec.infer_vector() does: inference iteratively adjusts a candidate vector to become better and better at predicting the text's words, just as Doc2Vec training does. For your doc-vector to be comparable to the doc-vectors created during model training, you should use infer_vector().
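Roughly, the contrast is (again a sketch; m[w] and steps become m.wv[w] and epochs in gensim 4.0+):

import numpy as np

tokens = ['cat', 'like', 'mammal']

# baseline: a plain average of word-vectors
avg_vec = np.mean([m[w] for w in tokens], axis=0)

# inference: iteratively adjusts a fresh candidate doc-vector to
# better predict the text's words, against the frozen model
inf_vec = m.infer_vector(tokens, steps=50)

# only inf_vec is in the same space as the doc-vectors learned during
# training, so it's the one to compare against model.docvecs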

The model.docvecs object holds all the doc-vectors learned during model training, for lookup by the tags given as their names during training, or for other operations, like finding the N most_similar() doc-vectors to a target tag/vector amongst those learned in training.
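For example (the tag name here is hypothetical; in gensim 4.0+, model.docvecs has become model.dv):

# look up a doc-vector learned during training, by its tag
vec = model.docvecs['doc_tag_17']  # hypothetical tag from training

# the 10 trained doc-vectors most similar to that tag
print(model.docvecs.most_similar('doc_tag_17', topn=10))

# or rank the trained doc-vectors against a freshly inferred vector
new_vec = model.infer_vector('cat like mammal'.split())
print(model.docvecs.most_similar([new_vec], topn=10))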