I try to map sentences to a vector in order to make sentences comparable to each other. To test gensim's Doc2Vec model, I downloaded sklearn's newsgroup dataset and trained the model on it.

In order to compare two sentences, I use model.infer_vector(), and I am wondering why two calls with the same sentence give me different vectors:

model = Doc2Vec(vector_size=100, window=8, min_count=5, workers=6)
model.build_vocab(documents)

epochs=10
for epoch in range(epochs):
    print("Training epoch %d" % (epoch+1))
    model.train(documents,  total_examples=len(documents), epochs=epochs)

    v1 = model.infer_vector("I feel good")
    v2 = model.infer_vector("I feel good")
    print(np.linalg.norm(v1-v2)) 

Output:

Training epoch 1
0.41606528
Training epoch 2
0.43440753
Training epoch 3
0.3203116
Training epoch 4
0.3039317
Training epoch 5
0.68224543
Training epoch 6
0.5862567
Training epoch 7
0.5424634
Training epoch 8
0.7618142
Training epoch 9
0.8170159
Training epoch 10
0.6028216

If I set alpha and min_alpha to 0, I get consistent vectors for "I feel fine" and "I feel good", but the model gives me the same vector in every epoch, so it does not seem to learn anything:

Training epoch 1
0.043668125
Training epoch 2
0.043668125
Training epoch 3
0.043668125
Training epoch 4
0.043668125
Training epoch 5
0.043668125
Training epoch 6
0.043668125
Training epoch 7
0.043668125
Training epoch 8
0.043668125
Training epoch 9
0.043668125
Training epoch 10
0.043668125

So my questions are:

  1. Why do I even have the possibility to specify a learning rate for inference? I would expect that the model is only changed during training and not during inference.

  2. If I specify alpha=0 for inference, why does the distance between those two vectors not change during different epochs?

1 Answer

Inference uses an alpha because it is the same iterative adjustment process as training, just limited to updating the one new vector for the one new text example.

So yes, the model's various weights are frozen. But the one new vector's weights (dimensions) start at small random values, just as every other vector also began, and then get incrementally nudged over multiple training cycles to make the vector work better as a doc-vector for predicting the text's words. Then the final new-vector is returned.

Those nudges begin at the larger starting alpha value, and wind up as the negligible min_alpha. With an alpha at 0.0, no training/inference can happen, because every nudge-correction to the updatable weights is multiplied by 0.0 before it's applied, meaning no change happens.
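
The role of alpha in those nudges can be shown with a toy update loop. This is only a sketch, not gensim's actual code: the helper names, the fixed `target` that the vector is pulled toward, and the one-alpha-per-pass linear decay are all assumptions made for illustration.

```python
def decayed_alphas(alpha, min_alpha, passes):
    """One learning rate per pass, decaying linearly from alpha to min_alpha."""
    if passes == 1:
        return [alpha]
    step = (alpha - min_alpha) / (passes - 1)
    return [alpha - i * step for i in range(passes)]


def nudge(vector, target, alpha):
    """Move each dimension toward the target by a fraction alpha of the error."""
    return [v + alpha * (t - v) for v, t in zip(vector, target)]


def toy_infer(start, target, alpha, min_alpha, passes):
    """Apply one nudge per pass (a hypothetical stand-in for real inference)."""
    vector = list(start)
    for a in decayed_alphas(alpha, min_alpha, passes):
        vector = nudge(vector, target, a)
    return vector


start = [0.01, -0.02, 0.005]  # small initial values (made up)
target = [0.5, 0.3, -0.4]     # hypothetical "ideal" doc-vector

moved = toy_infer(start, target, alpha=0.1, min_alpha=0.0001, passes=5)
frozen = toy_infer(start, target, alpha=0.0, min_alpha=0.0, passes=5)

print(moved)   # has drifted toward target
print(frozen)  # identical to start: every nudge was multiplied by 0.0
```

With alpha at 0.0 the vector never leaves its starting values, which is why the distance in the question stays constant.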

Separate from that, your code has a number of problems that may prevent desirable results:

  • By calling train() epochs times in a loop, and also passing a value larger than 1 for its epochs argument, you're actually performing epochs * epochs total training passes (here, 10 * 10 = 100).

  • Further, because alpha and min_alpha are left unspecified, every call to train() decays the effective alpha from its high starting value to its low ending value all over again – a sawtooth pattern that's not proper for this kind of stochastic gradient descent optimization. (There should be a warning in your logs about this error.)
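
Both problems become visible when the learning rate used on each pass is listed out. The sketch below simplifies the behaviour to one alpha value per pass (an assumption; the library adjusts alpha gradually as training progresses) and assumes a starting alpha of 0.025 and a min_alpha of 0.0001. The helper names are hypothetical.

```python
def linear_decay(start, end, passes):
    """One learning rate per pass, decaying linearly from start to end."""
    if passes == 1:
        return [start]
    step = (start - end) / (passes - 1)
    return [start - i * step for i in range(passes)]


def count_resets(alphas):
    """Count how often the learning rate jumps back up between passes."""
    return sum(1 for prev, cur in zip(alphas, alphas[1:]) if cur > prev)


EPOCHS = 10
ALPHA, MIN_ALPHA = 0.025, 0.0001  # assumed starting and ending rates

# The question's pattern: train(..., epochs=EPOCHS) called EPOCHS times.
looped = []
for _ in range(EPOCHS):
    looped.extend(linear_decay(ALPHA, MIN_ALPHA, EPOCHS))

# The intended pattern: a single train(..., epochs=EPOCHS) call.
single = linear_decay(ALPHA, MIN_ALPHA, EPOCHS)

print(len(looped), count_resets(looped))  # 100 9
print(len(single), count_resets(single))  # 10 0
```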

It's rare to need to call train() multiple times in a loop. Just call it once, with the right epochs value, and it will do the right thing: that many passes, with a smoothly-decaying alpha learning-rate.
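
A single-call version of the training step might look like the sketch below. The helper name `train_once` is hypothetical; `model` is assumed to be a gensim Doc2Vec instance and `documents` the same tagged training corpus used in the question.

```python
def train_once(model, documents, epochs=10):
    """Build the vocabulary, then train with a single train() call."""
    model.build_vocab(documents)
    # One call: `epochs` passes over the data, with alpha decaying
    # smoothly from its starting value to min_alpha across all of them.
    model.train(documents, total_examples=model.corpus_count, epochs=epochs)
    return model
```

With the question's setup, this would be called as train_once(Doc2Vec(vector_size=100, window=8, min_count=5, workers=6), documents, epochs=10).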

Separately, when calling infer_vector():

  • it needs a list-of-tokens, just like the words property of the training examples that were items in documents, not a string. (By supplying a string, it looks like a list-of-characters, so it will be inferring a doc-vector for the document ['I', ' ', 'f', 'e', 'e', 'l', ' ', 'g', 'o', 'o', 'd'] not ['I', 'feel', 'good'].)

  • those tokens should be preprocessed the same as the training documents – for example if they were lowercased there, they should be lowercased before passing to infer_vector()

  • the default argument steps=5 (named epochs in later gensim releases) is very small, especially for short texts – many report better results with a value in the tens or hundreds

  • the default argument alpha=0.1 is somewhat large compared to the training default 0.025; using the training value (especially with more passes) often gives better results
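
Put together, the points above suggest an inference call along these lines. This is a sketch under assumptions: `preprocess` and `infer` are hypothetical helpers, the lowercase-and-split tokenization only applies if the training documents were prepared the same way, and the pass-count keyword is `epochs` in recent gensim releases but `steps` in older ones.

```python
def preprocess(text):
    """Lowercase and split on whitespace (assumed to match the training prep)."""
    return text.lower().split()


def infer(model, text, epochs=50, alpha=0.025):
    """Infer a doc-vector from a token list, with more passes and a lower alpha."""
    tokens = preprocess(text)
    # Use steps=epochs instead if your gensim version expects `steps`.
    return model.infer_vector(tokens, alpha=alpha, epochs=epochs)


print(list("I feel good"))        # what a raw string looks like: single characters
print(preprocess("I feel good"))  # ['i', 'feel', 'good']
```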

Finally, just like the algorithm during training makes use of randomization (to adjust word-prediction context windows, or randomly-sample negative examples, or randomly down-sample highly-frequent words), the inference does as well. So even supplying the exact same tokens won't automatically yield the exact same inferred-vector.

However, if the model has been sufficiently-trained, and the inference is adjusted as above for better results, the vectors for the same text should be very, very close. And because this is a randomized algorithm with some inherent 'jitter' between runs, it's best to make your downstream evaluations and uses tolerant to such small variances. (And, if you're instead seeing large variances, correct other model/inference issues, usually with more data or other parameter adjustments.)
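
One way to make a downstream check tolerant of that jitter is to compare vectors by similarity against a threshold rather than testing for equality. The sketch below uses plain Python lists; the helper names, the example vectors, and the 0.95 threshold are made up for illustration and would need tuning for a real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def roughly_same(a, b, threshold=0.95):
    """Treat two vectors as equivalent if they point in nearly the same direction."""
    return cosine_similarity(a, b) >= threshold


# Two hypothetical inferred vectors for the same text, differing by small jitter.
v1 = [0.21, -0.40, 0.33]
v2 = [0.20, -0.41, 0.35]

print(roughly_same(v1, v2))                  # True
print(roughly_same([1.0, 0.0], [0.0, 1.0]))  # False
```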

If you want to force determinism, there's some discussion of how to do that in a gensim project issue. But, understanding & tolerating the small variances is often more consistent with the choice of such a randomly-influenced algorithm.
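
One approach that is sometimes suggested is to re-seed the model's random generator immediately before every inference call. The sketch below assumes the model exposes that generator as `model.random` with a `seed()` method; whether this matches the discussion mentioned above, and whether it is sufficient for a given gensim version, is an assumption to verify. The helper name is hypothetical.

```python
def infer_reproducibly(model, tokens, seed=0, **infer_kwargs):
    """Reset the model's random state, then infer a vector for `tokens`."""
    # Assumption: `model.random` is the generator consulted during inference.
    model.random.seed(seed)
    return model.infer_vector(tokens, **infer_kwargs)
```

Because the generator is reset on each call, this trades away the randomness the algorithm normally relies on, so it is probably best kept for tests and debugging.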