
I was training a word2vec model (skip-gram) with a vocabulary size of 100,000. At test time I encountered a few words that were not in the vocabulary. To find embeddings for them I tried two approaches:

  1. Find the vocabulary word with the minimum edit distance to the unknown word and use its embedding.

  2. Construct character n-grams from the unknown word and look them up in the vocabulary (a sketch of both fallbacks follows this list).
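For concreteness, here is a minimal sketch of both fallbacks in plain Python/NumPy. The `vocab_vectors` dict (word to trained vector) is a hypothetical stand-in for whatever lookup the trained model exposes:

```python
import numpy as np

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def embed_by_edit_distance(word, vocab_vectors):
    """Approach 1: embedding of the closest in-vocabulary word."""
    nearest = min(vocab_vectors, key=lambda w: edit_distance(word, w))
    return vocab_vectors[nearest]

def embed_by_ngrams(word, vocab_vectors, n=3):
    """Approach 2: average the embeddings of the word's character n-grams
    that happen to exist in the vocabulary (None if none of them do)."""
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    hits = [vocab_vectors[g] for g in grams if g in vocab_vectors]
    return np.mean(hits, axis=0) if hits else None
```

Approach 2 only helps when the n-grams themselves were frequent enough to appear as vocabulary entries, which is rarely the case for a plain word2vec vocabulary; that limitation is exactly what motivates the question below.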

Despite applying these methods, I am still not able to completely solve the out-of-vocabulary problem.

Does word2vec take the character n-grams of a word into account during training, the way fastText does?

Note: in fastText, if the input word is quora, the model considers all of the word's character n-grams (within the configured length range) when building its vector.

https://www.quora.com/How-does-fastText-output-a-vector-for-a-word-that-is-not-in-the-pre-trained-model
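As a hedged illustration of that behaviour, here is a small sketch using gensim's FastText implementation (parameter names assume the gensim 4.x API). Because fastText represents a word as the sum of its character n-gram vectors, it can compose a vector even for a word it never saw during training:

```python
from gensim.models import FastText

# Toy corpus; any tokenized sentences would do.
sentences = [["ask", "questions", "on", "quora"],
             ["word", "embeddings", "for", "rare", "words"]]

model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=6, epochs=10)

print("quora" in model.wv.key_to_index)   # True: seen during training
print("quorum" in model.wv.key_to_index)  # False: out of vocabulary
vec = model.wv["quorum"]                  # still works, composed from shared char n-grams
print(vec.shape)                          # (50,)
```

Plain word2vec has no such mechanism: each word is an atomic unit, so an unseen word simply has no vector.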


1 Answer


I would assume that the out-of-vocabulary words in your case are very rare ones. One possibility is to use a designated symbol (or another very rare word) as a sentinel for such out-of-vocabulary words, hashing them all to that single token. This requires preprocessing such words, but it should be good enough in a practical application.
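A minimal sketch of this sentinel idea, assuming you can preprocess the training corpus and that an explicit `<unk>` token serves as the designated symbol (both the token name and the helper functions are illustrative, not part of any library API):

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(sentences, min_count=5):
    """Keep only words frequent enough to get their own embedding."""
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def replace_rare(sentences, vocab):
    """Rewrite rare words to the sentinel before training word2vec,
    so the sentinel itself receives a trained vector."""
    return [[w if w in vocab else UNK for w in sent] for sent in sentences]

def lookup(word, vocab, wv):
    """At test time, any unseen word falls back to the sentinel's vector."""
    return wv[word if word in vocab else UNK]
```

The trade-off is that all out-of-vocabulary words share one vector, so this handles them gracefully rather than capturing their individual meaning; if that matters, a subword-aware model such as fastText is the more principled fix.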