
I was training a word2vec model (skip-gram) with a vocabulary size of 100,000. At test time I encountered a few words that were not in the vocabulary. To find embeddings for them I tried two approaches:

  1. Find the vocabulary word with the minimum edit distance to the unknown word and use its embedding.

  2. Construct character n-grams from the unknown word and look them up in the vocabulary (a sketch of both fallbacks follows this list).
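For concreteness, here is a minimal sketch of both fallbacks in plain Python/NumPy. The `vocab_vectors` dict (word to trained vector) is a hypothetical stand-in for whatever lookup the trained model exposes:

```python
import numpy as np

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def embed_by_edit_distance(word, vocab_vectors):
    """Approach 1: embedding of the closest in-vocabulary word."""
    nearest = min(vocab_vectors, key=lambda w: edit_distance(word, w))
    return vocab_vectors[nearest]

def embed_by_ngrams(word, vocab_vectors, n=3):
    """Approach 2: average the embeddings of the word's character n-grams
    that happen to exist in the vocabulary (None if none of them do)."""
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    hits = [vocab_vectors[g] for g in grams if g in vocab_vectors]
    return np.mean(hits, axis=0) if hits else None
```

Approach 2 only helps when the n-grams themselves were frequent enough to appear as vocabulary entries, which is rarely the case for a plain word2vec vocabulary; that limitation is exactly what motivates the question below.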

Despite applying these methods, I am still not able to completely solve the out-of-vocabulary problem.

Does word2vec take the character n-grams of a word into account during training, the way fastText does?

Note: in fastText, if the input word is quora, the model considers all of the word's character n-grams (within the configured length range) when building its vector.

https://www.quora.com/How-does-fastText-output-a-vector-for-a-word-that-is-not-in-the-pre-trained-model
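As a hedged illustration of that behaviour, here is a small sketch using gensim's FastText implementation (parameter names assume the gensim 4.x API). Because fastText represents a word as the sum of its character n-gram vectors, it can compose a vector even for a word it never saw during training:

```python
from gensim.models import FastText

# Toy corpus; any tokenized sentences would do.
sentences = [["ask", "questions", "on", "quora"],
             ["word", "embeddings", "for", "rare", "words"]]

model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=6, epochs=10)

print("quora" in model.wv.key_to_index)   # True: seen during training
print("quorum" in model.wv.key_to_index)  # False: out of vocabulary
vec = model.wv["quorum"]                  # still works, composed from shared char n-grams
print(vec.shape)                          # (50,)
```

Plain word2vec has no such mechanism: each word is an atomic unit, so an unseen word simply has no vector.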


1 Answer


I would assume that the out-of-vocabulary words in your case are very rare ones. One possibility is to use a designated symbol (or another very rare word) as a sentinel for such out-of-vocabulary words, hashing them all to that single token. This requires preprocessing such words, but it should be good enough in a practical application.
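A minimal sketch of this sentinel idea, assuming you can preprocess the training corpus and that an explicit `<unk>` token serves as the designated symbol (both the token name and the helper functions are illustrative, not part of any library API):

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(sentences, min_count=5):
    """Keep only words frequent enough to get their own embedding."""
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def replace_rare(sentences, vocab):
    """Rewrite rare words to the sentinel before training word2vec,
    so the sentinel itself receives a trained vector."""
    return [[w if w in vocab else UNK for w in sent] for sent in sentences]

def lookup(word, vocab, wv):
    """At test time, any unseen word falls back to the sentinel's vector."""
    return wv[word if word in vocab else UNK]
```

The trade-off is that all out-of-vocabulary words share one vector, so this handles them gracefully rather than capturing their individual meaning; if that matters, a subword-aware model such as fastText is the more principled fix.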