
I want to build a seq2seq chatbot with a pre-trained embedding matrix. Do pre-trained embedding matrices such as GoogleNews-vectors-negative300, FastText, and GloVe have specific word vectors for <EOS> and <UNK>?

1 Answer


A pre-trained embedding comes with a fixed, predefined vocabulary. Words that are not in that vocabulary are called OOV (out-of-vocabulary) words. The pre-trained embedding matrix will not provide a vector for <UNK>. There are several ways to deal with UNK words:

  1. Ignore the UNK words.
  2. Use a random vector.
  3. Use FastText as the pre-trained model, because it solves the OOV problem by constructing a vector for an unknown word from the n-gram vectors of the character n-grams that constitute it.

If the number of UNK words is low, accuracy won't be affected much. If it is high, it is better to train your own embeddings or use FastText.
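Here is a minimal sketch of option 3, using gensim's FastText implementation (assuming gensim 4.x; the toy corpus and all names here are illustrative, not from the original post):

    # Sketch: FastText builds a vector for an out-of-vocabulary word
    # from the character n-grams it shares with in-vocabulary words.
    from gensim.models import FastText

    sentences = [["hello", "how", "are", "you"],
                 ["the", "chatbot", "replies", "politely"]]

    # min_n/max_n set the character n-gram range used for subwords.
    model = FastText(sentences, vector_size=100, min_n=3, max_n=5, min_count=1)

    # "chatbots" never appeared in the corpus, yet FastText can still
    # construct a vector for it from the n-grams it shares with "chatbot".
    oov_vector = model.wv["chatbots"]
    print(oov_vector.shape)  # (100,)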

"EOS" Token can also be taken (initialized) as a random vector.

Make sure the both random vectors are not the same.
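A minimal sketch of that initialization in NumPy (the dimension, scale, and the stand-in pre-trained matrix are assumptions for illustration; in practice you would load the real pre-trained vectors):

    # Sketch: give <UNK> and <EOS> distinct random vectors and append
    # them to the pre-trained embedding matrix.
    import numpy as np

    embedding_dim = 300  # e.g. GoogleNews-vectors-negative300

    # Independent draws make an accidental collision practically impossible,
    # but the assert makes the requirement explicit.
    rng = np.random.default_rng(seed=42)
    unk_vector = rng.normal(scale=0.1, size=embedding_dim)
    eos_vector = rng.normal(scale=0.1, size=embedding_dim)
    assert not np.allclose(unk_vector, eos_vector), "special tokens must differ"

    # Stand-in for the real pre-trained matrix of shape (vocab_size, 300).
    pretrained_vectors = rng.normal(scale=0.1, size=(5, embedding_dim))
    embedding_matrix = np.vstack([pretrained_vectors, unk_vector, eos_vector])
    # Lookup indices: vocab_size -> <UNK>, vocab_size + 1 -> <EOS>.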