I want to build a seq2seq chatbot with a pre-trained embedding matrix. Do pre-trained embedding matrices such as GoogleNews-vectors-negative300, fastText, and GloVe have specific word vectors for <EOS> and <UNK>?
1 Answer
A pre-trained embedding comes with a fixed vocabulary. Words that are not in that vocabulary are called out-of-vocabulary (OOV) words. The pre-trained embedding matrix will generally not provide an embedding for UNK, so you have to handle it yourself. There are several ways to deal with UNK words:
- Ignore the UNK word
- Use some random vector
- Use fastText as the pre-trained model, because it addresses the OOV problem by constructing a vector for an unknown word from the vectors of the character n-grams that make up the word.
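To make the n-gram idea in the last option concrete, here is a toy sketch in plain NumPy. It is not the fastText library API: `char_ngrams`, `oov_vector`, and the `ngram_vectors` table are hypothetical names, and the vectors are random placeholders. It only assumes the general scheme of wrapping a word in `<` and `>` boundary markers, extracting character n-grams (lengths 3 to 6 here), and averaging the vectors of the n-grams that are known.

```python
import numpy as np


def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the known n-grams; return zeros if none are known."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy n-gram table with random values (hypothetical, for illustration only).
rng = np.random.default_rng(0)
dim = 4
ngram_vectors = {g: rng.normal(size=dim) for g in char_ngrams("chatbot")}

# "chatbots" is not a known word here, but it shares n-grams with "chatbot".
vec = oov_vector("chatbots", ngram_vectors, dim)
```

Because "chatbots" shares most of its n-grams with "chatbot", it still gets a non-zero vector even though it was never seen as a whole word.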
If the number of UNK words is low, accuracy will not be affected much. If it is high, it is better to train your own embeddings or use fastText.
The "EOS" token can also be initialized as a random vector.
Make sure the two random vectors (for UNK and EOS) are not the same.
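A minimal sketch of the random-vector approach, assuming the pre-trained vectors are already loaded into a NumPy array. The `pretrained` matrix below is a random stand-in, and `vocab`, `word_to_index`, and `lookup` are hypothetical names. Matching the scale of the random vectors to the standard deviation of the pre-trained matrix is one reasonable choice, not a requirement.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a loaded pre-trained matrix: 5 words, 300 dimensions.
vocab = ["hello", "how", "are", "you", "today"]
pretrained = rng.normal(scale=0.1, size=(len(vocab), 300))

# Draw a separate random vector for each special token,
# scaled to roughly match the spread of the pre-trained values.
special_tokens = ["<UNK>", "<EOS>"]
special_vectors = rng.normal(
    scale=pretrained.std(),
    size=(len(special_tokens), pretrained.shape[1]),
)

# Append the special-token rows to the end of the matrix.
embedding_matrix = np.vstack([pretrained, special_vectors])
word_to_index = {w: i for i, w in enumerate(vocab + special_tokens)}


def lookup(word):
    """Return the embedding row for a word, falling back to <UNK>."""
    return embedding_matrix[word_to_index.get(word, word_to_index["<UNK>"])]
```

Drawing both rows from one call yields two independent samples, so the UNK and EOS vectors differ. The resulting `embedding_matrix` can then be used to initialize the embedding layer of the seq2seq model.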