0
votes

I am trying to do text classification using pre-trained GloVe word embeddings at the sentence level. I am currently using a very naive approach: averaging the word vectors to represent each sentence.

The question is: what should I do if none of the words in a sentence appear in the pre-trained vocabulary? Should I just ignore the sentence, or randomly assign some values to its sentence vector? I cannot find a reference that deals with this problem; most papers just say they averaged pre-trained word embeddings to generate the sentence embedding.


1 Answer

0
votes

If a sentence has no words about which you know anything, any classification attempt will be a random guess.

It's impossible for such no-information sentences to improve your classifier, so they are better to leave out than to include with totally random features.
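
As a minimal sketch of that idea, assuming the embeddings have already been loaded into a plain dict from token to vector (the `embeddings` table, its 3-dimensional values, and the helper names `sentence_vector` and `build_dataset` are all hypothetical, for illustration only), you could average over the known words and drop any sentence where nothing is known:

```python
from typing import Dict, List, Optional

# Toy stand-in for a loaded GloVe table (hypothetical values, 3 dimensions).
embeddings: Dict[str, List[float]] = {
    "good": [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}

def sentence_vector(tokens: List[str],
                    table: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of the known tokens; return None if none are known."""
    known = [table[t] for t in tokens if t in table]
    if not known:
        return None  # no information about this sentence
    # zip(*known) groups the values of each dimension together.
    return [sum(dim) / len(known) for dim in zip(*known)]

def build_dataset(sentences, labels, table):
    """Keep only sentences with at least one known word; record skipped indices."""
    features, kept_labels, skipped = [], [], []
    for i, (tokens, label) in enumerate(zip(sentences, labels)):
        vec = sentence_vector(tokens, table)
        if vec is None:
            skipped.append(i)
            continue
        features.append(vec)
        kept_labels.append(label)
    return features, kept_labels, skipped

sentences = [["good", "movie"], ["zxqv", "blorp"], ["good", "zxqv"]]
labels = [1, 0, 1]
features, kept_labels, skipped = build_dataset(sentences, labels, embeddings)
```

Keeping the `skipped` indices around lets you check how many sentences are affected, which tells you whether the unknown-word problem is big enough to matter for your data.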

(There are some word-embedding techniques that can, for languages with subword morphemes, guess better-than-random word-vectors for previously-unknown words. See Facebook's 'FastText' tools, for example. But unless a large number of your texts are dominated by unknown words, you can probably defer investigating such techniques until after you have validated that your general approach works on easier texts.)