How to use pre-trained word embeddings at the sentence level?

Question

I am doing text classification using pre-trained GloVe word embeddings at the sentence level. I currently use a very naive approach: averaging the word vectors to represent each sentence.

My question: what if none of the words in a sentence appear in the pre-trained vocabulary? Should I ignore the sentence, or assign random values to its sentence vector? I cannot find a reference that deals with this problem; most papers only say that they averaged pre-trained word embeddings to generate sentence embeddings.

gojomo · Accepted Answer · 2017-06-12T18:50:56

If a sentence has no words about which you know anything, any classification attempt will be a random guess.

It's impossible for such no-information sentences to improve your classifier, so it is better to leave them out than to include them with totally random features.
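
As a rough illustration of the advice above, here is a minimal sketch that averages the vectors of known words and drops any sentence with no known words at all. The function names and the toy two-dimensional embedding table are made up for the example; real GloVe vectors would be loaded from a file into a similar word-to-vector mapping.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]


def average_embedding(tokens: Sequence[str],
                      embeddings: Dict[str, Vector]) -> Optional[Vector]:
    """Return the mean vector of the known tokens, or None if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # no information about this sentence
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def build_features(sentences: Sequence[Sequence[str]],
                   labels: Sequence[int],
                   embeddings: Dict[str, Vector]) -> Tuple[List[Vector], List[int]]:
    """Build (features, labels), skipping sentences with no known words."""
    features: List[Vector] = []
    kept_labels: List[int] = []
    for tokens, label in zip(sentences, labels):
        vec = average_embedding(tokens, embeddings)
        if vec is None:
            continue  # leave the sentence out instead of using random features
        features.append(vec)
        kept_labels.append(label)
    return features, kept_labels


# Toy data (hypothetical): a 2-dimensional "embedding" table.
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
toy_sentences = [["good", "movie"], ["zzz", "qqq"], ["good", "unknownword"]]
toy_labels = [1, 0, 1]

X, y = build_features(toy_sentences, toy_labels, toy_embeddings)
```

Here the second sentence is dropped because neither of its tokens is in the table, while the third is kept and represented by its one known word. At prediction time the same check tells you which inputs the classifier has no basis to judge, so you can fall back to a default such as the majority class.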

(There are some word-embedding techniques that can, for languages with subword morphemes, guess better-than-random word-vectors for previously-unknown words. See Facebook's 'FastText' tools, for example. But unless a large number of your texts are dominated by unknown words, you can probably defer investigation of such techniques until after validating that your general approach works on easier texts.)
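
To give a feel for the subword idea mentioned above, here is a simplified sketch, not FastText's actual implementation: a word is wrapped in boundary markers, split into character n-grams, and an unknown word's vector is guessed by averaging whatever n-gram vectors are available. The `ngram_vectors` table and both function names are hypothetical; in practice you would load a trained model with the FastText tools rather than build this by hand.

```python
from typing import Dict, List, Optional


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def guess_vector(word: str,
                 ngram_vectors: Dict[str, List[float]],
                 min_n: int = 3,
                 max_n: int = 6) -> Optional[List[float]]:
    """Average the vectors of the word's known n-grams, or None if none are known."""
    known = [ngram_vectors[g]
             for g in char_ngrams(word, min_n, max_n)
             if g in ngram_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy n-gram table (hypothetical values, 2 dimensions).
toy_ngram_vectors = {"whe": [1.0, 0.0], "ere": [0.0, 1.0]}
trigrams = char_ngrams("where", 3, 3)
guessed = guess_vector("where", toy_ngram_vectors, 3, 3)
```

With only trigrams, "where" splits into `<wh`, `whe`, `her`, `ere`, `re>`, and the guessed vector is the mean of the two n-grams present in the toy table.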
