0
votes

Suppose I have a corpus of short sentences whose word counts range from 1 to around 500, with an average of around 9 words. If I train a Gensim Word2Vec model using window=5 (the default), should I use all of the sentences, or should I remove sentences with a low word count? If so, is there a rule of thumb for the minimum number of words?

1 Answer

1
votes

Texts with only 1 word are essentially 'empty' to the word2vec algorithm: there are no neighboring words, which are necessary for all training modes. You could drop them, but there's little harm in leaving them in, either. They're essentially just no-ops.

Any text with 2 or more words can contribute to the training.
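
To make that concrete, here is a rough sketch in plain Python (not Gensim's actual implementation) that counts the (center, context) pairs each text could offer at a full window of 5. The helper `context_pairs` and the toy `corpus` are made up for illustration. Real training typically sees fewer pairs than this, because Gensim by default shrinks the effective window at random and discards words rarer than `min_count` before training.

```python
def context_pairs(tokens, window=5):
    """Yield (center, context) pairs for a full, unshrunk window.

    Hypothetical helper for illustration only; it ignores Gensim's
    random window shrinking, frequent-word subsampling, and min_count.
    """
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


# Toy corpus (assumed data): each text is a list of tokens.
corpus = [
    ["hello"],
    ["good", "morning"],
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
]

# Upper bound on training pairs per text; the 1-word text yields none.
pair_counts = [len(list(context_pairs(text))) for text in corpus]

# Optional filter: dropping 1-word texts removes only the no-ops.
trainable = [text for text in corpus if len(text) >= 2]

for text, count in zip(corpus, pair_counts):
    print(len(text), "tokens:", count, "pairs")
```

The filter is optional: as noted above, leaving the 1-word texts in costs essentially nothing, so there is no need for a higher minimum than 2.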