I am trying to train a word2vec model using gensim. This is the line I am using:
model = Word2Vec(training_texts, size=50, window=5, min_count=1, workers=4, max_vocab_size=20000)
Here, training_texts is a list of lists of strings, one list of word tokens per sentence. The corpus I am using has 8,924,372 sentences, 141,985,244 total words, and 1,531,477 unique words. After training, only 15,642 words are present in the model:
len(list(model.wv.vocab))
# returns 15642
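For reference, here is a cut-down sketch of what I'm running. The toy sentences are just placeholders for my real corpus, and I'm on gensim 3.x, so size= and model.wv.vocab are the names valid for my version (gensim 4.x renamed these to vector_size= and model.wv.key_to_index):

from gensim.models import Word2Vec

# Placeholder corpus: each sentence is a list of word strings.
# My real training_texts has ~8.9M sentences built the same way.
training_texts = [
    ["the", "quick", "brown", "fox"],
    ["jumps", "over", "the", "lazy", "dog"],
]

# Same parameters as in my real run (gensim 3.x argument names).
model = Word2Vec(
    training_texts,
    size=50,               # dimensionality of the word vectors
    window=5,              # context window size
    min_count=1,           # no frequency cutoff, keep every word
    workers=4,             # parallel training threads
    max_vocab_size=20000,  # cap applied while building the vocabulary
)

# Count how many words ended up in the trained vocabulary.
print(len(model.wv.vocab))  # prints 15642 on my real corpus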
Shouldn't the model have 20,000 words, as specified by max_vocab_size? Why is it missing most of the training words?
Thanks!!