I am trying to train a word2vec model using gensim. This is the line I am using:
model = Word2Vec(training_texts, size=50, window=5, min_count=1, workers=4, max_vocab_size=20000)
Here, training_texts is a list of lists of strings, one list of word tokens per sentence. The corpus I am using has 8,924,372 sentences, 141,985,244 total words, and 1,531,477 unique words. After training, only 15,642 words are present in the model:
len(list(model.wv.vocab))
# returns 15642
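For reference, here is a cut-down sketch of what I'm running. The toy sentences are just placeholders for my real corpus, and I'm on gensim 3.x, so size= and model.wv.vocab are the names valid for my version (gensim 4.x renamed these to vector_size= and model.wv.key_to_index):

from gensim.models import Word2Vec

# Placeholder corpus: each sentence is a list of word strings.
# My real training_texts has ~8.9M sentences built the same way.
training_texts = [
    ["the", "quick", "brown", "fox"],
    ["jumps", "over", "the", "lazy", "dog"],
]

# Same parameters as in my real run (gensim 3.x argument names).
model = Word2Vec(
    training_texts,
    size=50,               # dimensionality of the word vectors
    window=5,              # context window size
    min_count=1,           # no frequency cutoff, keep every word
    workers=4,             # parallel training threads
    max_vocab_size=20000,  # cap applied while building the vocabulary
)

# Count how many words ended up in the trained vocabulary.
print(len(model.wv.vocab))  # prints 15642 on my real corpus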
Shouldn't the model have 20,000 words, as specified by max_vocab_size? Why is it missing most of the training words?
Thanks!!