1
votes

I am trying to train a word2vec model using gensim. This is the line I am using:

model = Word2Vec(training_texts, size=50, window=5, min_count=1, workers=4, max_vocab_size=20000)

Here, training_texts is a list of lists of strings, where each string is one word. The corpus I am using has 8,924,372 sentences with 141,985,244 words, of which 1,531,477 are unique. After training, only 15,642 words are present in the model:

len(list(model.wv.vocab))
# returns 15642

Shouldn't the model have 20,000 words, as specified by max_vocab_size? Why is it missing most of the training words?

Thanks!!

2 Answers

0
votes

You can look at the unique words it discovered via model.wv.vocab.keys() or model.wv.index2entity.

Are they the words you expected? Can you list a word that you are sure you provided in training_texts that isn't there?

Note that training_texts should be a sequence of lists of string tokens. If you are only providing a sequence of strings, it will treat each character of a string as a word, and only model those single-character "words". (With texts in Latin-alphabet languages, this usually means just a few dozen "words", but if your texts include characters from other writing systems, I suppose you could wind up with a count of 15,642 unique single-character words.)
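To make the difference concrete, here is a minimal sketch in plain Python. It does not call gensim; count_tokens is a hypothetical helper that only mimics the general idea of iterating over each sentence and counting whatever comes out, and the two tiny corpora are made up for illustration.

```python
from collections import Counter


def count_tokens(sentences):
    """Count every item produced by iterating over each sentence.

    Hypothetical helper for illustration only, not gensim's actual code.
    """
    counts = Counter()
    for sentence in sentences:
        # Iterating over a list yields its words;
        # iterating over a str yields its individual characters.
        counts.update(sentence)
    return counts


# Correct input: each sentence is a list of word tokens.
tokenized = [["the", "king", "rules"], ["the", "queen", "rules"]]

# Problematic input: each sentence is one plain string.
untokenized = ["the king rules", "the queen rules"]

word_vocab = count_tokens(tokenized)
char_vocab = count_tokens(untokenized)

print(sorted(word_vocab))  # whole words
print(sorted(char_vocab))  # single characters, including the space
```

If the second printout looks like what model.wv.vocab.keys() gives you, the input was probably not tokenized.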

0
votes

The words that appear in the model are OK, and they capture some of the usual relations (king - boy + girl = queen). But I have identified words that appear several times in the corpus and are not in the model. I do not think it has anything to do with how I am passing the data, but rather with some parameter I am missing.
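For reference, the check I mean can be sketched roughly like this. missing_words is a hypothetical helper, and the small corpus and model_vocab set are made-up stand-ins; with the real model, the vocabulary set would presumably come from set(model.wv.vocab).

```python
from collections import Counter


def missing_words(training_texts, model_vocab, min_occurrences=2):
    """Return {word: corpus count} for words that occur at least
    min_occurrences times in the corpus but are absent from model_vocab.

    Hypothetical helper for illustration only.
    """
    counts = Counter(word for sentence in training_texts for word in sentence)
    return {
        word: count
        for word, count in counts.items()
        if count >= min_occurrences and word not in model_vocab
    }


# Made-up stand-in for the real corpus (a list of lists of word strings).
training_texts = [
    ["king", "queen", "boy"],
    ["king", "girl", "boy"],
    ["queen", "girl", "rare"],
]

# Made-up stand-in for the trained vocabulary;
# with the real model, assume something like set(model.wv.vocab).
model_vocab = {"king", "queen", "rare"}

report = missing_words(training_texts, model_vocab)
for word, count in sorted(report.items()):
    print(word, count)
```

Any word listed here occurs repeatedly in the training data yet has no vector, which is the situation I am describing.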