I am currently working in Python, training a Word2Vec model on sentences that I provide. I then save and load the model to get the word embedding of every word in the sentences that were used for training. However, I get the following error.
KeyError: "word 'n1985_chicago_bears' not in vocabulary"
even though one of the sentences provided during training is as follows:
sportsteam n1985_chicago_bears teamplaysincity city chicago
Hence I would like to know why some words are missing from the vocabulary, despite the model having been trained on a corpus that contains them.
Training the Word2Vec model on my own corpus
import nltk
import numpy as np
from termcolor import colored
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA
#PREPARING DATA
fname = '../data/sentences.txt'
with open(fname) as f:
    content = f.readlines()
# remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]
#TOKENIZING SENTENCES
sentences = []
for x in content:
    nltk_tokens = nltk.word_tokenize(x)
    sentences.append(nltk_tokens)
#TRAINING THE WORD2VEC MODEL
model = Word2Vec(sentences)
words = list(model.wv.vocab)
model.wv.save_word2vec_format('model.bin')
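For reference, here is a quick check I can run right after training, before any save or load, to see whether the word makes it into the in-memory vocabulary at all (just a diagnostic sketch, assuming the same gensim 3.x vocab attribute used above):
# Diagnostic: is the word present in the freshly trained model's vocabulary?
print(len(words))                               # size of the trained vocabulary
print('n1985_chicago_bears' in model.wv.vocab)  # membership check before saving/loading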
Sample sentences from sentences.txt
sportsteam hawks teamplaysincity city atlanta
stadiumoreventvenue honda_center stadiumlocatedincity city anaheim
sportsteam ducks teamplaysincity city anaheim
sportsteam n1985_chicago_bears teamplaysincity city chicago
stadiumoreventvenue philips_arena stadiumlocatedincity city atlanta
stadiumoreventvenue united_center stadiumlocatedincity city chicago
...
There are 1860 such lines in the sentences.txt file, each containing exactly 5 words and no stop words.
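In case word frequency matters here (gensim's Word2Vec drops words that occur fewer than min_count times, with a default of 5), this is a small diagnostic sketch I could run on the tokenized sentences list from the training script above to count how often each token appears:
from collections import Counter
# Diagnostic: count how often each token occurs in the tokenized corpus
token_counts = Counter(token for sentence in sentences for token in sentence)
print(token_counts['n1985_chicago_bears'])  # frequency of the word that raises the KeyError
print(token_counts.most_common(10))         # most frequent tokens, for comparison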
After saving the model, I tried to load it from a different Python file in the same directory as the saved model.bin, as shown below.
Loading the saved model.bin
import nltk
import numpy as np
from gensim import models
w = models.KeyedVectors.load_word2vec_format('model.bin', binary=True)
print(w['n1985_chicago_bears'])
However, I end up with the following error:
KeyError: "word 'n1985_chicago_bears' not in vocabulary"
Is there a way to get the word embedding of every word in the training corpus using this method?
Any suggestions in this regard will be much appreciated.