I am currently working in Python, training a Word2Vec model on sentences that I provide. I then save and load the model to get the word embedding of every word in the sentences that were used for training. However, I get the following error.
KeyError: "word 'n1985_chicago_bears' not in vocabulary"
even though one of the sentences provided during training is as follows:
sportsteam n1985_chicago_bears teamplaysincity city chicago
Hence I would like to know why some words are missing from the vocabulary, despite the model having been trained on a corpus that contains them.
Training the Word2Vec model on my own corpus
import nltk
import numpy as np
from termcolor import colored
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA
#PREPARING DATA
fname = '../data/sentences.txt'
with open(fname) as f:
    content = f.readlines()
# remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]
#TOKENIZING SENTENCES
sentences = []
for x in content:
    nltk_tokens = nltk.word_tokenize(x)
    sentences.append(nltk_tokens)
#TRAINING THE WORD2VEC MODEL
model = Word2Vec(sentences)
words = list(model.wv.vocab)
model.wv.save_word2vec_format('model.bin')
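For reference, here is a quick check I can run right after training, before any save or load, to see whether the word makes it into the in-memory vocabulary at all (just a diagnostic sketch, assuming the same gensim 3.x vocab attribute used above):
# Diagnostic: is the word present in the freshly trained model's vocabulary?
print(len(words))                               # size of the trained vocabulary
print('n1985_chicago_bears' in model.wv.vocab)  # membership check before saving/loading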
Sample sentences from sentences.txt
sportsteam hawks teamplaysincity city atlanta
stadiumoreventvenue honda_center stadiumlocatedincity city anaheim
sportsteam ducks teamplaysincity city anaheim
sportsteam n1985_chicago_bears teamplaysincity city chicago
stadiumoreventvenue philips_arena stadiumlocatedincity city atlanta
stadiumoreventvenue united_center stadiumlocatedincity city chicago
...
There are 1860 such lines in the sentences.txt file, each containing exactly 5 words and no stop words.
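In case word frequency matters here (gensim's Word2Vec drops words that occur fewer than min_count times, with a default of 5), this is a small diagnostic sketch I could run on the tokenized sentences list from the training script above to count how often each token appears:
from collections import Counter
# Diagnostic: count how often each token occurs in the tokenized corpus
token_counts = Counter(token for sentence in sentences for token in sentence)
print(token_counts['n1985_chicago_bears'])  # frequency of the word that raises the KeyError
print(token_counts.most_common(10))         # most frequent tokens, for comparison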
After saving the model, I tried to load it from a different Python file in the same directory as the saved model.bin, as shown below.
Loading the saved model.bin
import nltk
import numpy as np
from gensim import models
w = models.KeyedVectors.load_word2vec_format('model.bin', binary=True)
print(w['n1985_chicago_bears'])
However, I end up with the following error:
KeyError: "word 'n1985_chicago_bears' not in vocabulary"
Is there a way to get the word embedding of every word in the training corpus using this method?
Any suggestions in this regard will be much appreciated.