I want to get word embeddings for the words in a corpus. I decided to use the pretrained GoogleNews word vectors via the gensim library. But my corpus contains some words that are not in the GoogleNews vocabulary. For each missing word, I want to use the arithmetic mean of the embeddings of its n most similar words in GoogleNews. First I load GoogleNews and check whether the word "to" is in it:
```python
from gensim.models import KeyedVectors

# Load GoogleNews pretrained word2vec model
model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(model["to"])
```
I receive an error: `KeyError: "word 'to' not in vocabulary"`.
Is it possible that such a large dataset doesn't have this word? The same happens for some other common words like "a"!
To add the missing words to the word2vec model, I first get the indices of the words that are in GoogleNews. For missing words I use index 0.
```python
# Obtain indices of words; missing words default to index 0
word_to_idx = OrderedDict({w: 0 for w in corpus_words})
word_to_idx = OrderedDict({w: model.wv.vocab[w].index for w in corpus_words if w in model.wv.vocab})
```
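To make the two-step construction concrete, here is a small self-contained sketch with plain Python dicts standing in for the gensim vocabulary (the words and indices are hypothetical):

```python
from collections import OrderedDict

corpus_words = ["cat", "dog", "zzyzx"]   # "zzyzx" stands in for a word missing from the model
model_vocab = {"cat": 7, "dog": 12}      # word -> index, as in model.wv.vocab

# Step 1: default every corpus word to index 0
word_to_idx = OrderedDict({w: 0 for w in corpus_words})

# Step 2: rebuild the dict from only the words found in the vocabulary
word_to_idx = OrderedDict(
    {w: model_vocab[w] for w in corpus_words if w in model_vocab}
)

print(word_to_idx)
```

Note that the second assignment builds a brand-new dict and rebinds the name, rather than updating the dict from step 1 in place.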
Then I calculate the mean of the embedding vectors of the most similar words to each missing word:
```python
missing_embd = {}
for key, value in word_to_idx.items():
    if value == 0:
        similar_words = model.wv.most_similar(key)
        similar_embeddings = [model.wv[a[0]] for a in similar_words]
        missing_embd[key] = np.mean(similar_embeddings, axis=0)
```
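The averaging step can be sketched in plain NumPy on a toy embedding matrix (all words and vectors below are hypothetical, and `most_similar` is a minimal cosine-similarity stand-in for gensim's method). Note the `axis=0`, so the mean is taken per dimension and keeps the embedding dimensionality instead of collapsing to a scalar:

```python
import numpy as np

# Hypothetical toy embedding matrix: 4 words, 3 dimensions.
vocab = ["king", "queen", "prince", "castle"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.1],
    [0.1, 0.9, 0.4],
], dtype=np.float32)

def most_similar(query_vec, topn=2):
    """Return the topn (word, similarity) pairs by cosine similarity."""
    sims = vectors @ query_vec / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec)
    )
    order = np.argsort(-sims)[:topn]
    return [(vocab[i], float(sims[i])) for i in order]

# A query vector standing in for a missing word's context:
query = np.array([0.85, 0.15, 0.05], dtype=np.float32)
neighbours = most_similar(query)
neighbour_vecs = [vectors[vocab.index(w)] for w, _ in neighbours]

# Element-wise mean over the neighbour vectors (shape (3,), not a scalar)
missing_vec = np.mean(neighbour_vecs, axis=0)
print(missing_vec.shape)  # (3,)
```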
And then I add these new embeddings to the word2vec model by:
```python
for word, embd in missing_embd.items():
    # model.wv.build_vocab(word, update=True)
    model.wv.syn0[model.wv.vocab[word].index] = embd
```
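In plain NumPy terms, writing into an existing row of the vector matrix only works for words that already have a vocabulary entry; adding a genuinely new word means growing both the index map and the matrix together. A toy sketch with hypothetical stand-ins for `model.wv.vocab` and `model.wv.syn0`:

```python
import numpy as np

# Toy stand-ins for the gensim vocabulary and vector matrix (hypothetical data).
vocab_index = {"cat": 0, "dog": 1}
syn0 = np.zeros((2, 3), dtype=np.float32)

def add_word(word, vector):
    """Append a new word: one new index entry plus one new row in the matrix."""
    global syn0
    vocab_index[word] = syn0.shape[0]
    syn0 = np.vstack([syn0, np.asarray(vector, dtype=np.float32)])

add_word("zzyzx", [0.85, 0.15, 0.05])
print(syn0.shape)  # (3, 3)
```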
There is an inconsistency: when I print missing_embd, it's empty, as if there were no missing words at all. But when I check with this:
```python
for w in tokens_lower:
    if w not in model.wv.vocab:
        print(w)
print("***********")
```
I find a lot of missing words. Now I have three questions:

1. Why is missing_embd empty while there are missing words?
2. Is it possible that GoogleNews doesn't have words like "to"?
3. How can I append new embeddings to the word2vec model? I used build_vocab and syn0.

Thanks.
Comments:

- Try `model.build_vocab(sentences, update=True)`. – Scratch'N'Purr
- I get this error: `AttributeError: 'KeyedVectors' object has no attribute 'build_vocab'`. How can I use `build_vocab`? I load the model with `model = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)`. – Mahsa
- If you manage to get `build_vocab` to work afterwards, you would still have to do additional training using `model.train(sentences)`. – Scratch'N'Purr