2 votes

I want to get word embeddings for the words in a corpus. I decided to use the pretrained GoogleNews word vectors through the gensim library. But my corpus contains some words that are not in the GoogleNews vocabulary. For these missing words, I want to use the arithmetic mean of the n most similar GoogleNews words. First I load GoogleNews and check that the word "to" is in it:

#Load GoogleNews pretrained word2vec model
model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True)
print(model["to"])

I receive an error: KeyError: "word 'to' not in vocabulary". Is it possible that such a large dataset doesn't have this word? This is also true for some other common words like "a"!

To add the missing words to the word2vec model, I first want to get the indices of the words that are in GoogleNews. For missing words I have used index 0.

from collections import OrderedDict

#obtain index of words
word_to_idx=OrderedDict({w:0 for w in corpus_words})
word_to_idx=OrderedDict({w:model.wv.vocab[w].index for w in corpus_words if w in model.wv.vocab})

Then I calculate the mean of the embedding vectors of the most similar words for each missing word.

import numpy as np

missing_embd={}
for key,value in word_to_idx.items():
    if value==0:
        similar_words=model.wv.most_similar(key)
        similar_embeddings=[model.wv[a[0]] for a in similar_words]
        missing_embd[key]=np.mean(similar_embeddings,axis=0)

And then I add these new embeddings to the word2vec model by:

for word,embd in missing_embd.items():
    # model.wv.build_vocab(word,update=True)
    model.wv.syn0[model.wv.vocab[word].index]=embd

There is an inconsistency: when I print missing_embd, it's empty, as if there were no missing words. But when I check it with this:

for w in tokens_lower:
    if(w in model.wv.vocab)==False:
        print(w)
        print("***********")

I found a lot of missing words. Now I have 3 questions: 1- Why is missing_embd empty while there are missing words? 2- Is it possible that GoogleNews doesn't have words like "to"? 3- How can I append new embeddings to the word2vec model? I used build_vocab and syn0. Thanks.

The GoogleNews word2vec model probably excluded 'to' and 'a' due to their insignificance as stopwords. I don't think it's possible to update the model vocab since the model was generated from the C tool per the tutorial found here, but you can give it a shot with model.build_vocab(sentences, update=True). – Scratch'N'Purr
You mean that after loading the model I use model.build_vocab(sentences, update=True)? And then what are the embedding vectors for the missing words? – Mahsa
Yes, you can try that, but again, I don't think it's possible since Google's word2vec model was built with the C toolkit. You won't be able to get any similar embeddings for missing words since the model vocab never had these words to train on. – Scratch'N'Purr
Thanks for your comments. But when I use model.build_vocab I get this error: `AttributeError: 'KeyedVectors' object has no attribute 'build_vocab'`. How can I use build_vocab? – Mahsa
Hmmm, try this instead: model = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True). If you manage to get build_vocab to work afterwards, you would still have to do additional training using model.train(sentences). – Scratch'N'Purr
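For context, build_vocab(..., update=True) and train() are methods of a full Word2Vec model, not of the KeyedVectors object that load_word2vec_format() returns, which is why the AttributeError above appears. A rough sketch of the vocabulary-update flow on a model trained in gensim itself follows; the toy sentences and the gensim 3.x keyword arguments are assumptions for illustration only:

from gensim.models import Word2Vec

# hypothetical toy corpora, only to show the update flow
base_sentences = [["machine", "learning", "is", "fun"], ["word", "vectors"]]
new_sentences = [["brandnewword", "appears", "here"]]

model = Word2Vec(base_sentences, size=100, min_count=1)   # gensim 3.x uses size=
model.build_vocab(new_sentences, update=True)             # extend the existing vocabulary
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)

print("brandnewword" in model.wv.vocab)                   # True: the word now has a (briefly trained) vector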

2 Answers

2 votes

Here is a scenario where we are adding a missing lower case word.

from gensim.models import KeyedVectors
path = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embedding = KeyedVectors.load_word2vec_format(path, binary=True)

'Quoran' in embedding.vocab
# Output: True

'quoran' in embedding.vocab
# Output: False

Here 'Quoran' is present, but 'quoran' in lower case is missing.

# add quoran in lower case
embedding.add('quoran',embedding.get_vector('Quoran'),replace=False)

'quoran' in embedding.vocab
# Output: True
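If there are many missing words, add() should also accept lists of words and vectors (assuming gensim 3.7 or later), so several entries can be appended in one call. The mapping of missing words to known source words below is purely illustrative:

# sketch: add several missing lower-case words at once, copying vectors
# from hypothetical known source words
missing = {'quoran': 'Quoran', 'redditor': 'Redditor'}
words = list(missing.keys())
vectors = [embedding.get_vector(src) for src in missing.values()]
embedding.add(words, vectors, replace=False)

all(w in embedding.vocab for w in words)
# Output: True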
1 vote

It's possible Google removed common filler words like 'to' and 'a'. If the file seems otherwise uncorrupted, and checking other words after loading shows that they are present, it'd be reasonable to assume Google discarded the overly common words as having such diffuse meaning as to be of low value.
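A quick way to check this after loading (a sketch; the path and the word list are just examples):

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
for w in ["to", "a", "king", "house"]:
    print(w, w in kv.vocab)   # expect the filler words to be absent and the content words present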

It's unclear and muddled what you're trying to do. You assign to word_to_idx twice, so only the second line matters.

(The first assignment, creating a dict where every word has the value 0, has no lingering effect after the second line creates an all-new dict with only entries where w in model.wv.vocab. The only possible entry with a 0 after this step would be whatever word in the word-vectors set was already in position 0 – and only if that word was also in your corpus_words.)

You seem to want to build new vectors for unknown words based on an average of similar words. However, most_similar() only works for known words; it will raise an error if tried on a completely unknown word. So that approach can't work.
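For instance, continuing with the kv loaded above, asking for the neighbours of a made-up token just reproduces the KeyError from the question:

try:
    kv.most_similar("sometotallyunknowntoken")   # token is made up and not in the vocabulary
except KeyError as err:
    print(err)   # "word 'sometotallyunknowntoken' not in vocabulary"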

And a deeper problem is that the gensim KeyedVectors class doesn't support dynamically adding new word->vector entries. You would have to dig into its source code and, to add one or a batch of new vectors, modify a bunch of its internal properties (including its vectors array, vocab dict, and index2entity list) in a self-consistent manner so that the new entries behave like the original ones.
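On gensim 3.x releases that predate the KeyedVectors.add() method shown in the other answer, a rough sketch of that kind of self-consistent update could look like the following; the internal attribute names (vectors, vocab, index2entity, vectors_norm) are assumptions taken from that release line, not a supported public API:

import numpy as np
from gensim.models.keyedvectors import Vocab   # the location of Vocab may differ by version

def add_word(kv, word, vector):
    # append one new word->vector entry by updating the internals in step
    if word in kv.vocab:
        return
    idx = len(kv.index2entity)
    kv.vocab[word] = Vocab(index=idx, count=1)
    kv.index2entity.append(word)
    kv.vectors = np.vstack([kv.vectors, np.asarray(vector, dtype=kv.vectors.dtype)])
    kv.vectors_norm = None   # drop any cached normalised copy so most_similar() recomputes it

add_word(kv, "quoran", kv.get_vector("Quoran"))   # reuses the example word from the other answer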