
I've loaded pretrained word2vec embeddings into a Python dictionary of the form

{word: vector}

As an example, an element of this dictionary is

w2v_dict["house"] = [1.1, 2.0, ..., 0.2]

I would like to load this model into Gensim (or a similar library) so that I can find euclidean distances between embeddings.

I understand that pretrained embeddings typically come in a .bin file which can be loaded into Gensim. But if I only have a dictionary of this form, how would I load the vectors into a model?
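As an aside, if the only goal is euclidean distance between embeddings, that can be computed straight from the dictionary with `math.dist` (Python 3.8+), no gensim required. The `w2v_dict` contents below are made-up toy values for illustration:

```python
import math

# Toy stand-in for the pretrained {word: vector} dictionary
w2v_dict = {
    "house": [1.1, 2.0, 0.2],
    "home": [1.0, 1.8, 0.3],
}

def euclidean(w1, w2):
    """Euclidean distance between the embeddings of two words."""
    return math.dist(w2v_dict[w1], w2v_dict[w2])

print(euclidean("house", "home"))
```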

2 Answers


Usually pre-trained word-vectors would come in a format gensim could natively read, for example via the load_word2vec_format() method. It's odd that you only have vectors in your own format.

So, I'd recommend writing your vectors to a text format compatible with other word2vec libraries. You can review gensim's save_word2vec_format() method at:

https://github.com/RaRe-Technologies/gensim/blob/9819ce828b9ed7952f5d96cbb12fd06bbf5de3a3/gensim/models/utils_any2vec.py#L131

You could also train up a dummy Word2Vec model with any junk/toy data, save its vectors in the text format (w2v_model.wv.save_word2vec_format(filename, binary=False)), and review the resulting file.

Using the above source code or example file, write your dictionary in a similar format. Then, use gensim's KeyedVectors.load_word2vec_format(filename) to read your vectors in.
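A minimal sketch of that approach, writing the dictionary in the plain-text word2vec format (a `vocab_size vector_size` header line, then one `word v1 v2 ...` line per word). The dictionary contents and filename here are just placeholders:

```python
# Toy stand-in for the pretrained {word: vector} dictionary
w2v_dict = {
    "house": [1.1, 2.0, 0.2],
    "car": [0.5, 0.1, 0.9],
}

def save_word2vec_text(d, filename):
    """Write a {word: vector} dict in the plain-text word2vec format:
    first line 'vocab_size vector_size', then 'word v1 v2 ...' per word."""
    dims = len(next(iter(d.values())))
    with open(filename, "w", encoding="utf-8") as f:
        f.write(f"{len(d)} {dims}\n")
        for word, vec in d.items():
            f.write(word + " " + " ".join(map(str, vec)) + "\n")

save_word2vec_text(w2v_dict, "my_vectors.txt")

# Then, with gensim installed, read the vectors back in:
# from gensim.models import KeyedVectors
# kv = KeyedVectors.load_word2vec_format("my_vectors.txt", binary=False)
```

All words' vectors must have the same dimensionality, or gensim will reject the file on load.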


You can save it in the gensim word2vec format, then load it with gensim.models.KeyedVectors.load_word2vec_format(). Details here.