I am trying out Google's pre-trained word2vec model to get word embeddings. I can load the model and get a 300-dimensional representation of a single word. Here is the code:
from gensim.models import KeyedVectors

# load the pre-trained 300-dimensional Google News vectors
model = KeyedVectors.load_word2vec_format('/Downloads/GoogleNews-vectors-negative300.bin', binary=True)
dog = model['dog']
print(dog.shape)
which gives me the following output:
>>> print(dog.shape)
(300,)
This works, but I am interested in obtaining a vector representation for an entire document, not just a single word. Indexing the model with a whole sentence fails, since lookups are per-token only:
dog_sentence = model['it is a cute little dog']
KeyError: "word 'it is a cute little dog' not in vocabulary"
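In the meantime, the simplest workaround I can think of is to average the vectors of the individual tokens. Below is a minimal sketch, assuming whitespace tokenization is good enough, that out-of-vocabulary tokens can simply be skipped, and that at least one token is in the vocabulary:

import numpy as np

def document_vector(model, text):
    # keep only tokens covered by the pre-trained vocabulary
    tokens = [t for t in text.lower().split() if t in model]
    # average the per-token 300-d vectors into one document vector
    return np.mean([model[t] for t in tokens], axis=0)

dog_sentence = document_vector(model, 'it is a cute little dog')
print(dog_sentence.shape)  # (300,)

Is averaging like this a reasonable approach, or is there a more standard way to get document vectors out of word2vec?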
I plan to apply this to many documents and then train a clustering model on the resulting document vectors for unsupervised learning and topic modeling.
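For context, the downstream step would look roughly like this, reusing the document_vector helper above; documents is a hypothetical list of raw text strings, and scikit-learn's KMeans with an arbitrary cluster count is just a stand-in for whatever clustering method ends up working best:

import numpy as np
from sklearn.cluster import KMeans

# documents is a hypothetical list of raw text strings
doc_vectors = np.vstack([document_vector(model, doc) for doc in documents])
# cluster the document vectors; the cluster count here is arbitrary
labels = KMeans(n_clusters=10, random_state=0).fit_predict(doc_vectors)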