
I am trying out Google's pre-trained word2vec model to get word embeddings. I am able to load the model in my code, and I can see that I get a 300-dimensional representation of a word. Here is the code:

from gensim.models import KeyedVectors

# Load the pre-trained Google News vectors in binary word2vec format.
model = KeyedVectors.load_word2vec_format(
    '/Downloads/GoogleNews-vectors-negative300.bin', binary=True)

dog = model['dog']  # 300-dimensional vector for a single word
print(dog.shape)

which gives me the output below:

>>> print(dog.shape)
(300,)

This works, but I am interested in obtaining a vector representation for an entire document, not just one word. How can I do that using the word2vec model? Passing a whole sentence raises a KeyError:

dog_sentence = model['it is a cute little dog']
KeyError: "word 'it is a cute little dog' not in vocabulary"

I plan to compute these vectors for many documents and then train a clustering model on them for unsupervised learning and topic modeling.

2 Answers

0 votes

Approach 1: You have to get the vector for each word and combine them; the most basic way is to average them. You can also take a weighted average by computing a weight for each word (e.g., its tf-idf score), as in the sketch below.
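Here is a minimal sketch of the tf-idf-weighted variant, assuming model is the KeyedVectors object loaded in the question and a recent scikit-learn is installed; the two-document corpus and the weighted_doc_vector helper are hypothetical names chosen here for illustration:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; in practice, fit tf-idf on your own documents.
docs = ["it is a cute little dog", "the dog chased the cat"]
tfidf = TfidfVectorizer().fit(docs)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def weighted_doc_vector(text, kv):
    # Keep only tokens that exist in the word2vec vocabulary.
    words = [w for w in text.lower().split() if w in kv]
    if not words:
        return np.zeros(kv.vector_size)
    # Default weight of 1.0 for words the tf-idf vocabulary never saw.
    weights = [idf.get(w, 1.0) for w in words]
    return np.average([kv[w] for w in words], axis=0, weights=weights)

vec = weighted_doc_vector("it is a cute little dog", model)
print(vec.shape)  # (300,)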

Approach 2: Use doc2vec. You would have to train your own model, or find a pre-trained doc2vec model, for this; see the sketch below.
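A minimal training sketch using gensim's Doc2Vec class, on a hypothetical two-document corpus (a real corpus would need far more data and tuning):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training document is tokenized and given a unique tag.
corpus = [TaggedDocument(words=doc.lower().split(), tags=[i])
          for i, doc in enumerate(["it is a cute little dog",
                                   "the dog chased the cat"])]

d2v = Doc2Vec(corpus, vector_size=300, window=5, min_count=1, epochs=40)

# Infer a vector for a new, unseen document.
vec = d2v.infer_vector("it is a cute little dog".split())
print(vec.shape)  # (300,)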

0 votes

That's a set of word-vectors. There's no single canonical way to turn word-vectors into vectors for longer runs of text, like sentences or documents.

You can try simply averaging the word-vectors for each word in the text. (To do this, you wouldn't pass the whole string text, but break it into words, look up each word-vector, then average all those vectors.)
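A minimal sketch of that procedure, assuming model is the KeyedVectors object from the question; average_vector is a name chosen here for illustration, and words missing from the vocabulary are simply skipped:

import numpy as np

def average_vector(text, kv):
    # Break the text into words, keep those the model knows, and average.
    vectors = [kv[w] for w in text.lower().split() if w in kv]
    if not vectors:
        return np.zeros(kv.vector_size)  # no in-vocabulary words at all
    return np.mean(vectors, axis=0)

sentence_vec = average_vector('it is a cute little dog', model)
print(sentence_vec.shape)  # (300,)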

That's quick and simple to calculate, and works OK as a baseline for some tasks, especially topical-analyses on very-short texts. But as it takes no account of grammar/word-order, and dilutes all words with all others, it's often outperformed by more sophisticated analyses.

Note also: that set of word-vectors was calculated by Google around 2013, from news articles. It will miss words and word-senses that have arisen since then, and its vectors will be flavored by the way news-articles are written - very different from other domains of language. If you have enough data, training your own word-vectors, on your own domain's texts, may outperform them in both word-coverage and vector-relevance.