
I'm using the word2vec algorithm to detect the most important words in a document. My question is how to compute the weight of an important word using the vector obtained from doc2vec. My code looks like this:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load(fname)          # load a previously trained Doc2Vec model
word = ["suddenly"]
vectors = model.infer_vector(word)   # infer a vector for this one-word text

Thank you for your consideration.

First you'll need to define what you mean by "important" and "weight". – IVlad
Weight: I mean its degree of importance. My application is to detect topics for each block of text (topics meaning "important words"). – ucmou

2 Answers

0
votes

Let's say you can find the vector R corresponding to an entire document, using doc2vec. Let's also assume that using word2vec, you can find the vector v corresponding to any word w as well. And finally, let's assume that R and v are in the same N-dimensional space.

Assuming all this, you may then utilize plain old vector arithmetic to find out some correlations between R and v.

For starters, you can normalize v. Normalizing, after all, is just dividing each dimension by the magnitude of v (i.e. |v|). Let's call the normalized version of v v_normal.

Then, you may project v_normal onto the line represented by the vector R. That projection operation is just finding the dot product of v_normal and R, right? (Strictly speaking, you would also divide by |R| to get the true projection length, but since R is the same for every word, that constant does not change the ranking.) Let's call the scalar result of that dot product len_projection. Well, you can consider len_projection / |v_normal| as an indication of how parallel the context of the word is to that of the overall document. In fact, considering just len_projection is enough, because in this case, as v_normal is normalized, |v_normal| == 1.
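In code, for a single word, that normalize-then-project step is just a couple of NumPy operations (a sketch, assuming R and v are NumPy arrays of the same dimensionality):

import numpy as np

v_normal = v / np.linalg.norm(v)      # normalize v, i.e. divide by |v|
len_projection = np.dot(v_normal, R)  # dot product with R; since |v_normal| == 1,
                                      # this ranks words by their alignment with R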

Now, you may apply this procedure to all words in the document, and consider the words that lead to the greatest len_projection values as the most significant words of that document.

Note that this method may end up finding frequently-used words like "I" or "and" as the most important words in a document, since such words appear in many different contexts. If that is an issue you want to remedy, you may want to add a post-processing step that filters out such common words.
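Putting the whole procedure together, here is a rough sketch of how it could look with gensim, assuming a Doc2Vec model whose word vectors share the doc-vector space (e.g. PV-DM, or PV-DBOW trained with dbow_words=1); the example document text and the stop-word list are just illustrative placeholders:

import numpy as np
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load(fname)                    # fname: path to your trained model, as in the question
doc_words = "the storm arrived suddenly over the harbour".split()   # example document

R = model.infer_vector(doc_words)              # document vector R

stop_words = {"the", "a", "and", "i", "over"}  # optional post-processing filter
scores = {}
for w in doc_words:
    if w in stop_words or w not in model.wv:   # skip common or out-of-vocabulary words
        continue
    v = model.wv[w]
    v_normal = v / np.linalg.norm(v)           # normalize v
    scores[w] = np.dot(v_normal, R)            # len_projection for this word

# words with the largest len_projection are taken as the most significant
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)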

I sort of thought of this method on the spot here, and I am not sure whether this approach has any scientific backing. But it may make sense if you think about how most vector embeddings for words work. Word vectors are usually trained to represent the context in which a word is used. Thinking in terms of vector arithmetic, projecting the vector onto a line may reveal how parallel the context of that word w is to the overall context represented by that line.

Last but not least, as I have only worked with word2vec before, I am not sure whether doc2vec and word2vec data can be used jointly as I described above. As I stated in the first paragraph of my answer, it is really critical that R and v be in the same N-dimensional space.

0
votes

When using infer_vector(), none of the words provided has any more 'weight' or 'importance' than the others.

The process of inference tries to incrementally make a doc-vector that's generally pretty good at predicting all of the provided words, and every word is (in turn) an equally important target of the process.

Separately: inference rarely works well for tiny examples, like texts of just one or a few words. And inference usually works better when providing non-default values for its optional parameters. For example, instead of the default steps=5, a value of 20, 100, or more may be helpful, especially with smaller texts. Instead of the default starting alpha=0.1, a value of 0.025 (similar to the training default) often helps.
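For instance, a minimal sketch of inference with non-default parameters might look like this (assuming a gensim version whose infer_vector() still accepts the steps and alpha keyword arguments mentioned above; recent releases rename steps to epochs):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load(fname)   # fname: path to your saved model, as in the question
words = ["suddenly"]          # tiny one-word texts like this rarely infer well
vector = model.infer_vector(words, steps=100, alpha=0.025)   # more steps, smaller starting alpha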