
I know that in Word2Vec the length of word vectors can encode properties such as term frequency. In that case, two word vectors, say synonyms, may have a similar meaning but different lengths depending on their usage in our corpus.

However, if we normalize the word vectors, we keep their "directions of meaning" and we can cluster them according to exactly that: meaning.

Following that train of thought, the same reasoning should apply to document vectors in Doc2Vec.

But my question is: is there a reason NOT to normalize document vectors if we want to cluster them? In Word2Vec we can argue that we want to keep the frequency property of the words; is there a similar property for documents?


1 Answer


I'm not familiar with any reasoning or research precedent implying that either unit-normalized or non-normalized document vectors are better for clustering.

So, I'd try both to see which seems to work better for your purposes.
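For example, here's a minimal sketch of that comparison, assuming you already have a trained gensim `Doc2Vec` model in a variable named `model` (gensim 4.x attribute names; the cluster count of 10 is just a placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical: `model` is an already-trained gensim Doc2Vec model.
raw_vectors = model.dv.vectors                      # shape: (num_docs, vector_size)

# Unit-normalize each doc-vector (L2 norm), keeping only its direction.
norms = np.linalg.norm(raw_vectors, axis=1, keepdims=True)
unit_vectors = raw_vectors / norms

# Cluster both versions and compare the resulting assignments.
raw_labels = KMeans(n_clusters=10, random_state=0).fit_predict(raw_vectors)
unit_labels = KMeans(n_clusters=10, random_state=0).fit_predict(unit_vectors)
```

Note that KMeans uses Euclidean distance; on unit-length vectors that distance is a monotonic function of cosine distance, so clustering the normalized vectors effectively groups documents by direction ("meaning") only.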

Other thoughts:

In Word2Vec, my general impression is that larger-magnitude word-vectors are associated with words that, in the training data, have more unambiguous meaning. (That is, they reliably tend to imply the same smaller set of neighboring words.) Meanwhile, words with multiple meanings (polysemy) and usage amongst many other diverse words tend to have lower-magnitude vectors.

Still, the common way of comparing such vectors, cosine-similarity, is oblivious to magnitudes. That's likely because most comparisons just need the best sense of a word, without any more subtle indicator of "unity of meaning".
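To illustrate with toy vectors (not real word-vectors), cosine-similarity ignores magnitude while Euclidean distance does not:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                     # same direction, 10x the magnitude

cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

print(cosine)     # 1.0   -- identical by cosine-similarity despite different lengths
print(euclidean)  # ~33.67 -- far apart by Euclidean distance
```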

A similar effect might be present in Doc2Vec vectors: lower-magnitude doc-vectors could be a hint that the document has more broad word-usage/subject-matter, while higher-magnitude doc-vectors suggest more focused documents. (I'd similarly have the hunch that longer documents may tend to have lower-magnitude doc-vectors, because they use a greater diversity of words, whereas small documents with a narrow set of words/topics may have higher-magnitude doc-vectors. But I have not specifically observed/tested this hunch, and any effect here could be heavily influenced by other training choices, like the number of training iterations.)
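If you wanted to check that hunch on your own corpus, a rough sketch (again assuming a trained gensim `Doc2Vec` model `model`, plus a hypothetical list `documents` of token lists in the same order as the stored doc-vectors) could correlate doc-vector magnitude with document length:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical: `documents` is the list of token-lists used to train `model`,
# in the same order as the doc-vectors stored in model.dv.
lengths = np.array([len(doc) for doc in documents])
magnitudes = np.linalg.norm(model.dv.vectors, axis=1)

r, p = pearsonr(lengths, magnitudes)
print(f"correlation between doc length and vector magnitude: r={r:.3f} (p={p:.3g})")
```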

Thus, it's possible that the non-normalized vectors would be interesting for some clustering goals, like separating focused documents from more general documents. So again, after this longer analysis: I'd suggest trying it both ways to see if one or the other seems to work better for your specific needs.