The infer_vector() method will train up a doc-vector for a new text, which should be a list of tokens that were preprocessed just like the training texts.
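For example, a minimal sketch (the model filename and example text are hypothetical, and simple_preprocess stands in for whatever preprocessing you actually applied to the training texts):

```python
from gensim.models import Doc2Vec
from gensim.utils import simple_preprocess

# hypothetical filename: load your already-trained Doc2Vec model
model = Doc2Vec.load('my_doc2vec.model')

# tokenize/preprocess exactly as you did for the training texts
new_tokens = simple_preprocess("text of a new, unseen document")

# infer a doc-vector for the new token list
new_vector = model.infer_vector(new_tokens)
```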
And, as you've noted, model.docvecs['my_tag']
will get the pre-trained doc-vector for one of the tags that was known during training.
Checking the similarity of a new vector against the vectors for all known tags is a reasonable baseline way to see which existing tags a new document is similar to. The closest tag, or closest few tags, might be reasonable labels for an unknown document, as a sort of 'nearest-neighbor' approach.
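Continuing the sketch above, that baseline might look like this (using the pre-4.0 docvecs attribute from your question; most_similar() accepts a raw vector in its positive list, and topn=4 is just a placeholder for however many tags you trained with):

```python
# rank all known tags by cosine similarity to the inferred vector;
# the top hit (or top few) serve as nearest-neighbor label guesses
for tag, similarity in model.docvecs.most_similar([new_vector], topn=4):
    print(tag, similarity)
```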
But note that the original/usual Doc2Vec approach is to give each document a unique ID, and let each ID-tag get its own vector. Then, perhaps, use those vectors together with the known labels to train some other classifier that maps vectors to labels, as sketched below. (This might work better in some cases, if the "areas of the doc-vector space" that humans associate with a particular label aren't neat radii around a single centroid point for each label.)
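A sketch of that alternative, under the assumptions that each training document was tagged with its integer position (so model.docvecs[i] lines up with a separate, hypothetical doc_labels list of known labels) and that scikit-learn's LogisticRegression is one arbitrary classifier choice among many:

```python
from sklearn.linear_model import LogisticRegression

# doc_labels is assumed: one known label per training document, in the
# same order as the unique integer tags 0..N-1 used during training
train_vectors = [model.docvecs[i] for i in range(len(doc_labels))]

classifier = LogisticRegression(max_iter=1000)
classifier.fit(train_vectors, doc_labels)

# classify the new document via its inferred vector
predicted_label = classifier.predict([new_vector])[0]
```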
Your approach of using, or adding, known labels as doc-tags can often help. But also note that if you're using only 4 unique tags across thousands of documents, that's functionally very similar to training the model with just 4 giant documents, which may not position those 4 vectors well in a high-dimensional space (>4 dimensions): there isn't enough of the variety and subtle contrast needed to nudge the trained vectors into useful arrangements. (Typical published Doc2Vec work uses tens of thousands to millions of unique docs and doc-tags.)