What is the effect of assigning the same label to a bunch of sentences in doc2vec?

I have a collection of documents that I want to learn vectors for, using gensim, for a "file" classification task, where a file is a collection of documents for a given ID. I have several ways of labeling in mind, and I want to know what the difference between them would be and which is best:
- Take a document d1, assign the label doc1 to its tags, and train. Repeat for the others.
- Take a document d1 and assign the label doc1 to its tags. Then tokenize the document into sentences, assign the label doc1 to each sentence's tags, and train with both the full document and the individual sentences. Repeat for the others.
For example (ignore that the sentence isn't tokenized):
Document - "It is small. It is rare"
TaggedDocument(words=["It is small. It is rare"], tags=['doc1'])
TaggedDocument(words=["It is small."], tags=['doc1'])
TaggedDocument(words=["It is rare."], tags=['doc1'])
- Similar to the above, but also assign a unique label to each sentence along with doc1. The full document has all the sentence tags along with doc1.
Example:
Document - "It is small. It is rare"
TaggedDocument(words=["It is small. It is rare"], tags=['doc1', 'doc1_sentence1', 'doc1_sentence2'])
TaggedDocument(words=["It is small."], tags=['doc1', 'doc1_sentence1'])
TaggedDocument(words=["It is rare."], tags=['doc1', 'doc1_sentence2'])
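To make the three schemes concrete, here is a rough sketch of how I would build the corpus for one document under each of them. Only `TaggedDocument` comes from gensim; the helper names (`split_sentences`, `tokenize`, `build_corpus`) and the naive period-based splitting are placeholders of mine for illustration, not gensim APIs, and a real pipeline would use a proper sentence splitter and tokenizer.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in so the sketch runs without gensim installed; gensim's
    # TaggedDocument is likewise a namedtuple with fields `words` and `tags`.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


def split_sentences(text):
    """Hypothetical helper: naive split on '.', dropping empty pieces."""
    return [s.strip() for s in text.split(".") if s.strip()]


def tokenize(text):
    """Hypothetical helper: lowercase, drop periods, split on whitespace."""
    return text.lower().replace(".", " ").split()


def build_corpus(doc_id, text, scheme):
    """Build the TaggedDocuments for one document under scheme 1, 2, or 3."""
    sentences = split_sentences(text)

    if scheme == 1:
        # Scheme 1: only the full document, tagged with the document label.
        return [TaggedDocument(tokenize(text), [doc_id])]

    if scheme == 2:
        # Scheme 2: full document plus each sentence, all sharing one tag.
        corpus = [TaggedDocument(tokenize(text), [doc_id])]
        corpus += [TaggedDocument(tokenize(s), [doc_id]) for s in sentences]
        return corpus

    if scheme == 3:
        # Scheme 3: every sentence also gets its own unique tag, and the
        # full document carries the document tag plus all sentence tags.
        sent_tags = [f"{doc_id}_sentence{i}" for i in range(1, len(sentences) + 1)]
        corpus = [TaggedDocument(tokenize(text), [doc_id] + sent_tags)]
        corpus += [
            TaggedDocument(tokenize(s), [doc_id, tag])
            for s, tag in zip(sentences, sent_tags)
        ]
        return corpus

    raise ValueError(f"unknown scheme: {scheme}")


corpus = build_corpus("doc1", "It is small. It is rare", scheme=3)
for doc in corpus:
    print(doc.words, doc.tags)
```

In each case the resulting list (concatenated over all documents) is what I would pass to `Doc2Vec` as the training corpus, so the only thing that differs between the schemes is how many texts contribute to each tag's vector.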
I also have some additional categorical tags that I'd be assigning. So what would be the best approach?