I am learning about Doc2Vec and the gensim library. I have been able to train my model by creating a corpus of documents such as:
LabeledSentence(['what', 'happens', 'when', 'an', 'army', 'of', 'wetbacks', 'towelheads', 'and', 'godless', 'eastern', 'european', 'commies', 'gather', 'their', 'forces', 'south', 'of', 'the', 'border', 'gary', 'busey', 'kicks', 'their', 'butts', 'of', 'course', 'another', 'laughable', 'example', 'of', 'reagan-era', 'cultural', 'fallout', 'bulletproof', 'wastes', 'a', 'decent', 'supporting', 'cast', 'headed', 'by', 'l', 'q', 'jones', 'and', 'thalmus', 'rasulala'], ['LABELED_10', '0'])
Note that this particular document has two tags, namely 'LABELED_10' and '0'.
Now, after I load my model and run
print(model.docvecs.most_similar("LABELED_10"))
I get
[('LABELED_107', 0.48432376980781555), ('LABELED_110', 0.4827481508255005), ('LABELED_214', 0.48039984703063965), ('LABELED_207', 0.479473352432251), ('LABELED_315', 0.47931796312332153), ('LABELED_307', 0.47898322343826294), ('LABELED_124', 0.4776897132396698), ('LABELED_222', 0.4768940210342407), ('LABELED_413', 0.47479286789894104), ('LABELED_735', 0.47462597489356995)]
which is perfect, as I get all the tags most similar to LABELED_10.
Now I would like to have a feedback loop while training my model. If I give my model a new document, I would like to know how good or bad the model's classification is before tagging that document and adding it to my corpus. How would I do that using Doc2Vec? In other words, how do I know whether the documents for LABELED_107 and LABELED_10 are actually similar or not? Here is one approach that I have in mind. This is the code for my random forest classifier:
result = cfun.rfClassifer(n_estimators, trainingDataFV, train["sentiment"],testDataFV)
and here is the function:
import logging

from sklearn.ensemble import RandomForestClassifier

def rfClassifer(n_estimators, trainingSet, label, testSet):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    # Fit a random forest on the training feature vectors and
    # predict labels for the test vectors.
    forest = RandomForestClassifier(n_estimators)
    forest = forest.fit(trainingSet, label)
    result = forest.predict(testSet)
    return result
and finally I can do
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv("../../submits/Doc2Vec_AvgVecPredict.csv", index=False, quoting=3)
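For reference, calling the classifier on toy data behaves as I expect (the 2-D points below are made-up stand-ins for my Doc2Vec feature vectors):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy "document vectors": two well-separated clusters.
trainingDataFV = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
labels = [0, 0, 1, 1]
testDataFV = [[0.05, 0.02], [0.95, 1.0]]

# Same fit/predict steps as in rfClassifer above, with a fixed seed
# so the toy run is reproducible.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest = forest.fit(trainingDataFV, labels)
result = forest.predict(testDataFV)
print(result)  # expect [0 1]
```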
Feedback process
1. Keep a validation set which is tagged correctly.
2. Feed the validation set to the classifier after removing the tags and save the result in a CSV.
3. Compare the result with another CSV that has the correct tags.
4. For every mismatch, add those documents to the labeled training set and train the model again.
5. Repeat for more validation sets.
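The comparison step could be sketched like this (assuming both CSVs have the same 'id' and 'sentiment' columns as the submission file I write above; the function name is mine):

```python
import csv

def find_mismatches(predicted_csv, truth_csv):
    """Return the ids whose predicted tag differs from the correct one."""
    with open(predicted_csv, newline='') as f:
        predicted = {row['id']: row['sentiment'] for row in csv.DictReader(f)}
    with open(truth_csv, newline='') as f:
        truth = {row['id']: row['sentiment'] for row in csv.DictReader(f)}
    # Documents the classifier got wrong would go back into the
    # labeled training set for the next round of training.
    return [doc_id for doc_id, tag in truth.items()
            if predicted.get(doc_id) != tag]
```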
Is this approach correct? Also, can I incrementally train the Doc2Vec model? Let's say that initially I trained my Doc2Vec model with 100k tagged docs. Now, after the validation step, I need my model to be trained on a further 10k documents. Will I have to train my model from the very beginning, i.e. will I need to train it on the initial 100k tagged docs again?
I would really appreciate your insights.
Thanks