
I am learning about Doc2Vec and the gensim library. I have been able to train my model by creating a corpus of documents such as

LabeledSentence(['what', 'happens', 'when', 'an', 'army', 'of', 'wetbacks', 'towelheads', 'and', 'godless', 'eastern', 'european', 'commies', 'gather', 'their', 'forces', 'south', 'of', 'the', 'border', 'gary', 'busey', 'kicks', 'their', 'butts', 'of', 'course', 'another', 'laughable', 'example', 'of', 'reagan-era', 'cultural', 'fallout', 'bulletproof', 'wastes', 'a', 'decent', 'supporting', 'cast', 'headed', 'by', 'l', 'q', 'jones', 'and', 'thalmus', 'rasulala'], ['LABELED_10', '0'])

Note that this particular document has two tags, namely 'LABELED_10' and '0'.
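For reference, here is a minimal, self-contained training sketch. In recent gensim versions LabeledSentence was renamed TaggedDocument, but the shape is the same: a list of word tokens plus a list of tags. The corpus contents and parameter values below are toy illustrations, not the question's real data.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus: each document gets an index tag plus a sentiment tag,
    # mirroring the two-tag scheme in the question.
    corpus = [
        TaggedDocument(words=['gary', 'busey', 'kicks', 'their', 'butts'],
                       tags=['LABELED_10', '0']),
        TaggedDocument(words=['a', 'decent', 'supporting', 'cast'],
                       tags=['LABELED_11', '1']),
    ]

    # Small illustrative parameters; a real corpus needs a larger
    # vector_size and more data.
    model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=20)

    # One learned vector per tag.
    print(model.docvecs['LABELED_10'].shape)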

Now, after I load my model and perform

print(model.docvecs.most_similar("LABELED_10"))

I get

[('LABELED_107', 0.48432376980781555), ('LABELED_110', 0.4827481508255005), ('LABELED_214', 0.48039984703063965), ('LABELED_207', 0.479473352432251), ('LABELED_315', 0.47931796312332153), ('LABELED_307', 0.47898322343826294), ('LABELED_124', 0.4776897132396698), ('LABELED_222', 0.4768940210342407), ('LABELED_413', 0.47479286789894104), ('LABELED_735', 0.47462597489356995)]

which is perfect, as I get all the tags most similar to LABELED_10.

Now I would like to have a feedback loop while training my model: if I give the model a new document, I would like to know how good or bad its classification is before tagging that document and adding it to my corpus. How would I do that using Doc2Vec? In other words, how do I know whether the documents for LABELED_107 and LABELED_10 are actually similar or not? Here is one approach that I have in mind. This is the code for my random forest classifier

result = cfun.rfClassifer(n_estimators, trainingDataFV, train["sentiment"],testDataFV)

and here is the function

    import logging

    from sklearn.ensemble import RandomForestClassifier


    def rfClassifer(n_estimators, trainingSet, label, testSet):

        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                            level=logging.INFO)

        # Fit a random forest on the labeled training vectors,
        # then predict labels for the test vectors.
        forest = RandomForestClassifier(n_estimators=n_estimators)
        forest = forest.fit(trainingSet, label)
        result = forest.predict(testSet)

        return result

and finally I can do

output = pd.DataFrame(data={"id": test["id"], "sentiment": result})

output.to_csv("../../submits/Doc2Vec_AvgVecPredict.csv", index=False, quoting=3)

Feedback process

  1. Keep a validation set that is tagged correctly.

  2. Feed the validation set to the classifier after removing the tags and save the result in a CSV.

  3. Compare the result with another CSV that has the correct tags.

  4. For every mismatch, add those documents to the labeled training set and train the model again.

  5. Repeat for more validation sets.
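Steps 2 through 4 can be sketched as follows. This is a runnable toy version: the feature vectors and labels are made-up stand-ins for the Doc2Vec vectors and sentiments, and `rfClassifer` mirrors the function from the question.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    def rfClassifer(n_estimators, trainingSet, label, testSet):
        forest = RandomForestClassifier(n_estimators=n_estimators)
        forest.fit(trainingSet, label)
        return forest.predict(testSet)

    # Hypothetical feature vectors standing in for Doc2Vec document vectors.
    trainFV = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
    train_labels = [0, 0, 1, 1]
    valFV = [[0.05, 0.95], [0.95, 0.05]]
    val_labels = [0, 1]          # the known-correct validation tags (step 1)

    # Step 2: classify the validation set with its tags removed.
    predicted = rfClassifer(10, trainFV, train_labels, valFV)

    # Step 3: compare the result with the correct tags.
    print("accuracy:", accuracy_score(val_labels, predicted))

    # Step 4: collect the mismatches so they can be tagged and added
    # back into the training set.
    mismatches = [i for i, (p, t) in enumerate(zip(predicted, val_labels))
                  if p != t]
    print("mismatched indices:", mismatches)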

Is this approach correct? Also, can I incrementally train the Doc2Vec model? Let's say that initially I trained my model on 100k tagged docs. Now, after the validation step, I need the model to be trained on a further 10k documents. Will I have to train my model from the very beginning, i.e. will I need to train it on the initial 100k tagged docs again?

I would really appreciate your insights.

Thanks

Do these labels mean anything, or are they just sentence indices? Also, you have a sentiment independently for each sentence, is that right? – kampta
Each sentence has its own sentiment. The label can be a sentence index, as in this example, but I have also tried it with multiple labels, meaning one label for the index and another for the sentiment. – AbtPst

1 Answer


From my understanding of Doc2Vec, you can retrain your model as long as you have all the previous vectors along with the model.
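As an alternative to retraining, gensim's Doc2Vec can infer a vector for an unseen document with `infer_vector` and compare it to the tags it was trained on, which directly gives the "how good is the classification" feedback you describe. A minimal self-contained sketch (toy corpus and parameters; tag names follow the question's scheme):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Tiny illustrative corpus.
    corpus = [
        TaggedDocument(['gary', 'busey', 'kicks', 'butts'], ['LABELED_10']),
        TaggedDocument(['decent', 'supporting', 'cast'], ['LABELED_107']),
    ]
    model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=40)

    # Infer a vector for a new, unseen document without retraining,
    # then ask which trained tags it is closest to. If the nearest tags
    # match the expected label, the classification looks plausible;
    # otherwise, flag the document for your feedback loop.
    new_vec = model.infer_vector(['gary', 'busey', 'kicks'])
    print(model.docvecs.most_similar([new_vec], topn=2))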

And from the following link, you can see how they perform their validation: https://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis