I have a large data set that I use to train a naive Bayes classifier with Apache Mahout. I then use the classifier to classify a batch of documents (effectively my test set). I classify the documents as follows:

I compute the normalized tf-idf vector for each test document. To compute the idf, I consider only the test documents, not the training documents.

However, after classifying the test documents I will receive more documents to classify, and I first need to calculate the tf-idf for these new documents. One solution is to recalculate the tf-idf for all the test documents (old and new) and then reclassify them all. In that scenario, every time I receive a new document I have to recalculate the tf-idf for the whole set. Is there a better way to do this kind of online classification?

1 Answer

There are several ways to handle a newly received document. Recalculating everything, as you describe, seems impractical. I would suggest two approaches that calculate the tf-idf only for the new document and then classify it directly:

  1. Calculate the idf using ALL documents (the new one and all previously seen documents).
  2. Reuse the idf that was already calculated on the test set.

Try approaches 1 and 2 on the test set by splitting it into two parts (one part to compute the idf, the other to play the role of "new" documents) and check which approach is better suited to your document types.
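
To make the two approaches concrete, here is a minimal sketch in plain Python. It is not Mahout code: the `IdfModel` class, the whitespace tokenizer, the sample documents, and the smoothed idf formula are all assumptions chosen for illustration. The idea is that you only need to keep the document count and the per-term document frequencies, so the idf can either be frozen (approach 2) or updated incrementally (approach 1) without revisiting old documents.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


class IdfModel:
    """Stores document frequencies so the idf can be reused or updated."""

    def __init__(self):
        self.num_docs = 0
        self.doc_freq = Counter()

    def add_document(self, tokens):
        # Each distinct term in the document counts once.
        self.num_docs += 1
        self.doc_freq.update(set(tokens))

    def idf(self, term):
        # Smoothed idf (an assumption): terms never seen before
        # still get a finite weight instead of a division by zero.
        return math.log((1 + self.num_docs) / (1 + self.doc_freq[term])) + 1.0

    def tfidf(self, tokens):
        # L2-normalized tf-idf vector as a sparse dict.
        counts = Counter(tokens)
        vec = {t: c * self.idf(t) for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {t: v / norm for t, v in vec.items()} if norm else vec


# Made-up sample data standing in for the existing test set.
test_docs = ["cheap pills online", "meeting notes attached", "cheap flights online"]
new_doc = "cheap meeting online"

model = IdfModel()
for doc in test_docs:
    model.add_document(tokenize(doc))

# Approach 2: reuse the idf already computed on the test set.
vec_frozen = model.tfidf(tokenize(new_doc))

# Approach 1: fold the new document into the counts, then vectorize it.
model.add_document(tokenize(new_doc))
vec_updated = model.tfidf(tokenize(new_doc))
```

Either vector can then be passed to the classifier. With approach 1, vectors of previously classified documents become slightly stale as the counts drift, so it may be worth re-vectorizing them occasionally rather than on every new document.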