I have a large data set that I use to train a naive Bayes classifier with Apache Mahout. I then use the classifier to classify a batch of documents (effectively my test set). I classify the documents as follows:
I compute the normalized tf-idf vector for each test document. To compute the idf, I use only the test documents, not the training set.
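To make the procedure concrete, here is a minimal sketch in plain Python rather than Mahout code. It assumes idf = log(N / df) and L2 normalization; Mahout's own weighting may differ, and the function name `tfidf_vectors` is only illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (the test documents only).
    Returns one L2-normalized tf-idf dict per document."""
    n_docs = len(docs)
    # Document frequency: number of test documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # Assumed weighting: raw term count times log(N / df).
        vec = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors

vectors = tfidf_vectors([["a", "b"], ["a", "c"]])
```

Because `df` and `n_docs` are taken over the whole test set, every vector depends on every test document.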
However, after classifying the test documents, I will receive more documents to classify, and I first need to compute tf-idf for the new ones. One solution is to recompute tf-idf over all the test documents (old and new) and then re-classify them all. With this approach, every time a new document arrives I have to recompute tf-idf for the whole set, because the idf depends on every document seen so far. Is there a better way to do this kind of online classification?
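One possible direction, sketched under the same assumptions (plain Python, idf = log(N / df), L2 normalization), is to keep running document-frequency counts so that adding a document only updates the counters instead of rescanning the whole set. The class `RunningIdf` is hypothetical and not part of Mahout, and the sketch does not settle whether previously classified documents should be re-classified once the idf values drift.

```python
import math
from collections import Counter

class RunningIdf:
    """Keeps document-frequency counts that are updated one document at a time."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()

    def add(self, tokens):
        # Update the counts for a newly arrived document; no rescan of old documents.
        self.n_docs += 1
        self.df.update(set(tokens))

    def vector(self, tokens):
        # Normalized tf-idf for one document using the current counts.
        # Terms never seen by add() are skipped.
        tf = Counter(tokens)
        vec = {t: c * math.log(self.n_docs / self.df[t])
               for t, c in tf.items() if self.df[t] > 0}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        return vec

stats = RunningIdf()
stats.add(["a", "b"])
stats.add(["a", "c"])
before = stats.vector(["a", "b"])
stats.add(["b"])                    # a new document arrives
after = stats.vector(["a", "b"])    # same document, updated idf
```

Vectors computed before a new document arrived are stale with respect to the updated counts, so `before` and `after` differ for the same document.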