In my project we are trying to calculate the text similarity of a set of documents, and I am facing two issues.
First, I do not want to recalculate the term frequency (TF) of documents I have already processed. For example, suppose I have 10 documents and have calculated the TF and inverse document frequency (IDF) for all of them. Then 2 more documents arrive. I want to calculate the TF only for the 2 new documents, then combine it with the stored TFs of the existing 10 to calculate the IDF over all 12 documents as a whole. How can I calculate the IDF across all the documents without recalculating the TFs of the existing ones?
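To make the first issue concrete, here is a minimal sketch in plain Java of the behaviour I am after. It is not Mahout code: the class `IncrementalTfIdf` and its methods are hypothetical names, the tokenizer is a naive split on non-word characters, and the IDF is the textbook `log(N / df)`, which may differ from the variant Mahout applies. The idea is that each document's TF is computed exactly once, a per-term document-frequency counter is updated as documents arrive, and the IDF is derived on demand from that counter and the current document count.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: keeps per-document term counts
// and a running document-frequency table so IDF never needs a TF recount.
class IncrementalTfIdf {
    // docId -> (term -> raw count); stands in for TFs that were stored earlier
    private final Map<String, Map<String, Integer>> tfByDoc = new LinkedHashMap<>();
    // term -> number of documents that contain the term
    private final Map<String, Integer> docFreq = new HashMap<>();

    // Computes the TF of one new document and updates the document frequencies.
    void addDocument(String docId, String text) {
        if (tfByDoc.containsKey(docId)) {
            throw new IllegalArgumentException("already indexed: " + docId);
        }
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        tfByDoc.put(docId, tf);
        // Each distinct term in the new document raises its document frequency by one.
        for (String term : tf.keySet()) {
            docFreq.merge(term, 1, Integer::sum);
        }
    }

    // Textbook IDF over all documents seen so far (assumed formula: log(N / df)).
    double idf(String term) {
        int df = docFreq.getOrDefault(term, 0);
        return df == 0 ? 0.0 : Math.log((double) tfByDoc.size() / df);
    }

    // TF-IDF weight of a term in a document, using the stored raw count.
    double tfIdf(String docId, String term) {
        Map<String, Integer> tf = tfByDoc.get(docId);
        return tf == null ? 0.0 : tf.getOrDefault(term, 0) * idf(term);
    }

    int documentCount() {
        return tfByDoc.size();
    }

    public static void main(String[] args) {
        IncrementalTfIdf index = new IncrementalTfIdf();
        index.addDocument("d1", "apple banana apple");
        index.addDocument("d2", "banana cherry");
        // A later arrival: only this document is tokenized, d1 and d2 are untouched.
        index.addDocument("d3", "cherry date");
        System.out.println(index.tfIdf("d1", "apple"));
    }
}
```

In this sketch, adding the 2 new documents touches only their own text; the IDF for all 12 would come from the updated counters. What I do not know is how to get the same effect inside Mahout's pipeline.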
Second, the number of documents will keep growing, so the in-memory approach (InMemoryBayesDatastore) may become cumbersome. What I want is to store the TF of every document in an HBase table. When new documents arrive, I would calculate their TFs, save them in the same table, and then read the TFs of all the documents from that table to calculate the IDF. How can I use HBase to supply data to Mahout's text similarity instead of fetching it from the sequence file?
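The data flow I have in mind for the second issue looks roughly like the sketch below. Everything here is an assumption for illustration: `TermFrequencyStore`, `InMemoryTermFrequencyStore`, and `IdfFromStore` are hypothetical names, and the in-memory map only stands in for an HBase table whose row key would be the document ID, with one column per term holding the raw count. It deliberately contains no HBase or Mahout calls, since that wiring is exactly the part I am asking about.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical storage abstraction: one row per document, one cell per term.
interface TermFrequencyStore {
    void put(String docId, Map<String, Integer> termCounts);
    Map<String, Map<String, Integer>> scanAll();
}

// In-memory stand-in for the table; a real version would write to and scan HBase.
class InMemoryTermFrequencyStore implements TermFrequencyStore {
    // TreeMap keeps rows sorted by key, as HBase does with row keys.
    private final Map<String, Map<String, Integer>> rows = new TreeMap<>();

    @Override
    public void put(String docId, Map<String, Integer> termCounts) {
        rows.put(docId, new HashMap<>(termCounts)); // defensive copy of the row
    }

    @Override
    public Map<String, Map<String, Integer>> scanAll() {
        return Collections.unmodifiableMap(rows);
    }
}

// Derives IDF purely from the stored TF rows, without re-reading document text.
class IdfFromStore {
    static Map<String, Double> computeIdf(TermFrequencyStore store) {
        Map<String, Map<String, Integer>> all = store.scanAll();
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map<String, Integer> tf : all.values()) {
            for (String term : tf.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        int n = all.size();
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            // Assumed textbook formula: log(N / df).
            idf.put(e.getKey(), Math.log((double) n / e.getValue()));
        }
        return idf;
    }
}
```

With something like this, only new documents would be written to the table, and the IDF step would read the stored rows rather than the sequence file.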