Using HBase to fetch data to calculate Text Similarities using Mahout

Question

In my project we are trying to calculate the text similarity of a set of documents, and I am facing two issues.

1. I do not want to recalculate the Term Frequency (TF) of documents I have already processed. For example, say I have 10 documents and have calculated the TF and Inverse Document Frequency (IDF) for all of them. Then 2 more documents arrive. I want to calculate the TF only for the 2 new documents, then use the TFs of all 12 documents to calculate the IDF for the 12 documents as a whole. How can I calculate the IDF over all the documents without recalculating the TFs of the existing ones?
2. The number of documents might increase, which means the in-memory approach (InMemoryBayesDatastore) might become cumbersome. What I want is to save the TF of all the documents in an HBase table. When new documents arrive, I calculate their TF, save it in the HBase table, and then use this table to fetch the TF of all the documents to calculate the IDF. How can I use HBase to provide data to Mahout's Text Similarity instead of fetching it from the sequence file?

Tucker Tucker · Accepted Answer · 2012-07-04T04:51:11

I assume your MR job reads from HDFS and writes to HBase. What I suggest, if I understand your problem correctly, is to calculate the TF for each document and store the term as the row key, the document ID as the qualifier, and the frequency as the value (just a suggestion for your schema). You will need one MR job per document, and you only have to run that job once per document.
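The suggested schema can be sketched in plain Python, with a dictionary standing in for the HBase table. This is only an illustration of the layout (term as row key, document ID as qualifier, frequency as value), not HBase or Mahout code; the names `tf_table` and `store_term_frequencies` and the whitespace tokenizer are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Stand-in for the HBase table (assumption: a real job would write cells
# through the HBase client instead of a dict).
# Layout: row key = term, qualifier = document ID, value = term frequency.
tf_table = defaultdict(dict)


def store_term_frequencies(table, doc_id, text):
    """Compute the TF of ONE document and write one cell per distinct term.

    Rows belonging to previously stored documents are not recomputed;
    the new document only adds its own qualifier to each term's row.
    """
    # Naive tokenizer, used only to keep the sketch short.
    counts = Counter(text.lower().split())
    for term, freq in counts.items():
        table[term][doc_id] = freq
    return counts


# Documents are processed once each, as they arrive.
store_term_frequencies(tf_table, "doc1", "hbase stores rows hbase scales")
store_term_frequencies(tf_table, "doc2", "mahout reads rows")
```

With this layout, the number of qualifiers in a row is the number of documents that contain the term, which is the quantity an IDF calculation needs.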

Do this for each document you are analyzing as they arrive.

Then run a final MR job to compare all the documents on a per-term (i.e. per-row) basis. This will work for exact terms, but it gets complicated with 'similar terms'. For those you'd want to run some sort of algorithm that takes into account, perhaps, the Levenshtein distance between terms, which can be complicated.
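A rough sketch of that final pass, again in plain Python rather than as an MR job: it reads the stored per-term rows, derives the IDF from the number of document IDs in each row, and compares documents by cosine similarity over TF-IDF weights. The sample rows, the function names, the `log(N / df)` form of IDF, and the choice of cosine similarity are all assumptions for illustration; Mahout's own jobs may weight terms differently.

```python
import math
from collections import defaultdict

# Sample rows as they might be read back from the table (hypothetical data):
# row key = term, qualifier = document ID, value = stored term frequency.
tf_table = {
    "hbase": {"doc1": 2},
    "rows": {"doc1": 1, "doc2": 1, "doc3": 1},
    "mahout": {"doc2": 1, "doc3": 2},
}


def tfidf_vectors(table):
    """Build one sparse TF-IDF vector per document from the stored TFs."""
    # Total document count N, derived here from the distinct qualifiers.
    doc_ids = {doc_id for row in table.values() for doc_id in row}
    n_docs = len(doc_ids)
    vectors = defaultdict(dict)
    for term, row in table.items():
        # Document frequency = number of qualifiers in the term's row.
        idf = math.log(n_docs / len(row))
        for doc_id, tf in row.items():
            vectors[doc_id][term] = tf * idf
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = tfidf_vectors(tf_table)
similarity = cosine(vectors["doc2"], vectors["doc3"])
```

Only the IDF and the final weights are recomputed when new documents arrive; the stored TFs are read as they are.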
