I would like to classify a bunch of documents using Apache Mahout with a naive Bayes classifier. I do all the pre-processing, convert my training data set into feature vectors, and then train the classifier. Now I want to pass a bunch of new instances (to-be-classified instances) to my model in order to classify them.

However, I'm under the impression that the pre-processing must be done on my to-be-classified instances and the training data set together. If so, how can I use the classifier in real-world scenarios where I don't have the to-be-classified instances at the time I'm building my model?

How about Apache Spark? How do things work there? Can I build a classification model and then use it to classify unseen instances later?

1 Answer

As of Mahout 0.10.0, Mahout provides a Spark-backed Naive Bayes implementation, which can be run from the CLI, from the Mahout shell, or embedded in an application:

http://mahout.apache.org/users/algorithms/spark-naive-bayes.html

Regarding the classification of new documents outside of the training/testing sets, there is a tutorial here:

http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html

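It explains how to tokenize (using trivial native Java String methods), vectorize, and classify unseen text using the dictionary and the document-frequency counts (df-count) from the training/testing sets. Because those two artifacts are saved at training time, new documents do not need to be pre-processed together with the training data.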

Please note that the tutorial is meant to be run from the Mahout-Samsara environment's spark-shell; however, the basic idea can be adapted and embedded in an application.
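
To make the idea concrete, here is a minimal sketch in plain Python. It is not Mahout's API. The dictionary, df counts, document count, and per-label term weights are all made-up toy values standing in for what training would have saved, and the names `tokenize`, `vectorize`, and `classify` are hypothetical. The weighting is plain tf * log(N / df), which is an assumption and may differ from the weighting Mahout actually applies.

```python
import math
import re

# Artifacts saved at training time (toy values, assumed for illustration).
dictionary = {"goal": 0, "match": 1, "election": 2, "vote": 3}  # term -> index
df_count = {0: 3, 1: 2, 2: 2, 3: 1}  # index -> number of training docs containing the term
num_docs = 4                         # number of training documents

# Toy per-label term weights standing in for a trained naive Bayes model.
model = {
    "sports":   {0: -0.5, 1: -0.7, 2: -3.0, 3: -3.0},
    "politics": {0: -3.0, 1: -3.0, 2: -0.5, 3: -0.7},
}

def tokenize(text):
    # Lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def vectorize(text):
    # Count only terms that exist in the training dictionary;
    # terms never seen during training are dropped.
    tf = {}
    for token in tokenize(text):
        idx = dictionary.get(token)
        if idx is not None:
            tf[idx] = tf.get(idx, 0) + 1
    # Weight each count with the idf derived from the *training* df counts.
    return {idx: n * math.log(num_docs / df_count[idx]) for idx, n in tf.items()}

def classify(text):
    # Score each label as the weighted sum over the sparse vector; pick the best.
    vec = vectorize(text)
    scores = {
        label: sum(weights[idx] * value for idx, value in vec.items())
        for label, weights in model.items()
    }
    return max(scores, key=scores.get)

print(classify("Late goal wins the match"))
print(classify("vote vote election"))
```

The point is that the unseen text is mapped into the feature space fixed at training time, so only the saved dictionary and df counts are needed at classification time, not the training documents themselves.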