I have a set of training text documents and a set of testing text documents. Both sets are very large, so Weka is not a good choice because it takes too long. Instead, I use Mahout, a scalable machine learning and data mining framework (http://mahout.apache.org/). I used Mahout to convert the training documents into Mahout vectors (with ngram = 1). The size of each vector is the number of attributes (features), and each entry is the frequency of the corresponding word in the training documents (I use tf rather than tf-idf). Does anyone know how to convert the testing documents into vectors based on the same features I already built from the training data in Mahout?
1 Answer
The "conversion" you refer to is actually a "prediction", no? Since you have already trained on the data, you presumably have a classification model available.
You can use Mahout's command-line tools for creating vectors from text, documented here:
http://mahout.apache.org/users/basics/creating-vectors-from-text.html
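
To make the underlying idea concrete: the test documents have to be vectorized against the same term dictionary that was built from the training documents, so that every index means the same word in both sets. The sketch below is plain Python and does not use Mahout's API. The function names (`tokenize`, `build_dictionary`, `tf_vector`) and the sample documents are hypothetical, and the tokenizer is a simplifying assumption. It only illustrates unigram tf vectors where words unseen in training are dropped.

```python
from collections import Counter
import re


def tokenize(text):
    # Assumption: lowercase unigrams made of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_dictionary(train_docs):
    # Map every term seen in the TRAINING documents to a fixed index.
    vocab = sorted({tok for doc in train_docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(vocab)}


def tf_vector(doc, dictionary):
    # Raw term-frequency vector over the training dictionary.
    # Terms that are not in the dictionary are ignored.
    vec = [0] * len(dictionary)
    for term, count in Counter(tokenize(doc)).items():
        idx = dictionary.get(term)
        if idx is not None:
            vec[idx] = count
    return vec


# Hypothetical sample data, for illustration only.
train_docs = ["the cat sat on the mat", "the dog barked"]
test_docs = ["the cat barked at the bird"]

dictionary = build_dictionary(train_docs)
train_vectors = [tf_vector(d, dictionary) for d in train_docs]
test_vectors = [tf_vector(d, dictionary) for d in test_docs]
```

Here the test vector has the same length as the training vectors, and "at" and "bird" contribute nothing because they never appeared in training. With Mahout the same principle should apply: reuse the dictionary produced when vectorizing the training set instead of building a new one from the test set.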