I'm trying to build a Java application that trains an SVM model on a set of text documents and categorizes new documents using that model. I have looked at a number of Java packages that can do this, and the libsvm implementation seems to be the best fit.
1) My training input is a text file containing each document's text and its correct label. I understand that libsvm works only on numerical data, which means I will have to convert the documents and their features (words) into numerical form. Is TF-IDF a good way to do this? Is there a Java library that can generate the TF-IDF values?
2) The data has to be fed into the model in the form:
<class label> <feature 1>:<value 1> <feature 2>:<value 2> ... <feature n>:<value n>
In my case, each feature would be a word in the document and its value would be that word's TF-IDF score. Is my interpretation right?
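To make the question concrete, here is a minimal sketch of the conversion as I currently understand it, using only the Java standard library. The class and method names (TfIdfToLibsvm, buildVocabulary, documentFrequency, toLibsvmLine) are my own hypothetical names and not part of libsvm. It assumes simple lowercase tokenization on non-word characters, raw term counts for TF, and log(N/df) for IDF; other weighting schemes are possible. One detail it reflects is that the libsvm data format expects integer feature indices in ascending order, so each word is mapped to an index instead of being written out directly.

```java
import java.util.*;

// Hypothetical helper, not part of libsvm: turns tokenized documents into
// lines of the sparse "<label> <index>:<value> ..." text format.
class TfIdfToLibsvm {

    // Assumption: lowercase and split on non-word characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Assigns each distinct word an integer index, starting at 1.
    static Map<String, Integer> buildVocabulary(List<List<String>> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String w : doc) vocab.putIfAbsent(w, vocab.size() + 1);
        }
        return vocab;
    }

    // Counts, for each word, the number of documents that contain it.
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String w : new HashSet<>(doc)) df.merge(w, 1, Integer::sum);
        }
        return df;
    }

    // Builds one line for a document; words missing from the training
    // vocabulary are skipped, and zero-valued features are omitted.
    static String toLibsvmLine(int label, List<String> doc,
                               Map<String, Integer> vocab,
                               Map<String, Integer> df, int nDocs) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : doc) tf.merge(w, 1, Integer::sum);

        // TreeMap keeps the feature indices in ascending order.
        TreeMap<Integer, Double> features = new TreeMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer index = vocab.get(e.getKey());
            Integer docFreq = df.get(e.getKey());
            if (index == null || docFreq == null) continue;
            double idf = Math.log((double) nDocs / docFreq);
            features.put(index, e.getValue() * idf);
        }

        StringBuilder sb = new StringBuilder(Integer.toString(label));
        for (Map.Entry<Integer, Double> f : features.entrySet()) {
            if (f.getValue() == 0.0) continue;
            sb.append(' ').append(f.getKey()).append(':')
              .append(String.format(Locale.ROOT, "%.4f", f.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up toy documents and labels, for illustration only.
        List<List<String>> docs = Arrays.asList(
                tokenize("good movie"), tokenize("bad movie"));
        int[] labels = {1, -1};

        Map<String, Integer> vocab = buildVocabulary(docs);
        Map<String, Integer> df = documentFrequency(docs);
        for (int i = 0; i < docs.size(); i++) {
            System.out.println(
                    toLibsvmLine(labels[i], docs.get(i), vocab, df, docs.size()));
        }
    }
}
```

With this weighting, a word that appears in every training document gets an IDF of zero and drops out of the line, which I assume is acceptable since the format is sparse. The same vocabulary and document frequencies from training would have to be reused when converting new documents for prediction.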
Are there any similar examples where libsvm has been used for text classification? I have searched but have not found any so far.