
I'm trying to build a Java application that trains an SVM model on a set of text documents and categorizes new documents based on that model. I have looked around a lot for Java packages that can do this and found the libsvm implementation to be the best.

1) My training input is essentially a text file containing each document's text and its correct label. I understand that the libsvm package currently works only on numerical data, which means I will have to convert my text file and the features (words) to a numerical form. Is TF-IDF a good way to do this? Is there a Java library that can generate the TF-IDF values?

2) The data has to be fed into the model in the form

<class label> <feature 1>:<value 1> <feature 2>:<value 2> ...... <feature n>:<value n>

In my case the feature is a word in the document and the value is the TF-IDF value. Is my interpretation right?

Are there any similar examples where libsvm has been used? I have done some searching but had no luck whatsoever!


1 Answer


There are several examples. You could check out the RCV1 data set on the LIBSVM data sets page. It is a document classification data set that is already TF-IDF weighted and stored in the LIBSVM representation. Many papers on the subject exist, such as "Text Categorization with Support Vector Machines" by Joachims.
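
On the format question: in the LIBSVM sparse format each feature is an integer index rather than the word itself, and the indices on a line must be in ascending order, so every word first has to be mapped to an index. Below is a minimal sketch of that conversion in plain Java. It assumes a naive tokenizer (lower-case, split on non-letters) and the plain tf * ln(N / df) weighting; the class `TfIdfLibsvm` and its methods are hypothetical helpers written for illustration, not part of libsvm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of libsvm): turns labelled documents into LIBSVM-format lines.
class TfIdfLibsvm {

    // Naive tokenizer (an assumption): lower-case the text and split on anything that is not a-z.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Returns one "<label> <index>:<tf-idf> ..." line per document.
    static List<String> toLibsvm(List<String> docs, List<Integer> labels) {
        Map<String, Integer> wordIndex = new HashMap<>();        // word -> 1-based feature index
        Map<Integer, Integer> docFreq = new HashMap<>();         // feature index -> number of documents containing it
        List<Map<Integer, Integer>> counts = new ArrayList<>();  // per-document term counts

        // First pass: assign feature indices and collect term and document frequencies.
        for (String doc : docs) {
            Map<Integer, Integer> tf = new TreeMap<>();          // TreeMap keeps indices in ascending order
            for (String word : tokenize(doc)) {
                Integer id = wordIndex.get(word);
                if (id == null) {
                    id = wordIndex.size() + 1;
                    wordIndex.put(word, id);
                }
                tf.merge(id, 1, Integer::sum);
            }
            for (Integer id : tf.keySet()) {
                docFreq.merge(id, 1, Integer::sum);
            }
            counts.add(tf);
        }

        // Second pass: weight each count by ln(N / df) and write the sparse line.
        int n = docs.size();
        List<String> lines = new ArrayList<>();
        for (int d = 0; d < n; d++) {
            StringBuilder sb = new StringBuilder().append(labels.get(d));
            for (Map.Entry<Integer, Integer> e : counts.get(d).entrySet()) {
                double idf = Math.log((double) n / docFreq.get(e.getKey()));
                double weight = e.getValue() * idf;
                if (weight != 0.0) {                             // zero-valued features are simply omitted
                    sb.append(' ').append(e.getKey()).append(':')
                      .append(String.format(Locale.ROOT, "%.4f", weight));
                }
            }
            lines.add(sb.toString());
        }
        return lines;
    }
}
```

For example, calling `TfIdfLibsvm.toLibsvm` with the documents "spam spam offer" (label 1) and "meeting offer" (label -1) should give the lines `1 1:1.3863` and `-1 3:0.6931`. The word "offer" occurs in both documents, so its weight is zero and it is left out. For new documents you would need to reuse the same word-to-index map and document frequencies that were built from the training set.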