
I'm trying to classify text documents into categories, for example:

Document 1: "Basketball is a good sport" ---> Category: Sport
Document 2: "World war 2 .." ---> Category: History
...

My goal is to create a Java interface that uses an SVM algorithm.
So I need an SVM library for Java. I found two:

  • SVMlight
  • LIBSVM

Should I use the first one or the second?

I have done a lot of research, and I found that I need to do two things:

  • I should prepare a training file.
    SVM libraries expect a special format for this file (example: 1 1:317.5).
    But the question is: what should I generate this file from? From the documents only, or from something else?

  • I should have a test file, meaning a new document to classify. Should I convert the new document into the same SVM file format?

Is that correct?

Please guide me, I'm truly lost and I don't know what to do!


1 Answer


Yes, you should convert your documents to the standard SVM format. Your SVM classifier has no idea about text, so first you have to convert your texts (both train and test) to that format. You can start with Weka: it has a simple GUI, and you can classify your datasets with a few clicks. Once you are confident in your classifier and its accuracy, implement it in Java. You can use Weka from your Java code too.
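
To make the conversion step concrete, here is a minimal sketch in plain Java. It assumes a simple bag-of-words representation with raw word counts; real setups often use TF-IDF weights and stop-word removal instead. The class `SvmFormatSketch` and its methods are hypothetical names made up for this example, not part of LIBSVM, SVMlight, or Weka. The numeric labels (1 = Sport, 2 = History) are also an assumption. The point it shows is that the word-to-index vocabulary is built from the training documents only, and the same vocabulary is reused for the new document, so that feature indices mean the same thing in both files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: converts raw text to "label index:value ..." lines.
public class SvmFormatSketch {

    // Maps each word seen during training to a 1-based feature index.
    private final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    // Lowercase the text and split on anything that is not a letter or digit.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Build the vocabulary from the training documents only.
    public void fit(List<String> trainingDocs) {
        for (String doc : trainingDocs) {
            for (String token : tokenize(doc)) {
                vocabulary.putIfAbsent(token, vocabulary.size() + 1);
            }
        }
    }

    // Convert one document to a sparse line; words outside the
    // vocabulary are skipped, and indices come out in ascending order.
    public String toSvmLine(int label, String doc) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String token : tokenize(doc)) {
            Integer index = vocabulary.get(token);
            if (index != null) {
                counts.merge(index, 1, Integer::sum);
            }
        }
        StringBuilder line = new StringBuilder().append(label);
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            line.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        SvmFormatSketch sketch = new SvmFormatSketch();
        sketch.fit(Arrays.asList("Basketball is a good sport", "World war 2"));

        // Training lines: assumed labels 1 = Sport, 2 = History.
        System.out.println(sketch.toSvmLine(1, "Basketball is a good sport"));
        // prints "1 1:1 2:1 3:1 4:1 5:1"
        System.out.println(sketch.toSvmLine(2, "World war 2"));
        // prints "2 6:1 7:1 8:1"

        // New document: the label is a placeholder (0) because it is unknown.
        System.out.println(sketch.toSvmLine(0, "Football is a sport"));
        // prints "0 2:1 3:1 5:1" ("football" was never seen in training)
    }
}
```

Each printed line goes into the training file or the test file. The first column of the test file is only a placeholder when the true category is unknown; the classifier's prediction is what you read back.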

PS: 1- WEKA Text Classification for First Time & Beginner Users : http://www.youtube.com/watch?v=IY29uC4uem8

2- http://www.cs.waikato.ac.nz/ml/weka/