I am trying to use an SVM for a text classification problem. I have found an SVM implementation called SVM light and its derivative SVM multiclass (for classification problems with more than two classes). However, I cannot work out the format of the files used to train and test the classifier. I understand that I need to build a feature vector (let us assume that each word in the document is a feature) and then, to create the training file, specify for each document its class, the features it contains (that is, the index of each feature in the feature vector), and a feature value. I am confused about this 'feature value'. What is it supposed to be? Is it the count of that feature in the document, or something else? The example training file on the website does not use integers as feature values, which suggests that the value is not a raw frequency.

Also, I was wondering whether there is a tool to create this training file from a plain document. I generally work in Java, so a Java package that does this would be good enough for me. I tried searching Google but could not find anything relevant.

I would also like to know whether there is a better way to use an SVM for text classification.

Any help in this regard would be greatly appreciated.

1 Answer

You can use simple binary features (did the word occur or not?) or raw counts. But you probably want to scale the counts logarithmically: more frequent words are more important, but a word occurring 10 times is not 10 times as important as a word occurring once.
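
To make the three options concrete, here is a minimal Java sketch. The class and method names (`TermWeights`, `counts`, `binary`, `logScaled`) are made up for illustration, the tokenizer is a naive split on non-word characters, and `1 + ln(count)` is only one common way to damp counts, not the only one.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; none of these names come from SVM light.
class TermWeights {

    /** Counts how often each lower-cased token occurs in one document. */
    static Map<String, Integer> counts(String text) {
        Map<String, Integer> tf = new TreeMap<>();
        // Naive tokenizer (an assumption): split on runs of non-word characters.
        for (String tok : text.toLowerCase().split("\\W+")) {
            if (!tok.isEmpty()) {
                tf.merge(tok, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Binary feature: 1 if the word occurs at all, otherwise 0. */
    static double binary(int count) {
        return count > 0 ? 1.0 : 0.0;
    }

    /** Log-scaled count: 1 + ln(count), so 10 occurrences weigh about 3.3, not 10. */
    static double logScaled(int count) {
        return count > 0 ? 1.0 + Math.log(count) : 0.0;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = counts("the cat and the hat");
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int c = e.getValue();
            System.out.println(e.getKey() + " count=" + c
                    + " binary=" + binary(c) + " log=" + logScaled(c));
        }
    }
}
```

Whichever variant you pick, the resulting number is what goes into the 'feature value' slot of the training file.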

You can also weight the counts by how often each word occurs across all documents. For example, even if the word "the" is frequent in a document, it says little about that document, because it is frequent everywhere. Have a look at tf-idf.
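
As a rough sketch of how this fits together, the Java below computes tf-idf values and writes one training line per document in the sparse `<class> <index>:<value> ...` layout described in the question. The class name `TrainFileSketch` and the method `toLines` are hypothetical, the tokenizer is naive, and `tf * ln(N / df)` is just one common tf-idf variant. I am also assuming, from my reading of the SVM light documentation, that feature indices start at 1 and must appear in increasing order on each line; check this against the version you use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not part of SVM light or SVM multiclass.
class TrainFileSketch {

    /** Builds one line per document: "<label> <index>:<tf-idf> ...". */
    static List<String> toLines(List<String> docs, List<Integer> labels) {
        Map<String, Integer> index = new HashMap<>();    // word -> 1-based feature index
        Map<String, Integer> docFreq = new HashMap<>();  // word -> number of docs containing it
        List<Map<String, Integer>> termCounts = new ArrayList<>();

        // Pass 1: count terms per document and document frequencies.
        for (String doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String tok : doc.toLowerCase().split("\\W+")) {
                if (tok.isEmpty()) continue;
                index.putIfAbsent(tok, index.size() + 1);
                tf.merge(tok, 1, Integer::sum);
            }
            for (String tok : tf.keySet()) {
                docFreq.merge(tok, 1, Integer::sum);
            }
            termCounts.add(tf);
        }

        // Pass 2: emit tf * ln(N / df), sorted by feature index.
        List<String> lines = new ArrayList<>();
        for (int d = 0; d < docs.size(); d++) {
            TreeMap<Integer, Double> row = new TreeMap<>();
            for (Map.Entry<String, Integer> e : termCounts.get(d).entrySet()) {
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                double value = e.getValue() * idf;
                // Words that occur in every document get weight 0 and are left out.
                if (value > 0) row.put(index.get(e.getKey()), value);
            }
            StringBuilder sb = new StringBuilder(String.valueOf(labels.get(d)));
            for (Map.Entry<Integer, Double> e : row.entrySet()) {
                sb.append(String.format(Locale.ROOT, " %d:%.4f", e.getKey(), e.getValue()));
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the cat sat", "the dog barked");
        List<Integer> labels = List.of(1, 2);
        toLines(docs, labels).forEach(System.out::println);
    }
}
```

For the two toy documents in `main`, this prints `1 2:0.6931 3:0.6931` and `2 4:0.6931 5:0.6931`. The word "the" (feature 1) occurs in both documents, so its idf is ln(2/2) = 0 and it is dropped from both lines. That also explains why the example training file contains non-integer values.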

Is SVM the right choice? I would say that finding the right features is more important than the exact algorithm, especially in early stages.