Scikit learn - How to use SVM and Random Forest for text classification?

Question

I have a set of trainFeatures and a set of testFeatures with positive, neutral and negative labels:

trainFeats = negFeats + posFeats + neutralFeats
testFeats  = negFeats + posFeats + neutralFeats

For example, one entry inside the trainFeats is

(['blue', 'yellow', 'green'], 'POSITIVE')

The same holds for the list of test features, so each set carries its own labels. My question is: how can I use the scikit-learn implementations of the Random Forest classifier and SVM to get the accuracy of each classifier, together with precision and recall scores for each class? The problem is that I am currently using words as features, while from what I have read these classifiers require numbers. Is there a way to achieve this without changing functionality? Many thanks!

dnll dnll · Accepted Answer · 2014-02-23T23:23:44

You can look into this scikit-learn tutorial, especially the section on learning and predicting, for how to create and use a classifier. The example uses SVM; however, it is simple to use RandomForestClassifier instead, as all classifiers implement the fit and predict methods.
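As a rough illustration of that shared interface, the sketch below trains an SVM and a random forest with the same two calls. The toy numeric data, the variable names, and the parameter choices (linear kernel, 50 trees, fixed seed) are all assumptions made up for this example, not part of the question.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Toy numeric data, invented for illustration: two features per sample.
X_train = [[0, 0], [0, 1], [5, 5], [5, 6]]
y_train = ["NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
X_test = [[0, 0.5], [5, 5.5]]

classifiers = (
    SVC(kernel="linear"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

predictions = {}
for clf in classifiers:
    clf.fit(X_train, y_train)  # same training call for both estimators
    predictions[type(clf).__name__] = list(clf.predict(X_test))  # same prediction call

for name, labels in predictions.items():
    print(name, labels)
```

Swapping one classifier for the other only changes the constructor line; the fit and predict calls stay the same.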

When working with text features you can use CountVectorizer or DictVectorizer to turn the words into the numeric feature vectors these classifiers expect. Take a look at feature extraction and especially section 4.1.3.
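For data shaped like the question's (word list, label) pairs, one possible approach is to count the words in each entry and pass the counts to DictVectorizer. This is only a sketch: the sample entries, the split helper, and the variable names are hypothetical.

```python
from collections import Counter

from sklearn.feature_extraction import DictVectorizer

# Hypothetical entries in the question's (tokens, label) format.
train_feats = [
    (["blue", "yellow", "green"], "POSITIVE"),
    (["red", "black"], "NEGATIVE"),
    (["grey", "white", "grey"], "NEUTRAL"),
]
test_feats = [(["blue", "green", "purple"], "POSITIVE")]


def split(feats):
    """Turn (tokens, label) pairs into word-count dicts and a list of labels."""
    counts = [Counter(tokens) for tokens, _ in feats]
    labels = [label for _, label in feats]
    return counts, labels


train_counts, y_train = split(train_feats)
test_counts, y_test = split(test_feats)

vectorizer = DictVectorizer()
# Learn the word-to-column mapping from the training data only.
X_train = vectorizer.fit_transform(train_counts)
# Reuse that mapping for the test data; unseen words such as "purple" are dropped.
X_test = vectorizer.transform(test_counts)

print(X_train.shape, X_test.shape)
```

The resulting sparse matrices and label lists can be passed straight to a classifier's fit and predict methods. Fitting the vectorizer on the training set alone keeps the test set from influencing the vocabulary.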

You can find an example for classifying text documents here.

Then you can get the per-class precision and recall of the classifier with classification_report from sklearn.metrics.
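A minimal sketch of the evaluation step, using accuracy_score for the overall accuracy asked about in the question. The label lists are invented and stand in for the real test labels and the output of a classifier's predict call.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up gold labels and predictions, standing in for y_test and clf.predict(X_test).
y_true = ["POSITIVE", "NEGATIVE", "NEUTRAL", "POSITIVE", "NEGATIVE", "NEUTRAL"]
y_pred = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE", "NEUTRAL", "NEUTRAL"]

# Fraction of predictions that match the gold labels.
accuracy = accuracy_score(y_true, y_pred)

# Text table with precision, recall, F1 and support for each class.
report = classification_report(y_true, y_pred, zero_division=0)

print("accuracy:", accuracy)
print(report)
```

The report is a plain string, so it can be printed or logged once per classifier to compare SVM and random forest side by side.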
