I have data in an Excel file that I need to use for multi-label classification with an SVM. It has two columns, as shown below: 'tweet' (A, B, C, D, E, F, G) and 'category' (X, Y, Z).

tweet   category
A       X
B       Y
C       Z
D       X,Y
E       Y,Z
F       X,Y,Z
G       X,Z

Given a tweet, I want to train my model to predict the categories it belongs to. Both the tweets and the categories are text. I am trying to use Weka's LibSVM classifier, as I read that it does multi-label classification. I converted the CSV file to an ARFF file, loaded it in Weka, and ran the "LibSVM" classifier. However, I am getting very poor results, as shown below. Any idea what I am doing wrong? Is multi-label text classification even possible with "LibSVM"?

Correctly Classified Instances       82       25.9494 %
Incorrectly Classified Instances    234       74.0506 %
Kappa statistic                       0
Mean absolute error                   0.0423
Root mean squared error               0.2057
Relative absolute error              89.9823 %
Root relative squared error         134.3377 %
Total Number of Instances           316

1 Answer

SVM can definitely be used for multiclass classification. I have not used Weka's LibSVM before, but if you haven't already, you will need to clean the data before feeding the text into any sort of classifier. The kind of cleaning depends on your classification task, but the following techniques are commonly used in practice for text analysis:

1) Remove Twitter handles from your text.

2) Remove stop words, or any words that you know for sure do not affect your classification. Maybe you can preserve only pronouns and remove all other words. You can use POS tagging to perform this task. More info here

3) Remove punctuation.

4) Use n-grams to get contextual meaning out of your text. This site has a good explanation of how that works. Essentially, you treat a sequence of words as a feature rather than using a single word as a data point in your model. Mind you, this may increase the amount of memory your model occupies while training.

5) Remove words that occur either too frequently or too rarely in your data set.

6) Balance your classes (categories, in your case). This means that before training your model, you make sure the training data has a similar number of X, Y and Z categories. It is possible that your training data had a lot of tweets that classify to X and Y, while your test set had tweets that mostly mapped to the Z category.
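
As a rough illustration of the steps above, here is a minimal Python sketch using only the standard library. It is not tied to Weka, and everything in it is an assumption made for the example: the helper names (clean_tweet, ngrams, filter_by_frequency, label_counts), the tiny stop-word list, and the frequency thresholds. It only shows the general idea of each step, so adapt it to your own data.

```python
import re
import string
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real one is much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in"}


def clean_tweet(text):
    """Steps 1-3: lowercase, drop @handles, punctuation and stop words."""
    # Handles must go first: stripping punctuation first would remove the
    # "@" and leave the user name behind as an ordinary word.
    text = re.sub(r"@\w+", " ", text.lower())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def ngrams(tokens, n=2):
    """Step 4: return word n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def filter_by_frequency(docs, min_docs=2, max_ratio=0.9):
    """Step 5: keep tokens found in at least min_docs documents and in at
    most max_ratio of all documents (both thresholds are arbitrary here)."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    limit = max_ratio * len(docs)
    keep = {tok for tok, df in doc_freq.items() if min_docs <= df <= limit}
    return [[tok for tok in doc if tok in keep] for doc in docs]


def label_counts(categories):
    """Step 6: count each label in comma-separated category strings."""
    return Counter(
        label.strip() for cats in categories for label in cats.split(",")
    )


if __name__ == "__main__":
    tokens = clean_tweet("@bob The movie is GREAT, really great!")
    print(tokens)
    print(ngrams(tokens))
    # Category column from the question's example table.
    print(label_counts(["X", "Y", "Z", "X,Y", "Y,Z", "X,Y,Z", "X,Z"]))
```

Running label_counts on the seven example rows from the question gives four occurrences each of X, Y and Z, so that toy sample is balanced; the same check on the real 316 instances would show whether some category dominates.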