5
votes

I'm using the Python interface for libsvm. After selecting the best C and gamma parameters (RBF kernel) with grid search, I train the model and cross validate it (5-fold, if that's relevant), and the accuracy I get is exactly the ratio of labels in my training data set.

I have 3947 samples; 2898 of them have label -1 and the rest have label 1, so the majority class makes up 73.4229% of the samples.

When I train the model and cross validate with 5 folds, this is the output I get:

optimization finished, #iter = 1529
nu = 0.531517
obj = -209.738688, rho = 0.997250
nSV = 1847, nBSV = 1534
Total nSV = 1847
Cross Validation Accuracy = 73.4229%
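For reference, this is roughly the call that produces the output above (a minimal sketch: the file name 'train.libsvm' and the C/gamma values stand in for whatever my grid search actually picked, and the import path depends on how libsvm is installed):

# Minimal sketch of a 5-fold cross validation run with libsvm's Python interface.
# 'train.libsvm' and the C/gamma values are placeholders.
from svmutil import svm_read_problem, svm_train

y, x = svm_read_problem('train.libsvm')   # labels are +1 / -1
params = '-t 2 -c 8 -g 0.125 -v 5'        # RBF kernel, 5-fold cross validation
cv_accuracy = svm_train(y, x, params)     # with -v, returns the CV accuracy
print('CV accuracy: %.4f%%' % cv_accuracy)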

Does this mean that the SVM is not taking the features into account? Or is the data at fault here? Are the two related at all? I'm just not able to get past the 73.4229 number. Also, the number of support vectors is supposed to be much smaller than the size of the dataset, but here it's 1847 out of 3947, which doesn't seem small at all.

In general, what does it mean when the cross validation accuracy is the same as the ratio of labels in the dataset?


1 Answer

6
votes

Your data set is unbalanced: a large majority of the samples belong to the same class. The result is what's sometimes called a default or majority-class classifier, which achieves high accuracy simply by classifying everything as the majority class. So you're right that it's not really taking the features into account, and the cause is the data.
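You can check this directly: a classifier that always predicts the majority class already achieves exactly the accuracy you're seeing (a quick sanity check using the class counts from your question):

# Accuracy of a classifier that always predicts the majority class (-1).
# Counts taken from the question: 2898 of 3947 samples are labelled -1.
n_total = 3947
n_majority = 2898
baseline_accuracy = 100.0 * n_majority / n_total
print('%.4f%%' % baseline_accuracy)   # 73.4229% -- the same as your CV accuracy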

The libsvm README suggests varying the per-class penalty weights (the -wi option) to deal with this, as in the sketch below. Here's a related question: https://stats.stackexchange.com/questions/20948/best-way-to-handle-unbalanced-multiclass-dataset-with-svm
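For example, you could weight the minority class by the inverse class ratio as a starting point. This is only a sketch: the weight 2898/1049 ≈ 2.76 comes from the counts in your question, the file name and C/gamma are placeholders, and the weights should really be re-tuned along with C and gamma:

# Sketch: give the minority class (+1) a larger penalty via libsvm's -wi option,
# which scales C for that class. The weight 2.76 (= 2898/1049) is just a starting
# point; file name and C/gamma are placeholders and deserve a fresh grid search.
from svmutil import svm_read_problem, svm_train

y, x = svm_read_problem('train.libsvm')
params = '-t 2 -c 8 -g 0.125 -w1 2.76 -w-1 1 -v 5'
cv_accuracy = svm_train(y, x, params)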

For more information about unbalanced data, see section 7 of A User's Guide to Support Vector Machines.