1 vote

I know there are multiple questions about this, but none of them addresses my particular problem.

I'll simplify my problem to make it clearer. Let's say I have multiple sentences from an English document and I want to classify them using a one-class SVM (in libsvm) so that I can detect anomalies (e.g. a German sentence) afterwards.

For training: I have samples of one class only (let's assume the other classes do not exist beforehand). I extract all 3-grams (so the feature space includes at most 16777216 different features) and save them in libsvm format (label=1, just in case that matters).
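
For reference, this is roughly how the libsvm file could be built (just a sketch of my preprocessing: byte-level 3-grams with counts as feature values; /tmp/english-sentences.txt is a placeholder for the raw sentence file):

# python2 sketch (assumed preprocessing): byte-level 3-grams, counts as
# feature values, one libsvm line per sentence with label 1
def sentence_to_libsvm(sentence, label=1):
    counts = {}
    for i in range(len(sentence) - 2):
        a, b, c = [ord(ch) for ch in sentence[i:i + 3]]
        idx = a * 65536 + b * 256 + c + 1  # index in 1 .. 256**3 = 16777216
        counts[idx] = counts.get(idx, 0) + 1
    # libsvm format requires ascending feature indices
    feats = " ".join("%d:%d" % (k, counts[k]) for k in sorted(counts))
    return "%d %s" % (label, feats)

# /tmp/english-sentences.txt is a placeholder for the raw sentence file
with open("/tmp/english-3-grams.libsvm", "w") as out:
    for line in open("/tmp/english-sentences.txt"):
        out.write(sentence_to_libsvm(line.strip()) + "\n")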

Now I want to estimate my parameters. I tried grid.py with additional parameters; however, the runtime is too long for RBF kernels. So I am trying linear kernels instead (for that, grid.py can be changed to use only one value of gamma, since gamma does not matter for linear kernels).
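
To get a feeling for the runtime difference, a quick comparison of the two kernels on a subsample can look like this (just a sketch, using the same training file as in the snippet further down):

# sketch: compare training runtime of linear (-t 0) vs. RBF (-t 2) kernel
import time
from svmutil import *

y, x = svm_read_problem("/tmp/english-3-grams.libsvm")
for name, t in (("linear", 0), ("rbf", 2)):
    start = time.time()
    svm_train(y[:1000], x[:1000], "-s 2 -t %d -n 0.5 -q" % t)
    print("%s kernel: %.1f s on 1000 samples" % (name, time.time() - start))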

In any case, the smallest c that grid.py tests is reported as the best solution (does -c matter at all for linear kernels?).

Furthermore, it does not matter how much I change the -n (nu) value; the same relation between the scores is achieved every time (even though the number of support vectors changes). The scores are gathered using the Python implementation. ("Relation between scores" means that, e.g., at first they are -1 and -2; I change nu and afterwards they are e.g. -0.5 and -1, so if I sort them, the same order always appears, as in this example):

# python2
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
from svmutil import *

y, x = svm_read_problem("/tmp/english-3-grams.libsvm")   # 5000 sentence samples
ym, xm = svm_read_problem("/tmp/german-3-grams.libsvm")  # 50 sentence samples

# -s 2 = one-class SVM, -t 2 = RBF kernel, -n = nu
m = svm_train(y, x, "-s 2 -t 2 -n 0.5")

# do the prediction in one step on a mixed test set (English + German)
p_l, p_a, p_v = svm_predict(y[:100] + ym[:100], x[:100] + xm[:100], m)

# p_v holds the decision values (one-element lists); these are our scores
scores = [v[0] for v in p_v]

# plot a ROC curve; build the true labels from the actual slice lengths
# (English = 1, German = -1)
true_labels = [1] * len(x[:100]) + [-1] * len(xm[:100])
fpr, tpr, thresholds = roc_curve(true_labels, scores)
plt.plot(fpr, tpr)
plt.show()

Here, the exact same ROC curve is obtained every time (even though -n is varied). Even if there is only 1 support vector, the same curve is shown.
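
To illustrate what -n does change, here is a small sweep over nu (just a sketch, reusing y and x from above): in a one-class SVM, nu is an upper bound on the fraction of training samples treated as outliers and a lower bound on the fraction of support vectors, which matches the observation that the number of support vectors changes.

# sketch: how -n affects the number of support vectors and the fraction of
# training sentences predicted as outliers
for nu in (0.01, 0.1, 0.3, 0.5):
    m_nu = svm_train(y, x, "-s 2 -t 2 -n %g -q" % nu)
    labels, _, _ = svm_predict(y, x, m_nu)
    outlier_frac = sum(1 for l in labels if l < 0) / float(len(labels))
    # m_nu.l is the number of support vectors in the libsvm model struct
    print("nu=%.2f  #SV=%d  training outlier fraction=%.3f"
          % (nu, m_nu.l, outlier_frac))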

Hence, my questions (let's assume a maximum of 50000 samples per training):

  - Why does -n not change anything for the one-class training process?
  - Which parameters do I need to tune for a one-class SVM?
  - Is a linear kernel the best approach (also with regard to runtime)? A grid search over the RBF kernel parameters takes ages for such big datasets.
  - LIBLINEAR is not being used because I want to do anomaly detection (= one-class SVM).

Best regards, mutilis

Why don't you use feature selection to reduce the feature space and improve the training time (and grid-search time) that way? Grid-search time depends on the step size for the parameters and the size of the feature space... – rzo1
@rzo Throwing away features isn't a good approach, I think. But I found that LIBLINEAR is able to do very fast calculations, even with a huge set and a huge number of features. However, this ends up as a linear classifier/anomaly detector. – mutilis
The literature suggests feature selection, e.g. InformationGain for text classification: courses.ischool.berkeley.edu/i256/f06/papers/… You can give it a try and compare your results with and without feature selection. It will speed up the process and you can go for RBF kernels... – rzo1

1 Answer

4 votes

The performance impact is a result of your huge feature space of 16777216 elements, which leads to very sparse feature vectors for inputs such as German sentences.

A study by Yang & Petersen, A Comparative Study on Feature Selection in Text Categorization shows, that aggressive feature-selection does not necessarily decrease classification accuracy. I achieved similar results, while performing text classification for (medical) German text documents.

As stated in the comments, LIBLINEAR is fast because it is built for such sparse data. However, you end up with a linear classifier, with all its pitfalls and benefits.

I would suggest the following strategy:

  1. Perform aggressive feature selection (e.g. with InformationGain) down to a remaining feature space of size N.

  2. Increase N stepwise in combination with cross-validation and find the best matching N for your data.

  3. Go for a grid search with the N found in 2 (a rough sketch of steps 1-3 follows below).

  4. Train your classifier with the best matching parameters found in 3 and the N found in 2.
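
A rough sketch of steps 1-3 could look like this (scikit-learn's chi2 score stands in for InformationGain here, and a small labeled development set of English and German sentences is assumed to be available for scoring the features and validating N):

# rough sketch of steps 1-3, not a drop-in solution
import numpy as np
from scipy.sparse import vstack
from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

X_en, _ = load_svmlight_file("/tmp/english-3-grams.libsvm", n_features=16777216)
X_de, _ = load_svmlight_file("/tmp/german-3-grams.libsvm", n_features=16777216)

X_dev = vstack([X_en[:500], X_de])                       # labeled development set
y_dev = np.r_[np.ones(500), -np.ones(X_de.shape[0])]

best = None
for n in (1000, 5000, 20000, 50000):                     # step 2: increase N stepwise
    selector = SelectKBest(chi2, k=n).fit(X_dev, y_dev)  # step 1: feature selection
    clf = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")  # step 3: tune nu/gamma here
    clf.fit(selector.transform(X_en[500:]))              # train on English only
    scores = clf.decision_function(selector.transform(X_dev)).ravel()
    auc_n = roc_auc_score(y_dev, scores)
    print("N=%d  AUC=%.3f" % (n, auc_n))
    if best is None or auc_n > best[1]:
        best = (n, auc_n)
print("best N:", best[0])
# in practice, keep separate splits for feature scoring and validation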