2
votes

I plan on using scikit-learn's SVM for class prediction. I have a two-class dataset consisting of about 100 experiments, each of which encapsulates my data points (vectors) plus their classification. Training an SVM according to http://scikit-learn.org/stable/modules/svm.html should be straightforward: I put all vectors in one array, generate another array with the corresponding class labels, and train the SVM. However, in order to run leave-one-out error estimation, I need to leave out a specific subset of vectors: one whole experiment. How do I achieve that with the available score function?
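For reference, the stacking step described above might look like this (the array shapes, labels, and kernel below are illustrative stand-ins, not my actual data):

```python
import numpy as np
from sklearn import svm

# Toy stand-in for the real data: all experiments' vectors stacked
# into one (n_samples, n_features) array, with one label per vector.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)                 # all vectors from all experiments
y = rng.randint(0, 2, size=100)      # two-class labels

clf = svm.SVC(kernel='linear')       # kernel choice is illustrative
clf.fit(X, y)                        # train on everything (no hold-out yet)
```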

Cheers, EL

1

1 Answer

5
votes

You could manually train on everything but the one observation, using numpy indexing to drop it out. Then you can use any of sklearn's helpers to evaluate the classification. For example:

import numpy as np
from sklearn import svm

clf = svm.SVC(...)
idx = np.arange(len(observations))
preds = np.zeros(len(observations))
for i in idx:
    # Boolean mask selecting every observation except the i-th
    is_train = idx != i
    clf.fit(observations[is_train, :], labels[is_train])
    # predict expects a 2D array, so pass the row as a 1-row slice
    preds[i] = clf.predict(observations[i:i + 1, :])[0]

# Fraction of left-out observations predicted correctly
loo_accuracy = np.mean(preds == labels)

Alternatively, scikit-learn has a helper to do leave-one-out, and another helper to get cross-validation scores:

import numpy as np
from sklearn import svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

clf = svm.SVC(...)
loo = LeaveOneOut()
# One score per left-out observation (0 or 1 here)
was_right = cross_val_score(clf, observations, labels, cv=loo)
total_acc = np.mean(was_right)

See the user guide for more. cross_val_score actually returns a score for each fold (which is a little strange IMO), but since we have one fold per observation here, each score is just 0 if that prediction was wrong and 1 if it was right.

Of course, leave-one-out is very slow and gives high-variance error estimates to boot, so you should probably use KFold instead.
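A k-fold version is nearly identical; a minimal sketch on toy data (the arrays, kernel, and fold count are illustrative):

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import KFold, cross_val_score

# Toy stand-in for the real vectors and labels
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = rng.randint(0, 2, size=100)

clf = svm.SVC(kernel='linear')
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy score per fold, 10 folds instead of 100
scores = cross_val_score(clf, X, y, cv=kf)
print(scores.mean(), scores.std())
```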