
I am using scikit-learn's one-class SVM classifier, OneClassSVM, to detect outliers in a dataset. The dataset has 30000 samples with 1024 features, and I use 10 percent of them as training data.

from sklearn import svm

clf = svm.OneClassSVM(nu=0.001, kernel="rbf", gamma=1e-5)
clf.fit(trset)
dist2hptr = clf.decision_function(trset)  # signed distance to the decision boundary
tr_y = clf.predict(trset)                 # +1 or -1 for each sample

As shown above, I calculate the distance of each sample to the decision boundary using decision_function(x). When I compare the prediction results with the distance results, the distance is always positive for samples predicted as +1 and negative for samples predicted as -1.

I thought a distance doesn't have a sign, since it does not involve direction. I want to understand how the distances are calculated by scikit-learn's OneClassSVM. Does the sign simply indicate on which side of the decision hyperplane computed by the SVM the sample lies?

Please help.


1 Answer


sklearn's OneClassSVM is an implementation of the method from the following paper, as explained in the scikit-learn documentation:

Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. 2001. Estimating the Support of a High-Dimensional Distribution. Neural Comput. 13, 7 (July 2001), 1443-1471. DOI: https://doi.org/10.1162/089976601750264965

Let's have a look at the abstract of that paper here:

Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.

We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement.

So the abstract defines the function f that OneClassSVM estimates, and sklearn follows this convention: decision_function returns the value of f, which is positive when a sample lies inside the estimated region S (an inlier, predicted +1) and negative when it lies outside S (an outlier, predicted -1). The sign is therefore not an error; it encodes which side of the learned boundary the sample falls on, while the magnitude reflects how far from the boundary it is.
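You can verify this agreement between decision_function and predict yourself. The sketch below uses synthetic 2-D data as a stand-in for your 30000x1024 dataset (the data, shapes, and variable names here are illustrative assumptions, not from your setup):

```python
import numpy as np
from sklearn import svm

# Toy training set standing in for the real data
rng = np.random.RandomState(0)
X = rng.randn(200, 2)

clf = svm.OneClassSVM(nu=0.001, kernel="rbf", gamma=1e-5)
clf.fit(X)

dist = clf.decision_function(X).ravel()  # signed value of f for each sample
pred = clf.predict(X)                    # +1 (inlier) or -1 (outlier)

# Samples with positive f are predicted +1, negative f are predicted -1
assert np.all(pred[dist > 0] == 1)
assert np.all(pred[dist < 0] == -1)
```

In other words, predict is just a thresholding of decision_function at zero, which is why the signs always line up in your experiment.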