1 vote

I came across an SVM example, but I didn't understand it. I would appreciate it if somebody could explain how the prediction works. Please see the explanation below:

The dataset has 10,000 observations with 5 attributes (Sepal Width, Sepal Length, Petal Width, Petal Length, Label). The label is positive if the observation belongs to the I. setosa class, and negative if it belongs to some other class.

There are 6000 observations for which the outcome is known (i.e. they belong to the I. setosa class, so their label is positive). The labels of the remaining 4000 are unknown, so they were assumed to be negative. The 6000 known observations plus 2500 randomly selected observations from the remaining 4000 form the set used for 10-fold cross-validation. An SVM is then trained on these 8500 observations with 10-fold cross-validation, and the ROC curve is plotted.
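As I understand it, the setup is roughly like the sketch below (this is not the actual code from the discussion; the small scikit-learn iris set stands in for the 8500 observations, and all names and parameters are my own guesses):

    # Rough sketch of the described setup, assuming scikit-learn.
    # The 150-row iris set stands in for the 8500 labelled observations.
    from sklearn.datasets import load_iris
    from sklearn.metrics import auc, roc_curve
    from sklearn.model_selection import StratifiedKFold, cross_val_predict
    from sklearn.svm import SVC

    iris = load_iris()
    X = iris.data                              # sepal/petal length and width
    y = (iris.target == 0).astype(int)         # 1 = I. setosa, 0 = everything else

    # 10-fold cross-validation: every observation is scored by a model that
    # never saw it during training; the pooled scores give one ROC curve.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_predict(SVC(kernel="rbf"), X, y, cv=cv,
                               method="decision_function")
    fpr, tpr, _ = roc_curve(y, scores)
    print("cross-validated AUC:", auc(fpr, tpr))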

Where are we predicting here? The set has 6000 observations whose labels are already known. How did the remaining 2500 get negative labels? When the SVM is used, some observations that are positive get a negative prediction. The prediction didn't make any sense to me here. Why are the other 1500 observations excluded?

I hope my explanation is clear. Please let me know if I haven't explained anything clearly.

Why don't you post the source of the SVM analysis? (The data set itself is a classic: en.wikipedia.org/wiki/Iris_flower_data_set) – dan3
I didn't come across this example online. It is just an example that came up during a discussion on SVMs. – acc_so

2 Answers

1 vote

I think that the issue is a semantic one: you refer to the set of 4000 samples as being both "unknown" and "negative" -- which of these applies is the critical difference.

If the labels for the 4000 samples are truly unknown, then I'd do a 1-class SVM using the 6000 labelled samples [cf. validation below]. The predictions would then be generated by testing the N=4000 set to assess whether or not each sample belongs to the setosa class.
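For that 1-class route, a rough sketch (scikit-learn's OneClassSVM on synthetic stand-in data, since I don't have your actual rows) could look like this:

    # Fit only on the labelled setosa-like rows, then ask whether each of the
    # "unknown" rows looks like that class. Synthetic stand-in data for the
    # 6000 labelled / 4000 unknown samples; nu and gamma are illustrative.
    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X_setosa = rng.normal([5.0, 3.4, 1.5, 0.2], 0.3, size=(6000, 4))
    X_unknown = rng.normal([6.0, 3.0, 4.5, 1.5], 0.5, size=(4000, 4))

    model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_setosa)
    pred = model.predict(X_unknown)        # +1 = looks like setosa, -1 = does not
    print((pred == 1).sum(), "of the 4000 unknown rows predicted setosa-like")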

If instead we have 6000 setosa and 4000 (known) non-setosa, we could construct a binary classifier on the basis of this data [cf. validation below], and then use it to predict setosa vs. non-setosa on any other available unlabelled data.

Validation: Usually, as part of the model construction process, you take only a subset of your labelled training data and use it to configure the model. You then apply the model to the unused subset (ignoring its labels) and compare what the model predicts against the true labels in order to assess error rates. This applies to both the 1-class and the 2-class situations above.
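A compact sketch covering both the 2-class construction and this validation step (again synthetic stand-in data; the split size and kernel are illustrative choices, not recommendations):

    # Train on part of the labelled data, predict on the held-out part with
    # the labels hidden, and compare the predictions against the truth.
    import numpy as np
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([5.0, 3.4, 1.5, 0.2], 0.3, size=(6000, 4)),
                   rng.normal([6.0, 3.0, 4.5, 1.5], 0.5, size=(4000, 4))])
    y = np.array([1] * 6000 + [0] * 4000)  # 1 = setosa, 0 = known non-setosa

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    clf = SVC(kernel="rbf").fit(X_train, y_train)
    pred = clf.predict(X_test)             # labels ignored while predicting
    print("held-out error rate:", 1 - accuracy_score(y_test, pred))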

Summary: if all of your data are labelled, then usually one will still make predictions for a subset of them (ignoring the known labels) as part of the model validation process.

0 votes

Your SVM classifier is trained to tell whether a new (unknown) instance is or is not an instance of I. setosa. In other words, you are predicting whether the new, unlabeled instance is I. setosa or not.

You probably found the incorrectly classified results because your training data has many more instances of the positive class than of the negative one. Also, it's common to have some margin of error.

Summarizing: your SVM classifier learned how to identify I. setosa instances; however, it was provided with too few examples of non-I. setosa instances, which is likely to give you a biased model.
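If that imbalance is the problem, one common mitigation (a sketch, not necessarily what was done in the original analysis) is to re-weight the classes, e.g. in scikit-learn:

    # class_weight="balanced" scales each class's penalty inversely to its
    # frequency, so the scarce negative examples count for more during training.
    # Synthetic stand-in data, not the original 8500 observations.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, size=(6000, 4)),   # majority: setosa
                   rng.normal(2.0, 1.0, size=(2500, 4))])  # minority: non-setosa
    y = np.array([1] * 6000 + [0] * 2500)

    clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
    print(clf.predict(rng.normal(2.0, 1.0, size=(3, 4))))  # a few new non-setosa-like rows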