
I have one dataset and need to do cross-validation on the entire dataset, for example a 10-fold cross-validation. I would like to use a radial basis function (RBF) kernel with parameter selection (an RBF kernel has two parameters: C and gamma). Usually, people select the hyperparameters of an SVM on a dev set and then apply the best hyperparameters found on the dev set to the test set for evaluation. However, in my case the original dataset is partitioned into 10 subsets, and each subset in turn is tested with a classifier trained on the remaining 9 subsets. Obviously, there is no fixed training and test split. How should I do hyperparameter selection in this case?

Do you have "one dataset" or "10 subsets"? I'm not sure what you mean by "there is no fixed training and test split" - so do you have "one dataset" or not? – Christian Cerri
@Christian Cerri: I have one dataset, which is then partitioned into 10 subsets for 10-fold cross-validation. I would like to do cross-validation on my original dataset. – Deja Vu
Cross-validation automatically divides your set - e.g. 10-fold CV splits the data into 10 sets and uses 9 to predict 1, in all possible combinations. Have you tried making one dataset (or using the original undivided set) and running 10-fold CV on it? You don't need to divide your data yourself; something like LIBSVM does it already. – Christian Cerri
Maybe you mean that you want to reserve some data for final testing - in which case, use 90% of the original set for CV and then test on the reserved 10% using the hyperparameters found by CV. – Christian Cerri
I want to do cross-validation on the entire dataset, and I was wondering how to do parameter selection in this case. If I had two separate datasets (one training set and one test set, or one development set and one test set), I would tune the hyperparameters on the training or development set and then use the best parameters on the test set. But in the case of cross-validation, I do not know what the proper procedure is. – Deja Vu

1 Answer


Is your data partitioned into exactly those 10 subsets for a specific reason? If not, you could concatenate/shuffle them together again and then perform regular (repeated) cross-validation to do a parameter grid search. For example, using 10 partitions and 10 repeats gives a total of 100 training and evaluation sets. Every parameter set is trained and evaluated on all of them, so you get 100 results per parameter set you tried, from which the average performance per parameter set can then be computed.
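
To make those 100 training/evaluation sets concrete, here is a minimal sketch (an assumption on my part: it uses caret's createMultiFolds and the iris data as a stand-in for your dataset) that only generates the resampling indices; the caret call further below does all of this internally:

library(caret)

# 10-fold CV repeated 10 times -> 100 sets of training row indices
folds <- createMultiFolds(y = iris[, 5], k = 10, times = 10)
length(folds)  # 100 resamples, one per fold and repeat

# training rows of the first resample; its evaluation rows are the complement
train_rows <- folds[[1]]
eval_rows  <- setdiff(seq_len(nrow(iris)), train_rows)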

This process is already built into most ML tools, as in this short example in R using the caret library:

library(caret)
library(lattice)
library(doMC)
registerDoMC(3)  # use 3 CPU cores in parallel

model <- train(x = iris[, 1:4],
               y = iris[, 5],
               method = 'svmRadial',
               preProcess = c('center', 'scale'),
               tuneGrid = expand.grid(C = 3^(-3:3), sigma = 3^(-3:3)), # all combinations of these parameters get evaluated
               trControl = trainControl(method = 'repeatedcv',
                                        number = 10,
                                        repeats = 10,
                                        returnResamp = 'all', # store results of all parameter sets on all partitions and repeats
                                        allowParallel = TRUE))

# performance of the different parameter sets (e.g. average and standard deviation of performance)
print(model$results)
# visualization of the above
levelplot(x = Accuracy ~ C * sigma, data = model$results, col.regions = gray(100:0/100), scales = list(log = 3))
# results of all parameter sets over all partitions and repeats; the metrics above are computed from these
str(model$resample)

Once you have evaluated a grid of hyperparameters, you can choose a reasonable parameter set ("model selection"), e.g. by picking a model that performs well while still being reasonably simple.
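
As a small sketch of that selection step (assuming the model object from the example above): caret stores the numerically best parameter combination in model$bestTune, and trainControl has a selectionFunction argument if you prefer the simplest model within one standard error of the best one:

# parameter combination with the best average resampled performance
print(model$bestTune)

# alternative: let caret pick the simplest model within one standard error
# of the best one (the "one-standard-error" rule)
ctrl_1se <- trainControl(method = 'repeatedcv', number = 10, repeats = 10,
                         selectionFunction = 'oneSE')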

BTW: I would recommend repeated cross validation over cross validation if possible (eventually using more than 10 repeats, but details depend on your problem); and as @christian-cerri already recommended, having an additional, unseen test set that is used to estimate the performance of your final model on new data is a good idea.
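
A rough sketch of that last point (reusing the iris example from above; names like train_dat and test_dat are just illustrative): reserve a hold-out portion with createDataPartition before tuning, run the repeated-CV grid search on the rest, and touch the hold-out only once at the very end:

library(caret)
set.seed(42)

# reserve ~10% of the data as a final, untouched test set (stratified by class)
in_train  <- createDataPartition(iris[, 5], p = 0.9, list = FALSE)
train_dat <- iris[in_train, ]
test_dat  <- iris[-in_train, ]

# run the repeated-CV grid search from above on the 90% only
model <- train(x = train_dat[, 1:4],
               y = train_dat[, 5],
               method = 'svmRadial',
               preProcess = c('center', 'scale'),
               tuneGrid = expand.grid(C = 3^(-3:3), sigma = 3^(-3:3)),
               trControl = trainControl(method = 'repeatedcv',
                                        number = 10, repeats = 10))

# estimate the final model's performance once on the untouched 10%
predictions <- predict(model, newdata = test_dat[, 1:4])
confusionMatrix(predictions, reference = test_dat[, 5])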