
I want to create a synthetic dataset consisting of 2 classes and 3 features for testing a hyperparameter optimization technique for an SVM classifier with an RBF kernel. The hyperparameters are gamma and C (the cost).

I have created my current 3D synthetic dataset as follows:

  1. I have created 10 base points for each class by sampling from a multivariate normal distribution with mean (1,0,0) and (0,1,0), respectively, and unit variance (identity covariance).

  2. I have added more points to each class by picking a base point at random and then sampling a new point from a normal distribution with mean equal to the chosen base point and covariance I/5.
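The two steps above can be sketched in NumPy as follows. This is only a minimal illustration of the described procedure: the function name `make_dataset`, the number of extra points per class (`n_extra=200`), and the seed are assumptions, since the question does not state them.

```python
import numpy as np

def make_dataset(n_base=10, n_extra=200, seed=0):
    """Two-class, 3-feature dataset built from base points plus local samples.

    n_extra and seed are illustrative assumptions, not values from the question.
    """
    rng = np.random.default_rng(seed)
    class_means = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    X_parts, y_parts = [], []
    for label, mean in enumerate(class_means):
        # Step 1: base points ~ N(mean, I)
        base = rng.multivariate_normal(mean, np.eye(3), size=n_base)
        # Step 2: pick base points at random, then sample around each with covariance I/5
        chosen = rng.integers(0, n_base, size=n_extra)
        extra = base[chosen] + rng.normal(scale=np.sqrt(1.0 / 5.0), size=(n_extra, 3))
        X_parts.append(np.vstack([base, extra]))
        y_parts.append(np.full(n_base + n_extra, label))
    return np.vstack(X_parts), np.concatenate(y_parts)

X, y = make_dataset()
print(X.shape, np.bincount(y))
```

Each class is effectively a mixture of 10 Gaussians with randomly placed centers, so the class boundary depends on where the base points happened to land.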

It would be very useful if I could determine the best C and gamma from the dataset (before running the SVM), so that I can check whether my optimization technique finds the best parameters in the end.

Is it possible to calculate the best gamma and C parameters from the synthetic dataset described above?

Or else is there a way to create a synthetic dataset where the best gamma and C parameters are known?


1 Answer


Very interesting question, but the answer is no. The best values are completely data specific. Even if you know the distributions exactly, unless you have an infinite sample it is mathematically impossible to prove which C and gamma are best, because an SVM is ultimately a purely point-based method (as opposed to a density-estimation-based one).

The typical comparison is done in a different scenario: you take real data and fit the hyperparameters using other techniques, such as Gaussian processes (Bayesian optimization), which give you a baseline (and will probably reach the optimal C and gamma too, or at least get really close to them). Searching for the best C and gamma is not a complex problem, so simply run a good technique (like bayesopt) for a longer time and you will get optima to compare against.

Furthermore, remember that the task of hyperparameter optimization is not to find a particular C and gamma; it is to find hyperparameters yielding the best results. In fact, even for an SVM, there might be many sets of "optimal" C and gamma values, all yielding the same results (in terms of your finite dataset) despite being very far away from each other.
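The point about many near-equivalent optima can be checked empirically with a plain cross-validated grid search. The sketch below assumes scikit-learn is available and regenerates data in the spirit of the question; the grid ranges, the 200 extra points per class, the 5-fold split, and the 1% tolerance are all illustrative assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Data in the spirit of the question: 10 base points per class, then local samples.
X_parts, y_parts = [], []
for label, mean in enumerate([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]):
    base = rng.multivariate_normal(mean, np.eye(3), size=10)
    chosen = rng.integers(0, 10, size=200)
    extra = base[chosen] + rng.normal(scale=np.sqrt(0.2), size=(200, 3))
    X_parts.append(np.vstack([base, extra]))
    y_parts.append(np.full(210, label))
X, y = np.vstack(X_parts), np.concatenate(y_parts)

# Coarse logarithmic grid over C and gamma (ranges are an assumption).
param_grid = {"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-3, 2, 6)}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

# Collect every (C, gamma) pair whose mean CV accuracy is within 1% of the best.
scores = search.cv_results_["mean_test_score"]
best = scores.max()
near_best = [
    (p["C"], p["gamma"])
    for p, s in zip(search.cv_results_["params"], scores)
    if best - s <= 0.01
]
print(f"best CV accuracy: {best:.3f}")
print(f"{len(near_best)} of {len(scores)} grid points are within 1% of the best:")
for C, gamma in near_best:
    print(f"  C={C:g}, gamma={gamma:g}")
```

If several widely separated pairs show up in `near_best`, that is the situation described above: the optimizer under test should be judged on the score it reaches, not on whether it lands on one specific (C, gamma) pair.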