0
votes

I have just started familiarizing myself with SVMs and have the following questions regarding SVMs and, more specifically, kernels:

(1) If I understand it correctly, the decision boundary is always linear. Kernels are used, among other things, to map from the input space to a feature space where data that was previously not linearly separable may become linearly separable. If the decision boundary is always linear, though, how come some papers talk about "non-linear decision boundaries" (e.g. in "A User's Guide to Support Vector Machines" by Ben-Hur et al., page 3)?

(2) Is there a way to know which kernel to apply to which dataset, i.e. are there indications which kernel might lead to linear separability in the feature space?

(3) It is often stated that an advantage of using a kernel is reduced computational complexity. Given the map $\phi(x) = (x_1^2, \sqrt{2}x_1 x_2, x_2^2)$ for a two-dimensional vector $x$, the dot product $\phi(x)^T \phi(z) = (x_1^2, \sqrt{2}x_1 x_2, x_2^2)^T(z_1^2, \sqrt{2}z_1 z_2, z_2^2)$ can be written as the kernel $(x^T z)^2$. Is the computational advantage the reduced number of operations (e.g. multiplications) that have to be performed, and the fact that using the kernel means evaluating the dot product in the input space rather than in the feature space?
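To make (3) concrete, here is a small numpy check I used to convince myself that the feature-space dot product and the kernel give the same value (the vectors and helper names are just illustrative):

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel (2-D input)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly_kernel(x, z):
    """Kernel trick: (x^T z)^2, computed entirely in the 2-D input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))  # dot product in the 3-D feature space -> 121.0
print(poly_kernel(x, z))       # same value, with fewer multiplications -> 121.0
```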

(4) Is the reason why the kernel contains a scalar product of two input vectors that the weight vector can be written as a function of the input vectors?

Any help appreciated...


1 Answer

2
votes

(This should be on Cross Validated, but I'll answer anyway.)

1) The decision boundary is always linear in the feature (kernel) space, but when you map it back it is generally non-linear in the original input space. That is what those papers mean by "non-linear decision boundaries".

2) Without knowing the data, no. An RBF kernel often works really well, as it has a bandwidth parameter (usually called gamma or sigma) that can be tuned in a cross-validation loop. I'd recommend using it if you don't know the data beforehand and don't know what else to use.
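As a sketch of what I mean (assuming scikit-learn, a toy dataset, and an arbitrary parameter grid), tuning the RBF bandwidth by cross-validation looks something like this:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Tune the RBF bandwidth (gamma) and the regularization constant C by cross-validation.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```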

3) I have mixed thoughts about this. Using a kernel reduces the computational complexity (because of the kernel trick, see the theory) compared with computing the mapping by hand, but it does not reduce the complexity compared with a linear SVM (I wasn't sure which comparison you meant).

From what I recall, since finding the boundary relies only on dot products of the training data, instead of mapping everything by hand you only need to compute the kernel matrix; the rest is a standard quadratic optimization problem.
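Roughly like this, assuming scikit-learn and the degree-2 polynomial kernel from your question (the dataset is a toy example): you precompute the Gram matrix and the solver never touches the feature space.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

# Kernel (Gram) matrix of the training data: K[i, j] = (x_i^T x_j)^2.
K = np.dot(X, X.T) ** 2

clf = SVC(kernel="precomputed")
clf.fit(K, y)           # the solver only ever sees dot products
print(clf.score(K, y))  # training accuracy, evaluated on the same Gram matrix
```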

4) See (3)

I should consult my university notes for more detailed answers. Tell me if you need more.

EDIT: Answer to your comment.

(2) I meant knowing something about the data rather than knowing the data itself. If you know that the separation between the two classes is something like a circle (http://mikedeff.in/MLIntro.PNG), you know that your mapping should contain terms like x_1^2, x_2^2, etc.
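A minimal sketch of that idea, assuming scikit-learn's make_circles toy data: after squaring the coordinates, a plain linear SVM separates the two classes almost perfectly.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# In the original space no line separates the inner and outer rings, but after
# squaring the coordinates the inner class sits near the origin and the outer
# class far from it, so a linear boundary works.
X_mapped = np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2])

print(LinearSVC(max_iter=10000).fit(X, y).score(X, y))                # poor fit
print(LinearSVC(max_iter=10000).fit(X_mapped, y).score(X_mapped, y))  # near-perfect fit
```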

In fact, RBF kernels can express many kinds of separation between the classes through tuning of the parameter. (I may be wrong about that, but I have always used RBF kernels to get things working.)

(3) So the expression of the SVM decision function is something like:

y(x) = f(x_i'.x_j)

Since you have the dot product x_i'.x_j, you can apply a mapping phi(o), your non-linear function. You then have the kernel K(o_1, o_2) = phi(o_1)'.phi(o_2), and you get:

y(x) = f(K(x_i, x_j)) 

So if you use, for example, a Gaussian kernel K(o_1, o_2) = exp( -(o_1 - o_2)' (o_1 - o_2) / sigma ), you don't have to compute phi(o) or the dot product between phi(x_i) and phi(x_j) (that's what I meant by "by hand"), because the dot product is implicit in the expression of the kernel. That is thus less costly. You were right.
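Something like this (the value of sigma and the sample vectors are arbitrary):

```python
import numpy as np

def gaussian_kernel(o1, o2, sigma=1.0):
    """Evaluate the RBF kernel directly in the input space; no feature map needed."""
    diff = o1 - o2
    return np.exp(-np.dot(diff, diff) / sigma)

x_i = np.array([1.0, 2.0])
x_j = np.array([2.0, 0.5])
print(gaussian_kernel(x_i, x_j))  # the implicit feature space here is infinite-dimensional
```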

(4) It is derived from the expression of y(x) = ... In fact, when you use the dot product it is to measure the similarity between two objects (x_i and x_j). You can use a kernel with any method that relies on a dot product (like PCA, ...).
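For example, kernel PCA does exactly this replacement; here is a quick scikit-learn sketch (the kernel choice and gamma value are arbitrary):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Same kernel trick, different algorithm: kernel PCA replaces the dot products
# inside ordinary PCA with RBF kernel evaluations.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=5.0)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (200, 2)
```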