2
votes

Recently, I have been going through lectures and texts, trying to understand how SVMs enable us to work in higher-dimensional spaces.

In normal logistic regression, we use the features as they are, but in SVMs we can use a mapping that gives us a non-linear decision boundary.

Normally we work directly with the features, but with the help of the kernel trick we can find relations in the data using squares of the features, products between them, etc. Is this correct?

We do this with the help of a kernel.

Now, I understand that a polynomial kernel corresponds to a known feature vector, but I am unable to understand what feature vector the Gaussian kernel corresponds to (I am told it is infinite-dimensional, but what is it?).

Also, I am unable to grasp the idea that a kernel is a measure of similarity between training examples. How does this fit into how SVMs work?

I have spent a lot of time trying to understand these points, but in vain. Any help would be much appreciated!

Thanks in advance :)

2
A kernel is just an operation that must satisfy some predefined properties (I don't want to list them; you can find them yourself). In the linear case the kernel is the dot product; in the nonlinear case it is replaced by (let's say) a Gaussian kernel. The dot product is a measure of similarity in some sense too, because the dot product of two vectors grows as the angle between them decreases. - Ibraim Ganiev
Why do we need a similarity measure when we're using an SVM? From what I've understood, with a kernel we can find non-linear decision boundaries by using a higher-dimensional feature vector. - Sridhar Thiagarajan
Similarity is an inverse of distance. For linear cases, the distance function is simple Pythagorean distance, implemented with linear vector operations. The "kernel trick" applies a non-linear distance function. Another way to think of it is that the kernel trick searches for a distance metric that will transform the space to where the separating hyperplane is linear. - Prune
One way to do this is to add a dimension to the space, and then search for a function which will place all the "plus" points on the positive side of this new dimension, and all the "minus" points on the negative side. The Gaussian process is especially good at finding the function that will handle this search. - Prune

2 Answers

1
votes

Normally we work directly with the features, but with the help of the kernel trick we can find relations in the data using squares of the features, products between them, etc. Is this correct?

Even when using a kernel, you still work with features; you can simply exploit more complex relations between them. As in your example, a polynomial kernel gives you access to low-degree polynomial relations between features (such as squares or products of features).
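To see why the trick matters: the kernel value can be computed in O(d) time without ever building the expanded feature vector, whose dimension grows combinatorially. A small sketch (numpy; the dimension count uses the standard stars-and-bars formula for the number of degree-5 monomials, and the input values are arbitrary):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
d, degree = 100, 5
x, y = rng.normal(size=d), rng.normal(size=d)

# Kernel trick: evaluate the degree-5 homogeneous polynomial kernel
# K(x, y) = (x . y)^5 with a single dot product.
k = np.dot(x, y) ** degree

# The explicit feature space (all degree-5 monomials in 100 variables)
# would have about 92 million dimensions.
n_features = comb(d + degree - 1, degree)
print(n_features)  # 91962520
```

So the SVM can "work in" this enormous space while only ever touching d-dimensional vectors.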

Now, I understand that a polynomial kernel corresponds to a known feature vector, but I am unable to understand what feature vector the Gaussian kernel corresponds to (I am told it is infinite-dimensional, but what is it?).

The Gaussian kernel maps your feature vector to an unnormalized Gaussian probability density function. In other words, you map each point into a space of functions, where your point becomes a Gaussian centered at that point (with variance corresponding to the hyperparameter gamma of the Gaussian kernel). A kernel is always a dot product between vectors. In particular, in the function space L2 we define the classic dot product as an integral over the product, so

<f, g> = integral f(x) * g(x) dx

where f and g are Gaussian densities.

Conveniently, for two Gaussian densities the integral of their product is again a Gaussian (as a function of the distance between their centers), which is why the Gaussian kernel looks so similar to the pdf of the Gaussian distribution.
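This can be checked numerically. A minimal sketch (numpy only; the means a, b and variance sigma are arbitrary illustrative choices): the L2 inner product of two equal-variance Gaussian densities matches the closed form exp(-(a - b)^2 / (4 sigma^2)) / (2 sigma sqrt(pi)), which is itself a Gaussian in the distance a - b.

```python
import numpy as np

# Two 1-d Gaussian densities with equal variance, centred at a and b.
a, b, sigma = 0.5, 2.0, 1.0
x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]

f = np.exp(-(x - a)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
g = np.exp(-(x - b)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

numeric = np.sum(f * g) * dx  # <f, g> in L2, by Riemann sum
# Closed form: a Gaussian in the distance between the centres.
closed = np.exp(-(a - b)**2 / (4 * sigma**2)) / (2 * sigma * np.sqrt(np.pi))
print(np.isclose(numeric, closed))  # True
```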

Also, I am unable to grasp the idea that a kernel is a measure of similarity between training examples. How does this fit into how SVMs work?

As mentioned before, a kernel is a dot product, and a dot product can be seen as a measure of similarity (it is maximized when two vectors point in the same direction). However, it does not work the other way around: you cannot use just any similarity measure as a kernel, because not every similarity measure is a valid dot product.
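One concrete consequence of being a valid dot product: the Gram matrix built from any set of points must be symmetric and positive semi-definite. A quick numpy check for the Gaussian kernel (the data and gamma are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
gamma = 0.5

# Gaussian (RBF) Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-gamma * sq_dists)

# A valid kernel must give a symmetric positive semi-definite Gram matrix.
eigvals = np.linalg.eigvalsh(K)
print(np.allclose(K, K.T), eigvals.min() >= -1e-10)
```

An arbitrary similarity score plugged into the same construction can produce negative eigenvalues, which is exactly why it would fail as a kernel.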

0
votes

Just a bit of introduction about SVMs before I start answering the question; this will help you get an overview. The SVM's task is to find the margin-maximizing hyperplane that best separates the data. The soft-margin formulation of the SVM is also known as the primal form, and its equivalent is the dual form; it is the dual form that makes use of the kernel trick.

The kernel trick partially replaces feature engineering, which is the most important step in machine learning when we have datasets that are not linearly separable (e.g., datasets shaped like concentric circles).

You can transform such a dataset from non-linear to linear either by feature engineering (FE) or by the kernel trick. With FE, you can square each feature in this dataset; it then becomes linearly separable, and you can apply techniques like logistic regression, which work best on linear data.
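The concentric-circles claim can be verified in a few lines. A sketch (numpy only; the radii 1 and 3 are arbitrary): after squaring the features, the sum x1^2 + x2^2 equals r^2, so a simple threshold on that sum separates the two classes even though no straight line separates the original points.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two concentric circles: class 0 at radius 1, class 1 at radius 3.
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.where(np.arange(200) < 100, 1.0, 3.0)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = (np.arange(200) >= 100).astype(int)

# Squaring each feature: x1^2 + x2^2 = r^2, so the classes
# separate at a threshold on the sum of the squared features.
Z = X**2
separable = Z[y == 0].sum(axis=1).max() < Z[y == 1].sum(axis=1).min()
print(separable)  # True
```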

With the kernel trick you can use the polynomial kernel, whose general form is (a + x_i(transpose)x_j)^d, where a and d are constants and d specifies the degree; if the degree is 2 we call it quadratic, and so on. Say we apply d = 2, so our kernel becomes (a + x_i(transpose)x_j)^2, and suppose we have 2 features in our original dataset (e.g., the vector for x_1 is [x_11, x_12] and the vector for x_2 is [x_21, x_22]). When we apply the polynomial kernel, each example is implicitly mapped to a 6-dimensional vector, so we have transformed the features from 2-d to 6-d. Intuitively, the higher the dimension of the feature space, the better an SVM can work, because it becomes easier to separate the data with a hyperplane. In fact, if you have high-dimensional data, SVMs are a good choice.
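The 2-d to 6-d mapping described above can be written out and checked directly. A minimal sketch (numpy, taking a = 1; the sqrt(2) scalings are what make the dot product of the explicit 6-d features equal the kernel value, and the input values are arbitrary):

```python
import numpy as np

def phi(v):
    # Explicit 6-d feature map for the degree-2 polynomial kernel
    # K(x, y) = (1 + x . y)^2 on 2-d inputs.
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)

k_implicit = (1.0 + np.dot(x, y)) ** 2     # kernel trick, no expansion
k_explicit = np.dot(phi(x), phi(y))        # same value via the 6-d map
print(np.isclose(k_implicit, k_explicit))  # True
```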

Both the kernel trick and feature engineering can transform the dataset (the concentric-circles one, say), but the difference is that we do FE explicitly, while the kernel trick comes implicitly with the SVM. There is also a general-purpose kernel known as the radial basis function (RBF) kernel, which you can use when you don't know which kernel to pick in advance. The RBF kernel has a parameter (sigma); if sigma is set to 1, you get a curve that looks like a Gaussian.

You can consider a kernel as a similarity measure: the smaller the distance between two points, the higher their similarity.
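A tiny illustration of that inverse relationship, using the Gaussian (RBF) kernel with an arbitrarily chosen gamma: as the distance between two points grows, the kernel value falls off monotonically toward zero.

```python
import numpy as np

gamma = 0.5

# RBF similarity k(x, y) = exp(-gamma * ||x - y||^2) as a function
# of the distance between the two points.
dists = np.array([0.0, 1.0, 2.0, 4.0])
sims = np.exp(-gamma * dists**2)
print(np.all(np.diff(sims) < 0))  # True: farther apart => less similar
```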