2
votes

Recently, I have been going through lectures and texts, trying to understand how SVMs enable us to work in higher-dimensional spaces.

In normal logistic regression, we use the features as they are, but in SVMs we can use a mapping that gives us a non-linear decision boundary.

Normally we work directly with the features, but with the help of the kernel trick we can find relations in the data using squares of the features, products between them, etc. Is this correct?

We do this with the help of a kernel.

Now, I understand that a polynomial kernel corresponds to a known feature vector, but I am unable to understand what feature vector the Gaussian kernel corresponds to (I am told it is infinite-dimensional, but what is it?).

Also, I am unable to grasp the idea that a kernel is a measure of similarity between training examples. How does this fit into how SVMs work?

I have spent a lot of time trying to understand these points, but in vain. Any help would be much appreciated!

Thanks in advance :)

2
A kernel is just an operation that must satisfy some predefined properties (I don't want to list them; you can find them yourself). In the linear case the kernel is the dot product; in the nonlinear case it is replaced by (let's say) a Gaussian kernel. The dot product is a measure of similarity in some sense too, because the dot product of two vectors grows as the angle between them decreases. - Ibraim Ganiev
Why do we need a similarity measure when we're using an SVM? From what I've understood, with a kernel we can find non-linear decision boundaries by using a higher-dimensional feature vector. - Sridhar Thiagarajan
Similarity is an inverse of distance. For linear cases, the distance function is simple Pythagorean distance, implemented with linear vector operations. The "kernel trick" applies a non-linear distance function. Another way to think of it is that the kernel trick searches for a distance metric that will transform the space to where the separating hyperplane is linear. - Prune
One way to do this is to add a dimension to the space, and then search for a function which will place all the "plus" points on the positive side of this new dimension, and all the "minus" points on the negative side. The Gaussian process is especially good at finding the function that will handle this search. - Prune

2 Answers

1
votes

Normally we work directly with the features, but with the help of the kernel trick we can find relations in the data using squares of the features, products between them, etc. Is this correct?

Even when using a kernel, you still work with features; you can simply exploit more complex relations between them. As in your example, a polynomial kernel gives you access to low-degree polynomial relations between features (such as squares or products of features).
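To see why the trick matters: the kernel value can be computed in O(d) time without ever building the expanded feature vector, whose dimension grows combinatorially. A small sketch (numpy; the dimension count uses the standard stars-and-bars formula for the number of degree-5 monomials, and the input values are arbitrary):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
d, degree = 100, 5
x, y = rng.normal(size=d), rng.normal(size=d)

# Kernel trick: evaluate the degree-5 homogeneous polynomial kernel
# K(x, y) = (x . y)^5 with a single dot product.
k = np.dot(x, y) ** degree

# The explicit feature space (all degree-5 monomials in 100 variables)
# would have about 92 million dimensions.
n_features = comb(d + degree - 1, degree)
print(n_features)  # 91962520
```

So the SVM can "work in" this enormous space while only ever touching d-dimensional vectors.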

Now, I understand that a polynomial kernel corresponds to a known feature vector, but I am unable to understand what feature vector the Gaussian kernel corresponds to (I am told it is infinite-dimensional, but what is it?).

The Gaussian kernel maps your feature vector to an unnormalized Gaussian probability density function. In other words, you map each point into a space of functions, where your point becomes a Gaussian centered at that point (with variance corresponding to the hyperparameter gamma of the Gaussian kernel). A kernel is always a dot product between vectors. In particular, in the function space L2 we define the classic dot product as an integral over the product, so

<f, g> = integral f(x) * g(x) dx

where f and g are Gaussian densities.

Conveniently, for two Gaussian densities the integral of their product is again a Gaussian (as a function of the distance between their centers), which is why the Gaussian kernel looks so similar to the pdf of the Gaussian distribution.
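This can be checked numerically. A minimal sketch (numpy only; the means a, b and variance sigma are arbitrary illustrative choices): the L2 inner product of two equal-variance Gaussian densities matches the closed form exp(-(a - b)^2 / (4 sigma^2)) / (2 sigma sqrt(pi)), which is itself a Gaussian in the distance a - b.

```python
import numpy as np

# Two 1-d Gaussian densities with equal variance, centred at a and b.
a, b, sigma = 0.5, 2.0, 1.0
x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]

f = np.exp(-(x - a)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
g = np.exp(-(x - b)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

numeric = np.sum(f * g) * dx  # <f, g> in L2, by Riemann sum
# Closed form: a Gaussian in the distance between the centres.
closed = np.exp(-(a - b)**2 / (4 * sigma**2)) / (2 * sigma * np.sqrt(np.pi))
print(np.isclose(numeric, closed))  # True
```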

Also, I am unable to grasp the idea that a kernel is a measure of similarity between training examples. How does this fit into how SVMs work?

As mentioned before, a kernel is a dot product, and a dot product can be seen as a measure of similarity (it is maximized when two vectors point in the same direction). However, it does not work the other way around: you cannot use just any similarity measure as a kernel, because not every similarity measure is a valid dot product.
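One concrete consequence of being a valid dot product: the Gram matrix built from any set of points must be symmetric and positive semi-definite. A quick numpy check for the Gaussian kernel (the data and gamma are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
gamma = 0.5

# Gaussian (RBF) Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-gamma * sq_dists)

# A valid kernel must give a symmetric positive semi-definite Gram matrix.
eigvals = np.linalg.eigvalsh(K)
print(np.allclose(K, K.T), eigvals.min() >= -1e-10)
```

An arbitrary similarity score plugged into the same construction can produce negative eigenvalues, which is exactly why it would fail as a kernel.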

0
votes

Just a bit of introduction about SVMs before I start answering the question; this will help you get an overview. The SVM's task is to find the margin-maximizing hyperplane that best separates the data. The soft-margin formulation of the SVM is also known as the primal form, and its equivalent is the dual form; it is the dual form that makes use of the kernel trick.

The kernel trick partially replaces feature engineering, which is the most important step in machine learning when we have datasets that are not linearly separable (e.g., datasets shaped like concentric circles).

You can transform such a dataset from non-linear to linear either by feature engineering (FE) or by the kernel trick. With FE, you can square each feature in this dataset; it then becomes linearly separable, and you can apply techniques like logistic regression, which work best on linear data.
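The concentric-circles claim can be verified in a few lines. A sketch (numpy only; the radii 1 and 3 are arbitrary): after squaring the features, the sum x1^2 + x2^2 equals r^2, so a simple threshold on that sum separates the two classes even though no straight line separates the original points.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two concentric circles: class 0 at radius 1, class 1 at radius 3.
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.where(np.arange(200) < 100, 1.0, 3.0)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = (np.arange(200) >= 100).astype(int)

# Squaring each feature: x1^2 + x2^2 = r^2, so the classes
# separate at a threshold on the sum of the squared features.
Z = X**2
separable = Z[y == 0].sum(axis=1).max() < Z[y == 1].sum(axis=1).min()
print(separable)  # True
```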

With the kernel trick you can use the polynomial kernel, whose general form is (a + x_i(transpose)x_j)^d, where a and d are constants and d specifies the degree; if the degree is 2 we call it quadratic, and so on. Say we apply d = 2, so our kernel becomes (a + x_i(transpose)x_j)^2, and suppose we have 2 features in our original dataset (e.g., the vector for x_1 is [x_11, x_12] and the vector for x_2 is [x_21, x_22]). When we apply the polynomial kernel, each example is implicitly mapped to a 6-dimensional vector, so we have transformed the features from 2-d to 6-d. Intuitively, the higher the dimension of the feature space, the better an SVM can work, because it becomes easier to separate the data with a hyperplane. In fact, if you have high-dimensional data, SVMs are a good choice.
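The 2-d to 6-d mapping described above can be written out and checked directly. A minimal sketch (numpy, taking a = 1; the sqrt(2) scalings are what make the dot product of the explicit 6-d features equal the kernel value, and the input values are arbitrary):

```python
import numpy as np

def phi(v):
    # Explicit 6-d feature map for the degree-2 polynomial kernel
    # K(x, y) = (1 + x . y)^2 on 2-d inputs.
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)

k_implicit = (1.0 + np.dot(x, y)) ** 2     # kernel trick, no expansion
k_explicit = np.dot(phi(x), phi(y))        # same value via the 6-d map
print(np.isclose(k_implicit, k_explicit))  # True
```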

Both the kernel trick and feature engineering can transform the dataset (the concentric-circles one, say), but the difference is that we do FE explicitly, while the kernel trick comes implicitly with the SVM. There is also a general-purpose kernel known as the radial basis function (RBF) kernel, which you can use when you don't know which kernel to pick in advance. The RBF kernel has a parameter (sigma); if sigma is set to 1, you get a curve that looks like a Gaussian.

You can consider a kernel as a similarity measure: the smaller the distance between two points, the higher their similarity.
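A tiny illustration of that inverse relationship, using the Gaussian (RBF) kernel with an arbitrarily chosen gamma: as the distance between two points grows, the kernel value falls off monotonically toward zero.

```python
import numpy as np

gamma = 0.5

# RBF similarity k(x, y) = exp(-gamma * ||x - y||^2) as a function
# of the distance between the two points.
dists = np.array([0.0, 1.0, 2.0, 4.0])
sims = np.exp(-gamma * dists**2)
print(np.all(np.diff(sims) < 0))  # True: farther apart => less similar
```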