1) Could anyone explain why the dimensionality of the feature space after mapping corresponds to the differentiability of the kernel? I am not clear on this part.
It has nothing to do with being differentiable: the linear kernel is also infinitely differentiable, yet it does not map to any higher-dimensional space. Whoever told you that differentiability is the reason either misled you or did not understand the math behind it. The infinite dimensionality comes from the mapping
phi(x) = Nor(x, sigma^2)
in other words, you are mapping your point to a function, namely a Gaussian density centered at that point, which is an element of L^2, the infinite-dimensional space of square-integrable functions, where the scalar product is defined as the integral of the product of two functions, so
<f,g> = int f(a)g(a) da
and as such
<phi(x),phi(y)> = int Nor(x,sigma^2)(a)Nor(y,sigma^2)(a) da
= X exp(-(x-y)^2 / (4sigma^2) )
for some normalising constant X (which is completely unimportant here). In other words, the Gaussian kernel is a scalar product between two functions that live in an infinite-dimensional space.
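A quick numerical sanity check of this identity (not part of the original argument, just a sketch using numpy/scipy with arbitrarily chosen values of x, y and sigma) is to integrate the product of the two Gaussian densities and compare the result with X exp(-(x-y)^2 / (4 sigma^2)):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x, y, sigma = 0.3, 1.1, 0.7  # arbitrary example values

# <phi(x), phi(y)> = int Nor(x, sigma^2)(a) * Nor(y, sigma^2)(a) da
inner, _ = quad(lambda a: norm.pdf(a, loc=x, scale=sigma)
                          * norm.pdf(a, loc=y, scale=sigma),
                -np.inf, np.inf)

# X exp(-(x - y)^2 / (4 sigma^2)), with X = 1 / sqrt(4 pi sigma^2)
X = 1.0 / np.sqrt(4 * np.pi * sigma**2)
rbf = X * np.exp(-(x - y) ** 2 / (4 * sigma**2))

print(inner, rbf)  # the two values agree up to numerical integration error
```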
2) There are many non-linear kernels, such as the polynomial kernel, and I believe they are also able to map the data from a low-dimensional space to an infinite-dimensional space. But why is the RBF kernel more popular than they are?
The polynomial kernel maps into a feature space with O(d^p) dimensions, where d is the input space dimension and p is the polynomial degree, so it is far from infinite. Why is the Gaussian kernel popular? Because it works, is quite easy to use, and is fast to compute. From a theoretical point of view it also comes with a guarantee of being able to learn any arbitrary labelling of a finite set of points (provided the variance used is small enough).