3 votes

I am experimenting with different kinds of non-linear kernels and am trying to interpret the learned models, which led me to the following question: Is there a generic method for obtaining the primal weights of a non-linear support vector machine, similar to how this is possible for linear SVMs (see related question)?

Say you have three features a, b, c and a model trained with an all-subsets/polynomial kernel. Is there a way to extract the primal weights of those feature subsets, e.g., of a * b and a^2?


I've tried extending the method for linear kernels, where you generate predictions for the following samples (a code sketch of this probing idea follows the table):

 a, b, c
[0, 0, 0]
[1, 0, 0]
[0, 1, 0]
[0, 0, 1]
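
For the linear case this probing idea is easy to verify in code. Below is a minimal sketch assuming scikit-learn's SVC (the question does not name a library, so the data and names here are made up): since the decision function of a linear SVM is f(x) = w . x + b, probing with the zero vector and the unit vectors recovers the bias and each weight.

```python
# Minimal sketch (assumes scikit-learn and NumPy; toy data made up for illustration).
# For a linear SVM, decision_function(x) = w . x + b, so probing with the zero
# vector and the unit vectors recovers b and each w_i.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                 # columns play the role of a, b, c
y = (X[:, 0] + 2 * X[:, 1] - X[:, 2] > 0).astype(int)

clf = SVC(kernel="linear").fit(X, y)

probes = np.vstack([np.zeros(3), np.eye(3)])     # [0,0,0], [1,0,0], [0,1,0], [0,0,1]
f = clf.decision_function(probes)

bias = f[0]
weights = f[1:] - bias                           # w_i = f(e_i) - f(0)

print(weights)                                   # should match clf.coef_ for the linear kernel
print(clf.coef_)
```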

If I use the same approach for the all-subsets kernel, I can generate some more samples:

 a, b, c
[1, 1, 0]
[1, 0, 1]
...

Next, to calculate the primal weight of a * b, I analyse the predictions as follows: f([1, 1, 0]) - f([1, 0, 0]) - f([0, 1, 0]) + f([0, 0, 0]), where f([a, b, c]) denotes the model's output for that sample.
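
To make that concrete, here is a hedged sketch of the same inclusion-exclusion probe for a pure degree-2 polynomial kernel, again assuming scikit-learn; the kernel parameters and toy data are assumptions for illustration only.

```python
# Sketch of the inclusion-exclusion probe for a degree-2 polynomial kernel
# (assumes scikit-learn; data and parameters are illustrative). For degree 2 the
# decision function is an ordinary quadratic in the inputs, so the bias and the
# single-feature terms cancel out of the difference below, leaving the a*b weight.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0).fit(X, y)

def f(x):
    """Decision value for a single probe sample."""
    return clf.decision_function(np.atleast_2d(x))[0]

e = np.eye(3)
zero = np.zeros(3)

# weight of a*b:  f([1,1,0]) - f([1,0,0]) - f([0,1,0]) + f([0,0,0])
w_ab = f(e[0] + e[1]) - f(e[0]) - f(e[1]) + f(zero)
print(w_ab)
```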

The problem I see with this is that it requires a prohibitive number of samples, does not address subsets such as a^2, and does not generalise to other non-linear kernels.


2 Answers

5 votes

No. I don't claim to be the end-all-be-all expert on this, but I've done a lot of reading and research on SVMs, and I do not think what you are asking for is possible in general. Sure, in the case of the 2nd-degree polynomial kernel you can enumerate the feature space induced by the kernel if the number of attributes is very small, but for higher-order polynomial kernels and larger numbers of attributes this quickly becomes intractable.
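
For concreteness, this is what that enumeration looks like for one common (inhomogeneous) form of the 2nd-degree polynomial kernel on the asker's three attributes; for the inhomogeneous case, a degree-d kernel over n attributes induces exactly binom(n+d, d) features, which is why the enumeration blows up.

```latex
% Explicit feature map for the inhomogeneous degree-2 polynomial kernel on
% three attributes a, b, c (one common parameterization, shown for illustration):
\[
  k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z} + 1)^2
                            = \phi(\mathbf{x})^\top \phi(\mathbf{z}),
\qquad
  \phi(a, b, c) = \bigl(1,\ \sqrt{2}a,\ \sqrt{2}b,\ \sqrt{2}c,\
                        a^2,\ b^2,\ c^2,\
                        \sqrt{2}ab,\ \sqrt{2}ac,\ \sqrt{2}bc\bigr).
\]
% A degree-d inhomogeneous polynomial kernel over n attributes induces
% \binom{n+d}{d} such features.
```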

The power of the non-linear SVM is that it is able to induce feature spaces without having to do computation in that space, and in fact without ever knowing what that feature space is. Some kernels (the Gaussian/RBF kernel, for example) even induce an infinite-dimensional feature space.

If you look back at your question, you can see part of the issue - you are looking for the primal weights. However, the kernel is introduced in the dual form, where the data only ever shows up inside dot products. Mathematically reversing this process would involve breaking the kernel function apart - knowing the mapping function from input space to feature space. Kernel functions are powerful precisely because we do not need to know this mapping. Of course it can be done for linear kernels, because no non-trivial mapping function is used.
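
To spell that out in the standard dual-form notation (nothing here is specific to the asker's kernel):

```latex
% Dual-form SVM decision function: prediction only ever needs kernel values
% k(x_i, x); the primal weight vector w lives in the (possibly huge or
% infinite-dimensional) feature space and is never materialized.
\[
  f(\mathbf{x}) = \sum_{i \in \mathrm{SV}} \alpha_i y_i \, k(\mathbf{x}_i, \mathbf{x}) + b,
\qquad
  \mathbf{w} = \sum_{i \in \mathrm{SV}} \alpha_i y_i \, \phi(\mathbf{x}_i).
\]
```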

1 vote

Extracting the weights of the explicit features is generally not computationally feasible, but a decent next best thing is the pre-image: generating a sample z whose features correspond to the weights you're after.

This can be described formally as finding z such that phi(z) = w, where the weights are implicitly defined as a combination of training samples, as usual with the kernel trick: w = sum_i(alpha_i * phi(x_i)). Here phi is the feature map.

In general an exact pre-image does not exist, and most methods instead find the z that minimizes the squared error ||phi(z) - w||^2, which the kernel trick again makes computable without knowing phi explicitly.
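
As an illustration of what such a method can look like (this is not one of the algorithms from the reference below, but the well-known fixed-point update for Gaussian kernels from the kernel-PCA denoising literature), here is a minimal NumPy sketch; X, alpha and sigma are placeholder names, and with the mixed-sign alpha of an SVM expansion the iteration can be unstable.

```python
# Minimal sketch (NumPy only) of the classical fixed-point iteration for
# Gaussian-kernel pre-images: find z approximately minimizing ||phi(z) - w||^2
# with w = sum_i alpha_i * phi(x_i).  Names (X, alpha, sigma) are placeholders.
import numpy as np

def gaussian_preimage(X, alpha, sigma, n_iter=100, tol=1e-8):
    """X: (n, d) expansion samples, alpha: (n,) expansion coefficients."""
    z = X[np.argmax(np.abs(alpha))].copy()       # crude but common initialisation
    for _ in range(n_iter):
        w = alpha * np.exp(-np.sum((X - z) ** 2, axis=1) / (2 * sigma ** 2))
        denom = w.sum()
        if abs(denom) < 1e-12:                   # mixed-sign alpha can make this blow up
            break
        z_new = (w[:, None] * X).sum(axis=0) / denom
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z
```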

A good description of the classical Gauss-Newton iteration for pre-images of Gaussian kernels, as well as another more general method based on KPCA, is in:

James T. Kwok, Ivor W. Tsang, "The Pre-Image Problem in Kernel Methods", ICML 2003

Direct link: http://machinelearning.wustl.edu/mlpapers/paper_files/icml2003_KwokT03a.pdf