21
votes

The meat of my question is "how does one design a kernel function for a learning problem?"

As a quick background, I'm reading books on support vector machines and kernel machines, and everywhere I look authors give examples of kernels (polynomial kernels both homogeneous and nonhomogeneous, gaussian kernels, and allusions to text-based kernels to name a few), but all either provide pictures of the results without specifying the kernel, or vaguely claim that "an efficient kernel can be constructed". I'm interested in the process that goes on when one designs a kernel for a new problem.

Probably the easiest example is learning XOR, the smallest (4-point) non-linear data set embedded in the real plane. How would one come up with a natural (and non-trivial) kernel to linearly separate this data?

As a more complex example (see Cristianini, Introduction to SVMs, figure 6.2), how would one design a kernel to learn a checkerboard pattern? Cristianini states the picture was derived "using Gaussian kernels" but it seems that he uses multiple, and they are combined and modified in an unspecified way.

If this question is too broad to answer here, I'd appreciate a reference to the construction of one such kernel function, though I'd prefer the example be somewhat simple.

4
Have you implemented an XOR logic gate with SVM? - user11528241

4 Answers

9
votes

Q: "How does one design a kernel function for a learning problem?"

A: "Very carefully"

Trying the usual suspects (linear, polynomial, RBF) and using whichever works best really is sound advice for someone trying to get the most accurate predictive model they can. For what it's worth, it's a common criticism of SVMs that they seem to have a lot of parameters you need to tune empirically, so at least you're not alone.
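As a rough illustration of that empirical approach (a sketch only; scikit-learn and the toy moons dataset are my assumptions, not part of the advice above):

    from sklearn.datasets import make_moons
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Placeholder non-linear dataset; substitute your own X, y.
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # Try the usual suspects and keep whichever cross-validates best.
    for kernel in ("linear", "poly", "rbf"):
        scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
        print(kernel, scores.mean())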

If you really want to design a kernel for a specific problem then you are right, it is a machine learning problem all in itself. It's called the 'model selection problem'. I'm not exactly an expert myself here, but the best source of insight into kernel methods for me was the book 'Gaussian Processes for Machine Learning' by Rasmussen and Williams (it's freely available online), particularly chapters 4 and 5. I'm sorry that I can't say much more than 'read this huge book full of maths', but it's a complicated problem and they do a really good job of explaining it.

6
votes

(For anyone not familiar with the use of kernel functions in machine learning: a kernel implicitly maps the input vectors (the data points that comprise the data set) into a higher-dimensional space, aka the "feature space", by computing inner products in that space. The SVM then finds a separating hyperplane with the maximal margin (the distance between the hyperplane and the support vectors) in this transformed space.)

Well, start with kernels that are known to work with SVM classifiers to solve the problem of interest. In this case, we know that the RBF (radial basis function) kernel with a trained SVM cleanly separates XOR. You can write an RBF function in Python this way:

    import numpy as np

    def rbf(x, y, gamma):
        return np.exp(-gamma * np.abs(x - y)**2)

In which gamma is 1/(number of features), i.e., one over the number of columns in the data set, and x and y are a Cartesian pair of points.

(A radial basis function implementation is also available in SciPy as scipy.interpolate.Rbf.)
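For what it's worth, here is a minimal sketch of that claim on the 4-point XOR set; scikit-learn's SVC is my assumption here (its gamma='auto' setting corresponds to 1/number-of-features):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
    y = np.array([0, 1, 1, 0])                    # XOR labels

    clf = SVC(kernel='rbf', gamma='auto').fit(X, y)
    print(clf.predict(X))                         # expected: [0 1 1 0]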

Second, if what you are after is not just using available kernel functions to solve classification/regression problems, but instead you want to build your own, I would suggest first studying how the choice of kernel function and the parameters inside those functions affect classifier performance. The small group of kernel functions in common use with SVM/SVC is the best place to start (a sketch of plugging a hand-written kernel into an SVC follows the list). This group comprises (aside from RBF):

  • linear kernel

  • polynomial

  • sigmoid
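If you do get to rolling your own, here is a minimal sketch of plugging a hand-written kernel into an SVC; scikit-learn accepts a callable that returns the Gram matrix, and the toy data and the kernel itself are just illustrations:

    import numpy as np
    from sklearn.svm import SVC

    def my_poly_kernel(A, B, degree=2, coef0=1.0):
        # Non-homogeneous polynomial kernel, returned as a full Gram matrix.
        return (A @ B.T + coef0) ** degree

    X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])

    clf = SVC(kernel=my_poly_kernel).fit(X, y)
    print(clf.predict(X))                         # expected: [0 1 1 0]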

1
votes

My approach would be to study the data: how would I separate the points in the XOR problem? When I started studying ML in general, and SVMs in particular, that is what I did: take a toy problem, draw it by hand, and try to separate the classes.

When I looked at the XOR problem the first time, it occurred to me that both purple points (below, left) have X and Y of the same sign, both negative in one case and both positive in the other, whereas both green points have X and Y of opposite signs. Therefore |X + Y|, the square root of the squared sum, would be 0 (or very small with a bit of noise in the initial problem) for the green points, and 2 (or nearly 2) for the purple ones. Hence, adding a third coordinate Z = np.sqrt(np.square(X + Y)) will nicely separate the two sets:

[Figures: the XOR points in 3D, before and after adding the Z coordinate]

On a side note, Z is not too dissimilar a formulation from doug's rbf if you consider that np.sqrt(np.square(X + Y)) is essentially the same as np.abs(X + Y) in this case.
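A minimal NumPy sketch of that explicit third coordinate on the ±1 XOR points (plotting omitted):

    import numpy as np

    pts = np.array([[-1, -1], [1, 1],    # purple points
                    [-1, 1], [1, -1]])   # green points

    X, Y = pts[:, 0], pts[:, 1]
    Z = np.sqrt(np.square(X + Y))        # |X + Y|

    print(Z)    # [2. 2. 0. 0.] -> a plane at Z = 1 separates the classes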

I do not have access to Cristianini's book, but I'd approach that problem in a similar way too, starting with a toy version (by the way, checkerboard code thanks to none other than doug):

[Figure: toy checkerboard]

A possible intuition here is that the sum of the row and column indices for the black squares is always even, whereas for the white squares it is always odd, so adding something like (row_index + col_index) % 2 as a third dimension would do the trick in this simple version (a small sketch of this follows below). In a larger, more complex checkerboard dataset, like this one I found on the web:

[Figure: a larger checkerboard dataset, similar to Cristianini's]

things aren't so simple, but perhaps one could cascade clustering to find the 16 clusters' mean X and Y locations (perhaps using medoids clustering), then apply a version of the "modulo kernel trick"?
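Going back to the simple toy checkerboard, a minimal sketch of that modulo feature (the 4x4 grid, the random points, and the use of a linear SVC are my assumptions for illustration):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    pts = rng.uniform(0, 4, size=(200, 2))        # random points on a 4x4 board
    labels = (pts.astype(int).sum(axis=1)) % 2    # colour of each point's square

    # Adding the modulo feature as a third coordinate makes the classes
    # linearly separable: Z is 0 for one colour and 1 for the other.
    Z = (np.floor(pts[:, 0]) + np.floor(pts[:, 1])) % 2
    pts3d = np.column_stack([pts, Z])

    print(SVC(kernel='linear').fit(pts3d, labels).score(pts3d, labels))  # 1.0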

With the disclaimer that I have not worked with a ton of classification problems, so far I have found that by making a toy version of a complex one, I usually gain a 'numerical' intuition as to the kind of solution that might work.

Finally, as posted in a comment to doug's answer, I don't find anything wrong with an empirical approach like his: studying the performance of all possible kernels by passing them to a grid search in nested cross-validation with the same algorithm (SVC), changing only the kernel. You could add to that approach by plotting the respective margins in the transformed feature spaces: for example, for rbf, using the equation suggested by doug (and Sebastian Raschka's routine for plotting decision regions - cell 13 here).
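That nested setup might look roughly like this, assuming scikit-learn (the toy dataset is a placeholder):

    from sklearn.datasets import make_circles
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=0)

    # Inner loop picks the kernel (and C); outer loop estimates its performance.
    param_grid = {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
                  'C': [0.1, 1, 10]}
    inner = GridSearchCV(SVC(), param_grid, cv=3)
    print(cross_val_score(inner, X, y, cv=5).mean())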

UPDATE October 27, 2017: In a conversation on my Slack channel, another geophysicist asked me about the case in which the XOR gate is designed with 0s and 1s, as opposed to -1s and 1s (the latter being similar to a classic problem in exploration geophysics, hence my initial toy example).

If I were to tackle the XOR gate with 0s and 1s, and did not have the knowledge about the rbf kernel at my disposal, in this case too I'd sit and look at the problem in terms of the coordinates of those points and see if I could come up with a transformation.

[Figure: the XOR gate with 0/1 coordinates]

My first observation here was that the Os sit on the x = y line and the Xs on the x + y = 1 line, so the difference X - Y would be 0 (or small with a bit of noise) in one case, and +/-1 in the other. The absolute value takes care of the sign, hence Z = np.abs(X - Y) would work. Which, by the way, is very similar to doug's rbf = np.exp(-gamma * np.abs(x - y)**2) (another reason to upvote his answer); and in fact, his rbf is a more general solution, working in all XOR cases.
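A minimal sketch of that transformation on the 0/1 XOR gate:

    import numpy as np

    pts = np.array([[0, 0], [1, 1],    # the Os
                    [0, 1], [1, 0]])   # the Xs

    X, Y = pts[:, 0], pts[:, 1]
    Z = np.abs(X - Y)

    print(Z)    # [0 0 1 1] -> a plane at Z = 0.5 separates the two classes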

0
votes

I was looking for some worked examples of polynomial kernels and stumbled on this post. A couple of things that might help if you are still looking: this toolkit (http://www2.fml.tuebingen.mpg.de/raetsch/projects/shogun) uses multiple kernel learning, where you can choose from a wide selection of kernel methods and the learning will then choose the best one for the problem, so you don't have to.

An easier, more traditional method for choosing a kernel is to use cross-validation with different kernel methods and pick the one that performs best.

Hope this helps you or anyone else reading around kernel methods.