2
votes

I have this confusion related to kernel svm. I read that with kernel svm the number of support vectors retained is large. That's why it is difficult to train and is time consuming. I didn't get this part why is it difficult to optimize. Ok I can say that noisy data requires large number of support vectors. But what does it have to do with the training time.

Also I was reading another article where they were trying to convert non linear SVM kernel to linear SVM kernel. In the case of linear kernel it is just the dot product of the original features themselves. But in the case of non linear one it is RBF and others. I didn't get what they mean by "manipulating the kernel matrix imposes significant computational bottle neck". As far as I know, the kernel matrix is static isn't it. For linear kernel it is just the dot product of the original features. In the case of RBF it is using the gaussian kernel. So I just need to calculate it once, then I am done isn't it. So what's the point of manipulating and the bottleneck thinkg

Support Vector Machine (SVM) (Cortes and Vap- nik, 1995) as the state-of-the-art classification algo- rithm has been widely applied in various scientific do- mains. The use of kernels allows the input samples to be mapped to a Reproducing Kernel Hilbert S- pace (RKHS), which is crucial to solving linearly non- separable problems. While kernel SVMs deliver the state-of-the-art results, the need to manipulate the k- ernel matrix imposes significant computational bottle- neck, making it difficult to scale up on large data.

1

1 Answers

3
votes

It's because the kernel matrix is a matrix that is N rows by N columns in size where N is the number of training samples. So imagine you have 500,000 training samples, then that would mean the matrix needs 500,000*500,000*8 bytes (1.81 terabytes) of RAM. This is huge and would require some kind of parallel computing cluster to deal with in any reasonable way. Not to mention the time it takes to compute each element. For example, if it took your computer 1 microsecond to compute 1 kernel evaluation then it would take 69.4 hours to compute the entire kernel matrix. For comparison, a good linear solver can handle a problem of this size in a few minutes or an hour on a regular desktop workstation. So that's why linear SVMs are preferred.

To understand why they are so much faster you have to take a step back and think about how these optimizers work. At the highest level you can think of them as searching for a function that gives the correct outputs on all the training samples. Moreover, most solvers are iterative in the sense that they have a current best guess at what this function should be and in each iteration they test it on the training data and see how good it is. Then they update the function in some way to improve it. They keep doing this until they find the best function.

Keeping this in mind, the main reason why linear solvers are so fast is because the function they are learning is just a dot product between a fixed size weight vector and a training sample. So in each iteration of the optimization it just needs to compute the dot product between the current weight vector and all the samples. This takes O(N) time, moreover, good solvers converge in just a few iterations regardless of how many training samples you have. So the working memory for the solver is just the memory required to store the single weight vector and all the training samples. This means the entire process takes only O(N) time and O(N) bytes of RAM.

A non-linear solver on the other hand is learning a function that is not just a dot product between a weight vector and a training sample. In this case, it is a function that is the sum of a bunch of kernel evaluations between a test sample and all the other training samples. So in this case, just evaluating the function you are learning against one training sample takes O(N) time. Therefore, to evaluate it against all training samples takes O(N^2) time. There have been all manner of clever tricks devised to try and keep the non-linear function compact to speed this up. But all of them are at least a little bit heuristic or approximate in some sense while good linear solvers find exact solutions. So that's part of the reason for the popularity of linear solvers.