5
votes

Although it wasn't originally conceived this way, the standard Adaboost algorithm is equivalent to conducting a forward stagewise additive model estimation using an exponential loss function. That is, given some weak classifiers c1,...,cM, and sample points x1,...,xN the weights coming out of the algorithm:

  1. Set F_0(x) = 0
  2. For m = 1 to M: Set (w_m, f_m ) = arg min over (w, c_i) of (Loss function(y_i, F_m-1(x_i) + w * c_i(x_i)) applied to all x_i's)
  3. Set F_m(x) = F_m-1(x) + w_m * f_m(x)

The strong classifier is the output, F_M(x). The loss function that makes this strong learner the same as the Adaboost output is

L(y,f(x)) = exp(-y*f(x))

for classifiers taking values in {-1,1}. This is all explained in Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Section 10.4.

My question has to do with forward stagewise regression. It is a greedy algorithm, in that once w_i is estimated, it is fixed, then the weight w_i+1 is found, etc. This seems like it is really designed to handle "independent" weak classifiers, like stump classifiers, or tree classifiers restricted to mutually exclusive independent variables (features), so that the residual piece after fitting a classifier is not explained by that classifier any more.

In other words, to fit a set of functions to an given target function, I wouldn't fit the first one, fix that coefficient, then find the optimal coefficient for the second, holding the first constant, etc... unless I knew that the functions were independent. But this is what the algorithm does, more or less.

Does this explain the success of Adaboost with stump learners or decision trees compared to (from my experience) Adaboost with more comprehensive classifiers like SVM, or a linear model? Can someone give a reference? - I have not seen this aspect discussed in the literature. Thanks.

3
I'm reading it - it seems relevant here. Thanks.Charles Pehlivanian
After a quick read, I'm of the opinion that the paper does not answer my question (!). It does provide some leads that I am investigating. It's the greediness of the above algorithm that bothers me, unless one chooses the right classifiers to account for this. I guess that is my question.Charles Pehlivanian
The paper outlines some of the methods used to analyze the performance of boosting... these methods can actually be very difficult. I think the real answer to your question is that an explanation boosting's learning mechanism is still not completely clear in certain cases. By looking into the assumptions that go into each analysis method - (1) the original weak learner hypothesis coupled with VC analysis (2) analysis of the boosting margin - you can get an intuitive idea about how different factors should affect the generalization error.user1149913
If you're able to, check Schapire and Freund's book. The provide a lot o theoretical explanations to the algorithm. It is pretty complete and comprehensive, and I guess you might find your answer there.Ramiro

3 Answers

6
votes

I think you may be a bit confused, or using terminology I'm not familiar with. Nothing in AdaBoost or more general stage wise additive models is independent or mutually exclusive, nor were the designed to be mutually exclusive.

Does this explain the success of Adaboost with stump learners or decision trees compared to (from my experience) Adaboost with more comprehensive classifiers like SVM, or a linear model?

No. Methods that produce an ensemble of classifiers can be powerful, and their power comes mostly from the ability to reduce the error caused by variance of the base model. AdaBoost & others can also reduce the bias, but it is much easier to reduce variance induced error.

For this reason we use Decision Trees, as we can control the level of bias/variance on the tree by altering the maximum depth of the tree. This makes life easy, but they are not the end all of boosting (example: boosting in a high dimensional space is quite difficult, and trees are horrible in such situations).

We don't usually use linear models in boosting because they simply aren't that good at it. We can produce "easy" data sets that will not converge well being boosted by linear models without too much thought (consider 1 ring in another where each class is of equal size, and a base learner that cuts the inner (and therefor outer) ring in half). A lowly decision stump is often better simply because it has a non-linearity in it, allowing for a much faster adaption to the data.

We avoid complex models such as SVMs because they take a long time to train. Regardless of what base model you choose, AdaBoost is going to run towards the same type of solution (it tries to maximize the L1 margin, where SVMs maximize the L2 margin). If you have to boost 1000 trees, or 500 SVMs, its going to probably be a lot faster to boost the trees. This doesn't even get into all the parameter search you would have to do for each SVM for each model added. It is simply too time consuming. However, there are cases where this can work well - here is a face detection case.

There is also the issue of prediction time. If you need to boost 100 or 1000 models, its going to increase the prediction time by a 2 or 3 orders of magnitude. SMVs are already not the fastest predictors, and this only makes the situation worse.

The details of this are more picked up from the math then discussed in english. If you are interested in more explicit discussion of the issue about why such models work, read some of Leo Breiman's papers.

2
votes

The idea behind adaboost is each tree, or more generally weak learner, was trained individually and was trained in a certain order and weight on all of the outcomes. The first classifier is trained on the base input and a weight for the tree is calculated, the second classifier is trained with the elements that the first tree classified correctly weighted lower and the elements that the first tree misclassified weighted higher and a weight is learned for the second classifier, the third classifier is trained with things that the new classifier of the first and second tree classified correctly weighted lower and the things that they classified incorrectly weighted higher, and so on and so forth.

0
votes

Think about the simplest example in which you will fit a 1D curve with a linear model. Instead of approximating, you are going to learn the curve. So each time you pick up two data sets to learn the line crossing them. After plenty learning times, the line will be obtained by averaging all the parameters (weights) you learned. Such line will achieve the lowest in-sample error. And this is equivalent to the learning process in which you update the previous parameters given the new training set.

I am not sure whether I understand your question correctly, but if you are trying to fit with different models (linear, quadratic, cubic,exponential...) for the example above, the number of weights for each model is not the same. So the greedy approach as what people do in classification problems may not well-fit. One resolving method may be: you give the weight on each model, and use boosting to determine which model is best fit for the training data.

An alternative to do this regression is to use neural network as the weak learner. Here is one study that applied back-propagation on neural network. Every time a subset of the training set was randomly picked for the learning and boosting process. And stagewise additive modeling is used to update the weight. The error calculation and weight factor is slightly different but in a similar form with what are used in classification. The result indicates ada-neural network is more stable than back-propagation in the regression.

In classification problem, I am trying to understand why AdaBoost with stump learners is better than with SVM? Since AdaBoost is a greedy feature selector, given the same feature set, SVM is expected to outperform AdaBoost, isn't it? Actually it is feasible to use AdaBoost to select important features and SVM to classify examples. You can also build a AdaBoost tree that put the features falling into the SVM margin to the children nodes, and re-train them with SVM until they are correctly classified.