Although it wasn't originally conceived this way, the standard Adaboost algorithm is equivalent to forward stagewise additive modeling with an exponential loss function. That is, given weak classifiers c_1,...,c_M and sample points x_1,...,x_N, the weights come out of the following algorithm:
- Set F_0(x) = 0
- For m = 1 to M: set (w_m, f_m) = argmin over (w, c), with c ranging over the weak classifiers, of sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + w * c(x_i))
- Set F_m(x) = F_{m-1}(x) + w_m * f_m(x)
The strong classifier is the output, F_M(x). The loss function that makes this strong learner the same as the Adaboost output is
L(y,f(x)) = exp(-y*f(x))
for classifiers taking values in {-1, 1}. This is all explained in Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, Section 10.4.
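To make the equivalence concrete, here is a minimal sketch of the stagewise loop with exponential loss and depth-1 trees (stumps). The function names and the use of scikit-learn's DecisionTreeClassifier as the weak learner are my own choices, and it assumes labels in {-1, 1}. For such classifiers the inner minimization over w has the closed form w_m = 0.5 * log((1 - err_m)/err_m), with err_m the weighted training error, and the per-point weights u_i = exp(-y_i F_{m-1}(x_i)) are exactly the example weights Adaboost maintains:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def stagewise_exp_loss(X, y, M=50):
    """Forward stagewise additive modeling with exponential loss and stumps.
    Assumes y takes values in {-1, 1}."""
    N = len(y)
    F = np.zeros(N)                      # F_{m-1} evaluated at the sample points
    stages = []                          # list of (w_m, f_m)
    for m in range(M):
        u = np.exp(-y * F)               # per-point weights induced by the loss
        u /= u.sum()
        # the argmin over c reduces to minimizing the weighted training error
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=u)
        pred = stump.predict(X)
        err = np.clip(u[pred != y].sum(), 1e-12, 1 - 1e-12)
        # the argmin over w has a closed form: the familiar Adaboost weight
        w = 0.5 * np.log((1 - err) / err)
        F = F + w * pred                 # F_m = F_{m-1} + w_m * f_m
        stages.append((w, stump))
    return stages

def predict(stages, X):
    return np.sign(sum(w * s.predict(X) for w, s in stages))
```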
My question has to do with forward stagewise regression. It is a greedy algorithm: once w_i is estimated, it is fixed, then the weight w_{i+1} is found, and so on. This seems to be designed for "independent" weak classifiers, like stump classifiers or tree classifiers restricted to mutually exclusive independent variables (features), so that the residual left after fitting a classifier is no longer explained by that classifier.
In other words, to fit a set of functions to a given target function, I wouldn't fit the first one, fix that coefficient, then find the optimal coefficient for the second while holding the first constant, and so on, unless I knew the functions were independent. But this is more or less what the algorithm does.
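To make the concern concrete, here is a small numeric sketch (my own illustration, not taken from any reference): fitting two basis functions one at a time, freezing each coefficient before moving on, reproduces the joint least-squares fit only when the functions are orthogonal on the sample; when they are correlated, the greedy coefficients differ from the jointly refit ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
f1 = rng.normal(size=n)                        # first "weak learner" on the sample
f2 = f1 + 0.3 * rng.normal(size=n)             # second one, correlated with f1
f2_orth = f2 - f1 * (f1 @ f2) / (f1 @ f1)      # orthogonalized against f1
y = 2.0 * f1 + 1.0 * f2 + 0.1 * rng.normal(size=n)

def stagewise(fa, fb, y):
    # greedy: fit the first coefficient, freeze it, fit the second on the residual
    wa = (fa @ y) / (fa @ fa)
    wb = (fb @ (y - wa * fa)) / (fb @ fb)
    return wa, wb

def joint(fa, fb, y):
    # refit both coefficients together (ordinary least squares)
    return np.linalg.lstsq(np.column_stack([fa, fb]), y, rcond=None)[0]

print(stagewise(f1, f2, y), joint(f1, f2, y))            # correlated: they disagree
print(stagewise(f1, f2_orth, y), joint(f1, f2_orth, y))  # orthogonal: they coincide
```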
Does this explain the success of Adaboost with stump learners or decision trees compared to (in my experience) Adaboost with more comprehensive classifiers like an SVM or a linear model? Can someone give a reference? I have not seen this aspect discussed in the literature. Thanks.