Although it wasn't originally conceived this way, the standard Adaboost algorithm is equivalent to forward stagewise additive modeling with an exponential loss function. That is, given weak classifiers c_1,...,c_M and sample points x_1,...,x_N, the weights come out of the following algorithm:
- Set F_0(x) = 0
- For m = 1 to M: set (w_m, f_m) = argmin over (w, c), with c ranging over the weak classifiers, of sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + w * c(x_i))
- Set F_m(x) = F_{m-1}(x) + w_m * f_m(x)
The strong classifier is the output, F_M(x). The loss function that makes this strong learner the same as the Adaboost output is
L(y,f(x)) = exp(-y*f(x))
for classifiers taking values in {-1, 1}. This is all explained in Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, Section 10.4.
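To make the equivalence concrete, here is a minimal sketch of the stagewise loop with exponential loss and depth-1 trees (stumps). The function names and the use of scikit-learn's DecisionTreeClassifier as the weak learner are my own choices, and it assumes labels in {-1, 1}. For such classifiers the inner minimization over w has the closed form w_m = 0.5 * log((1 - err_m)/err_m), with err_m the weighted training error, and the per-point weights u_i = exp(-y_i F_{m-1}(x_i)) are exactly the example weights Adaboost maintains:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def stagewise_exp_loss(X, y, M=50):
    """Forward stagewise additive modeling with exponential loss and stumps.
    Assumes y takes values in {-1, 1}."""
    N = len(y)
    F = np.zeros(N)                      # F_{m-1} evaluated at the sample points
    stages = []                          # list of (w_m, f_m)
    for m in range(M):
        u = np.exp(-y * F)               # per-point weights induced by the loss
        u /= u.sum()
        # the argmin over c reduces to minimizing the weighted training error
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=u)
        pred = stump.predict(X)
        err = np.clip(u[pred != y].sum(), 1e-12, 1 - 1e-12)
        # the argmin over w has a closed form: the familiar Adaboost weight
        w = 0.5 * np.log((1 - err) / err)
        F = F + w * pred                 # F_m = F_{m-1} + w_m * f_m
        stages.append((w, stump))
    return stages

def predict(stages, X):
    return np.sign(sum(w * s.predict(X) for w, s in stages))
```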
My question has to do with forward stagewise regression. It is a greedy algorithm: once w_i is estimated, it is fixed, then the weight w_{i+1} is found, and so on. This seems to be designed for "independent" weak classifiers, like stump classifiers or tree classifiers restricted to mutually exclusive independent variables (features), so that the residual left after fitting a classifier is no longer explained by that classifier.
In other words, to fit a set of functions to a given target function, I wouldn't fit the first one, fix that coefficient, then find the optimal coefficient for the second while holding the first constant, and so on, unless I knew the functions were independent. But this is more or less what the algorithm does.
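To make the concern concrete, here is a small numeric sketch (my own illustration, not taken from any reference): fitting two basis functions one at a time, freezing each coefficient before moving on, reproduces the joint least-squares fit only when the functions are orthogonal on the sample; when they are correlated, the greedy coefficients differ from the jointly refit ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
f1 = rng.normal(size=n)                        # first "weak learner" on the sample
f2 = f1 + 0.3 * rng.normal(size=n)             # second one, correlated with f1
f2_orth = f2 - f1 * (f1 @ f2) / (f1 @ f1)      # orthogonalized against f1
y = 2.0 * f1 + 1.0 * f2 + 0.1 * rng.normal(size=n)

def stagewise(fa, fb, y):
    # greedy: fit the first coefficient, freeze it, fit the second on the residual
    wa = (fa @ y) / (fa @ fa)
    wb = (fb @ (y - wa * fa)) / (fb @ fb)
    return wa, wb

def joint(fa, fb, y):
    # refit both coefficients together (ordinary least squares)
    return np.linalg.lstsq(np.column_stack([fa, fb]), y, rcond=None)[0]

print(stagewise(f1, f2, y), joint(f1, f2, y))            # correlated: they disagree
print(stagewise(f1, f2_orth, y), joint(f1, f2_orth, y))  # orthogonal: they coincide
```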
Does this explain the success of Adaboost with stump learners or decision trees compared to (in my experience) Adaboost with more comprehensive classifiers like an SVM or a linear model? Can someone give a reference? I have not seen this aspect discussed in the literature. Thanks.