11 votes

I am having problems running logistic regression with xgboost that can be summarized in the following example.

Let's assume I have a very simple dataframe with two predictors and one target variable:

import pandas as pd

df = pd.DataFrame({'X1': pd.Series([1, 0, 0, 1]),
                   'X2': pd.Series([0, 1, 1, 0]),
                   'Y':  pd.Series([0, 1, 1, 0])})

I can't post images because I'm new here, but we can clearly see that when X1 = 1 and X2 = 0, Y is 0, and when X1 = 0 and X2 = 1, Y is 1.

My idea is to build a model that outputs the probability that an observation belongs to each one of the classes, so if I run xgboost to predict two new observations, (1,0) and (0,1), like so:

import xgboost as xgb

X = df[['X1', 'X2']].values
y = df['Y'].values

# the two new observations to score: (1,0) and (0,1)
test = pd.DataFrame({'X1': pd.Series([1, 0]), 'X2': pd.Series([0, 1])})

params = {'objective': 'binary:logistic',
          'num_class': 2}

clf1 = xgb.train(params=params, dtrain=xgb.DMatrix(X, y), num_boost_round=100)
clf1.predict(xgb.DMatrix(test.values))

the output is:

array([[ 0.5,  0.5],
       [ 0.5,  0.5]], dtype=float32)

which, I imagine, means that for the first observation there is a 50% chance of it belonging to each one of the classes.

I'd like to know why the algorithm won't output a proper (1, 0), or something closer to that, given that the relationship between the variables is so clear.

FYI, I did try with more data (I'm only using 4 rows for simplicity) and the behavior is almost the same. What I do notice is that not only do the probabilities not sum to 1, they are often very small, like so (this result is from a different dataset, nothing to do with the example above):

array([[ 0.00356463,  0.00277259],
       [ 0.00315137,  0.00268578],
       [ 0.00453343,  0.00157113],
Are your two predictors just 0s and 1s? If so, there are only 4 possible combinations of your features, and you should thus expect (at most) 4 unique probability predictions. – David

Yes, they are. OK, only 4 possible combinations makes sense, but I'm not sure how that answers my question. – Italo

I'm confused, what is your question? I thought you didn't understand why there was so little variance in your probability predictions. – David

My question is: why is the prediction (0.5, 0.5), meaning a 50% chance of being class 1 and a 50% chance of being class 0, when it is clear that when X1 = 1 and X2 = 0, Y is 0? – Italo

1 Answer

4 votes

OK, here's what is happening.

The clue as to why it isn't working is in the fact that it cannot train properly on such a small dataset. I trained this exact model, and if you observe the dump of all the trees you will see that they cannot split.

(tree dump below)

NO SPLITS, they have been pruned!

[1] "booster[0]" "0:leaf=-0" "booster[1]" "0:leaf=-0" "booster[2]" "0:leaf=-0" [7] "booster[3]" "0:leaf=-0" "booster[4]" "0:leaf=-0" "booster[5]" "0:leaf=-0" [13] "booster[6]" "0:leaf=-0" "booster[7]" "0:leaf=-0" "booster[8]" "0:leaf=-0" [19] "booster[9]" "0:leaf=-0"

There isn't enough weight in each of the leaves to overpower xgboost's internal regularization (which penalizes the model for growing).

These parameters may or may not be accessible from the Python version, but you can grab them from R if you do a GitHub install:

http://xgboost.readthedocs.org/en/latest/parameter.html

lambda [default=1] L2 regularization term on weights

alpha [default=0] L1 regularization term on weights

Basically, this is why your example trains better as you add more data, but cannot train at all with only 4 examples and default settings.
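For what it's worth, recent Python releases do let you pass these keys straight into the params dict of xgb.train. Here is a minimal sketch of the idea on the 4-row example; the exact values, the dropped num_class, and the extra min_child_weight knob (the minimum leaf weight, which a 4-row dataset cannot satisfy at its default) are my assumptions, not part of the original answer:

import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'X1': [1, 0, 0, 1], 'X2': [0, 1, 1, 0], 'Y': [0, 1, 1, 0]})
dtrain = xgb.DMatrix(df[['X1', 'X2']].values, label=df['Y'].values)

# Relax the regularization so the tiny dataset can actually produce splits.
# 'lambda'/'alpha' are the L2/L1 terms quoted above; min_child_weight is the
# minimum hessian sum required in each leaf, which four rows cannot reach at
# its default of 1.
params = {'objective': 'binary:logistic',   # plain binary objective, no num_class
          'lambda': 0,
          'alpha': 0,
          'min_child_weight': 0}

clf2 = xgb.train(params=params, dtrain=dtrain, num_boost_round=100)
new_obs = xgb.DMatrix(pd.DataFrame({'X1': [1, 0], 'X2': [0, 1]}).values)
print(clf2.predict(new_obs))   # probabilities should now move well away from 0.5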