I'm trying to understand how XGBoost arrives at its predictions.

It seems that two inputs that follow the same path through the tree get two different predictions.

I'm running on the following dataset:

f1,f2,f3,f4,f5,f6,f7,f8,y
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
1,115,70,30,96,34.6,0.529,32,1
3,126,88,41,235,39.3,0.704,27,0
8,99,84,0,0,35.4,0.388,50,0
7,196,90,0,0,39.8,0.451,41,1
9,119,80,35,0,29.0,0.263,29,1
11,143,94,33,146,36.6,0.254,51,1
10,125,70,26,115,31.1,0.205,41,1
7,147,76,0,0,39.4,0.257,43,1
1,97,66,15,140,23.2,0.487,22,0
13,145,82,19,110,22.2,0.245,57,0
5,117,92,0,0,34.1,0.337,38,0
5,109,75,26,0,36.0,0.546,60,0
3,158,76,36,245,31.6,0.851,28,1
3,88,58,11,54,24.8,0.267,22,0
6,92,92,0,0,19.9,0.188,28,0
10,122,78,31,0,27.6,0.512,45,0
4,103,60,33,192,24.0,0.966,33,0
11,138,76,0,0,33.2,0.420,35,0
9,102,76,37,0,32.9,0.665,46,1
2,90,68,42,0,38.2,0.503,27,1

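Prediction and tree creation code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, plot_tree

# Load the data and split it into features and target
df = pd.read_csv("input.csv")
x = df[['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8']]
y = df[['y']]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

# Fit the classifier and predict on the held-out rows
model = XGBClassifier(n_jobs=-1)
model.fit(X_train, y_train)
res = model.predict(X_test)
print("X_test (first 2 rows:")
print(X_test.head(2))
print("Predictions (first 2 rows:")
print(res[0:2])

# Draw the tree
plot_tree(model)
plt.show()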

Output:

X_test (first 2 rows:
    f1   f2  f3  f4  f5    f6     f7  f8
33   6   92  92   0   0  19.9  0.188  28
36  11  138  76   0   0  33.2  0.420  35
Predictions (first 2 rows:
[0 1]

[Image: tree plot produced by plot_tree(model)]

Both inputs have f2 < 146.5 and f4 = 0, so they end up in the same leaf (-0.34). Why, then, are the predictions for the two different (0 and 1)?

1 Answer

What you have plotted is not the whole XGBoost model; it is only the first tree of the ensemble.

To see why this is so, check the source code of plot_tree:

def plot_tree(booster, fmap='', num_trees=0, rankdir=None, ax=None, **kwargs):
    """Plot specified tree.

and the documentation:

num_trees (int, default 0) – Specify the ordinal number of target tree

from which it is apparent that, when you do not specify the num_trees argument, as here, it takes the default value of 0, i.e. the first tree of the ensemble.

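Using different values for num_trees you will get different trees, and hence different decision paths for each sample. The model's prediction combines the outputs of all the trees, so two samples that share a leaf in the first tree can still end up with different predictions.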
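To make the point about combining trees concrete, here is a minimal sketch in plain Python. The leaf values are hypothetical (only the -0.34 comes from the plot in the question), the helper name predict_proba is made up for illustration, and the sketch assumes a zero base margin, which the real model may not have.

```python
import math


def predict_proba(leaf_values):
    """Sum the per-tree leaf values (log-odds) and squash with a sigmoid."""
    margin = sum(leaf_values)
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical leaf values: both samples share the first tree's leaf (-0.34),
# but land in different leaves of the later trees.
sample_a = [-0.34, -0.30, -0.25]
sample_b = [-0.34, 0.45, 0.40]

for name, leaves in (("a", sample_a), ("b", sample_b)):
    p = predict_proba(leaves)
    # The predicted class is 1 when the probability exceeds 0.5
    print(name, round(p, 3), int(p > 0.5))
```

Both samples start from the same first-tree contribution, yet the summed scores fall on opposite sides of zero, so the predicted classes differ.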

plot_tree does not draw all the trees of the boosting ensemble at once (and even if it did, the result would not be of any practical use). It is just a utility function for looking at individual trees of the model. Its use is shown in How to Visualize Gradient Boosting Decision Trees With XGBoost in Python.