0 votes

XGBoost gives me 100% prediction accuracy for a binary classification problem. This seems too good to be true. How can I solve it?

I am using a normalized dataset (min-max or z-score) that I have already split into a training set and a validation set, and I train on the training set to predict the validation set. The two subsets are obviously very similar, but there is nothing I can do about that. I also avoid look-ahead bias. What else could cause 100% accuracy, and how can I fix it? Thank you very much!

My code is:

library(xgboost)
library(caret)

# split predictors (columns 1-39) and label (column 40)
train_x = data.matrix(tmp[, -40])
train_y = tmp[, 40]
test_x  = data.matrix(tmp2[, -40])
test_y  = tmp2[, 40]

xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test  = xgb.DMatrix(data = test_x, label = test_y)

set.seed(12345)
xgbc = xgboost(data = xgb_train, max.depth = 4, nrounds = 200, nthread = 2,
               eta = 1, objective = "binary:logistic")
print(xgbc)

preds = predict(xgbc, test_x)
preds[preds > 0.5]  = "1"
preds[preds <= 0.5] = "0"
pred_y = as.factor(test_y)
print(pred_y)

test_y = as.factor(test_y)
cm = confusionMatrix(test_y, pred_y)
print(cm)

Code output is:

> xgbc=xgboost(data=xgb_train, max.depth=4, nrounds=200, nthread=2, eta=1, objective="binary:logistic")
[1] train-error:0.415888 
[2] train-error:0.390654 
[3] train-error:0.368692 
[4] train-error:0.323832 
[5] train-error:0.307944 
[6] train-error:0.278037 
[7] train-error:0.259346 
[8] train-error:0.240187 
[9] train-error:0.232710 
[10]    train-error:0.224766 
[11]    train-error:0.208879 
[12]    train-error:0.192523 
[13]    train-error:0.185981 
[14]    train-error:0.177103 
[15]    train-error:0.168224 
[16]    train-error:0.157944 
[17]    train-error:0.141121 
[18]    train-error:0.132243 
[19]    train-error:0.132243 
[20]    train-error:0.121495 
[21]    train-error:0.109346 
[22]    train-error:0.101869 
[23]    train-error:0.100000 
[24]    train-error:0.090654 
[25]    train-error:0.080374 
[26]    train-error:0.078505 
[27]    train-error:0.069626 
[28]    train-error:0.063084 
[29]    train-error:0.066822 
[30]    train-error:0.056542 
[31]    train-error:0.044860 
[32]    train-error:0.042991 
[33]    train-error:0.039252 
[34]    train-error:0.037383 
[35]    train-error:0.029439 
[36]    train-error:0.023832 
[37]    train-error:0.018692 
[38]    train-error:0.011682 
[39]    train-error:0.011215 
[40]    train-error:0.010748 
[41]    train-error:0.009346 
[42]    train-error:0.007477 
[43]    train-error:0.005140 
[44]    train-error:0.005140 
[45]    train-error:0.006075 
[46]    train-error:0.003271 
[47]    train-error:0.002804 
[48]    train-error:0.003271 
[49]    train-error:0.002804 
[50]    train-error:0.002804 
[51]    train-error:0.002336 
[52]    train-error:0.002336 
[53]    train-error:0.002336 
[54]    train-error:0.002336 
[55]    train-error:0.000935 
[56]    train-error:0.000467 
[57]    train-error:0.000000 
[58]    train-error:0.000000 
[59]    train-error:0.000000 
[60]    train-error:0.000935 
[61]    train-error:0.000467 
[62]    train-error:0.000000 
[63]    train-error:0.000000 
[64]    train-error:0.000000 
[65]    train-error:0.000000 
[66]    train-error:0.000000 
[67]    train-error:0.000000 
[68]    train-error:0.000000 
[69]    train-error:0.000000 
[70]    train-error:0.000000 
[71]    train-error:0.000000 
[72]    train-error:0.000000 
[73]    train-error:0.000000 
[74]    train-error:0.000000 
[75]    train-error:0.000000 
[76]    train-error:0.000000 
[77]    train-error:0.000000 
[78]    train-error:0.000000 
[79]    train-error:0.000000 
[80]    train-error:0.000000 
[81]    train-error:0.000000 
[82]    train-error:0.000000 
[83]    train-error:0.000000 
[84]    train-error:0.000000 
[85]    train-error:0.000000 
[86]    train-error:0.000000 
[87]    train-error:0.000000 
[88]    train-error:0.000000 
[89]    train-error:0.000000 
[90]    train-error:0.000000 
[91]    train-error:0.000000 
[92]    train-error:0.000000 
[93]    train-error:0.000000 
[94]    train-error:0.000000  
[95]    train-error:0.000000 
[96]    train-error:0.000000 
[97]    train-error:0.000000 
[98]    train-error:0.000000 
[99]    train-error:0.000000 
[100]   train-error:0.000000    

> print(xgbc)
##### xgb.Booster
raw: 186.6 Kb 
call:
xgb.train(params = params, data = dtrain, nrounds = nrounds, 
watchlist = watchlist, verbose = verbose, print_every_n = print_every_n, 
early_stopping_rounds = early_stopping_rounds, maximize = maximize, 
save_period = save_period, save_name = save_name, xgb_model = xgb_model, 
callbacks = callbacks, max.depth = 4, nthread = 2, eta = 1, 
objective = "binary:logistic")
params (as set within xgb.train):
max_depth = "4", nthread = "2", eta = "1", objective = "binary:logistic", 
silent = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
cb.evaluation.log() 
# of features: 38 
niter: 200
nfeatures : 38 
evaluation_log:
iter train_error
   1    0.415888
   2    0.390654
---                 
 199    0.000000
 200    0.000000

> preds=predict(xgbc,test_x)
> preds
[1] 7.273692e-01 1.643806e-02 3.032141e-04 9.764441e-01 9.691942e-02 5.343258e-01 9.090783e-01
[8] 5.609832e-01 4.061035e-01 1.105066e-01 4.406907e-03 9.946358e-01 7.929156e-01 4.119191e-03
[15] 3.098451e-01 2.945659e-04 3.966548e-03 7.829595e-01 1.698021e-01 9.574184e-01 7.132806e-01
[22] 1.044374e-01 9.024003e-01 5.769060e-01 5.096554e-02 1.751429e-01 9.982671e-01 9.993696e-01
[29] 6.521277e-01 5.780852e-03 4.867651e-01 9.707865e-01 8.398834e-01 1.825542e-01 1.134274e-01
[36] 7.154977e-02 5.450470e-01 1.047506e-01 3.099218e-03 2.268739e-01 9.023346e-01 8.026977e-01
[43] 3.844074e-01 4.463347e-01 8.543612e-01 9.998935e-01 8.699111e-01 6.243381e-02 1.137973e-01
[50] 9.385086e-01 9.994442e-01 8.376440e-01 8.492180e-01 3.362629e-04 4.316351e-02 9.234415e-01
[57] 8.924388e-01 9.977444e-01 6.618840e-02 2.186051e-04 1.647688e-03 8.050095e-03 6.535615e-01
[64] 4.707330e-01 9.138927e-01 5.177013e-02 3.349773e-04 9.392425e-01 4.979803e-02 2.934091e-01
[71] 8.948106e-01 9.854530e-01 9.795361e-02 9.275551e-01 5.865968e-01 9.746857e-01 3.859183e-01
[78] 1.194406e-01 3.267710e-01 6.294726e-01 9.250816e-01 6.118813e-02 3.394562e-01 7.257250e-04
[85] 8.491386e-01 7.081388e-03 3.268852e-01 8.931246e-01 2.204458e-01 8.818560e-01 9.923303e-01
[92] 9.845840e-01 7.688413e-01 9.803721e-01 9.958567e-01 9.500723e-01 7.733757e-01 9.368727e-01
[99] 3.276393e-01 9.952766e-01 2.130413e-01 8.992375e-02 8.594028e-02 8.160641e-01 9.915828e-01

> preds[preds>0.5] = "1"
> preds[preds<=0.5]= "0"
> pred_y = as.factor(test_y)
> print(pred_y)
[1] 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0
[51] 1 1 0 1 0 1 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1 1 1 0 0 1 0 0 0 1 1 1 1 0 1

> test_y=as.factor(test_y)
> cm = confusionMatrix(test_y, pred_y)
> print(cm)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 421   0
         1   0 497

               Accuracy : 1
                 95% CI : (0.996, 1)
    No Information Rate : 0.5414
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 1

 Mcnemar's Test P-Value : NA

            Sensitivity : 1.0000
            Specificity : 1.0000
         Pos Pred Value : 1.0000
         Neg Pred Value : 1.0000
             Prevalence : 0.4586
         Detection Rate : 0.4586
   Detection Prevalence : 0.4586
      Balanced Accuracy : 1.0000

       'Positive' Class : 0
Can you paste the output of the xgboost iterations, both for the train and the test set? – user2974951

Impossible to solve without the code, unfortunately. Most likely you accidentally compare true classes with true classes instead of predicted classes. – JBGruber

I just added the code, I hope it helps. Thank you very much! – baris

Can you post the output of the model-building process, all the iterations (all the measures)? Also, have you checked that tmp and tmp2 are different? – user2974951

I have just added the outputs. "Prediction 0 1 0 421 0 1 0 497" is the confusion matrix, in which I have 100% accuracy. I have also eyeballed that tmp and tmp2 are different; tmp is the training set and tmp2 is the validation set. However, how can I be sure that they are statistically different from each other? Apart from that, what could the problem be? Thank you very much! – baris
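
To address the last comment's question about making sure the two subsets really differ, a quick row-level check is sketched below. It assumes tmp and tmp2 are data frames with the same columns and only looks for exact duplicate rows:

# rows of tmp2 (validation) that also appear verbatim in tmp (training)
dup <- duplicated(rbind(tmp, tmp2))[(nrow(tmp) + 1):(nrow(tmp) + nrow(tmp2))]
cat("validation rows also present in the training set:", sum(dup), "\n")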

2 Answers

0 votes

It looks like you are seriously overfitting the training data, and you should use cross-validation instead of a single naive train/test split. There are several ways to do this; one is xgb.cv from the xgboost R package (I prefer Tidymodels, but that is a different rabbit hole). My guess is that if you tune a parameter such as gamma, you will end up with a non-zero training loss, because gamma > 0 prunes the trees and so limits overfitting. You can also reduce overfitting by growing fewer and shallower trees, subsampling rows and features, and so on. All of these options can be tuned with xgb.cv, as in the sketch below.
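
For illustration, a minimal sketch of such a cross-validated run follows. It reuses the xgb_train DMatrix from the question; the specific parameter values are only example settings and would themselves need tuning.

library(xgboost)

# regularized parameter set: shallower trees, smaller learning rate,
# gamma-based pruning, and row/feature subsampling
params <- list(
  objective = "binary:logistic",
  eval_metric = "error",
  max_depth = 3,
  eta = 0.1,
  gamma = 1,
  subsample = 0.8,
  colsample_bytree = 0.8
)

set.seed(12345)
cv <- xgb.cv(
  params = params,
  data = xgb_train,
  nrounds = 200,
  nfold = 5,                   # 5-fold cross-validation
  early_stopping_rounds = 20,  # stop when the held-out error stops improving
  verbose = 0
)

print(cv$evaluation_log)   # train and held-out error per boosting round
print(cv$best_iteration)   # round selected by early stopping

With early stopping, the held-out error rather than the training error decides how many trees to keep, which is exactly the signal a single in-sample fit cannot give you.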

-2 votes

Try checking the correlation of the predictor variables with the output, and try removing the variables with very high correlation, since they introduce high bias. This solved my own 100% accuracy issue; a sketch of the check follows.
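
A sketch of that correlation check, assuming (as in the question) that tmp is the training data frame with the label in column 40:

# correlation of each predictor with the target, sorted by magnitude
target <- as.numeric(tmp[, 40])
cors <- sapply(as.data.frame(tmp[, -40]),
               function(x) cor(as.numeric(x), target, use = "complete.obs"))
print(sort(abs(cors), decreasing = TRUE))

Any predictor whose correlation is close to 1 is worth inspecting, since it may simply encode the label.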