
I have an unbalanced dataset (6% positive), and I have used an xgboost model from the caret package.

This is my code:

gbmGrid <- expand.grid(nrounds = 50,
                       eta = 0.4,
                       max_depth = 2,
                       gamma = 0,
                       colsample_bytree=0.8,
                       min_child_weight=1,
                       subsample=1)

ctrl <- trainControl(method = "cv",
                     number = 10,
                     search = "grid", 
                     fixedWindow = TRUE,
                     verboseIter = TRUE,
                     returnData = TRUE,
                     returnResamp = "final",
                     savePredictions = "all",
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "smote",
                     selectionFunction = "best",
                     trim = FALSE,
                     allowParallel = TRUE)

classifier <- train(x = training_set[, -1],
                    y = training_set[, 1],
                    method = "xgbTree",
                    metric = "ROC",
                    trControl = ctrl,
                    tuneGrid = gbmGrid)

The problem is that every time I run the train line it gives a different ROC, sensitivity and specificity.

  ROC       Sens       Spec     
  0.696084  0.8947368  0.2736111

  ROC        Sens       Spec     
  0.6655806  0.8917293  0.2444444

** The expand.grid is set to the parameters of the best tuned model.

Does anyone understand why the model isn't stable?


2 Answers


As Vivek Kumar mentioned in his answer, boosting algorithms are stochastic. In addition, you are splitting your dataset into folds with trainControl (and resampling with SMOTE), which introduces further sources of randomness. Using set.seed to fix the initial random state lets you always get the same result, but that result may be a lucky (or unlucky) one, so relying on a single seeded run is best avoided.

A better approach is to run your code multiple times, for instance 10 times, until you are confident that the mean performance over multiple random initializations is representative. You can then report this mean (and ideally the standard deviation as well). In this case, do not use set.seed, or you won't get any variation.
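A minimal sketch of the "repeat and report mean ± sd" idea. Here a simple stochastic estimate stands in for the train() call; in your case, each iteration would call train(...) (without set.seed) and record the resampled ROC from classifier$results$ROC.

```r
# Stand-in for one train() run: returns a noisy performance estimate.
# Replace the body with your actual train(...) call and extract
# classifier$results$ROC instead.
run_once <- function() {
  mean(sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.3, 0.7)))
}

# Repeat the stochastic run 10 times and summarize.
rocs <- replicate(10, run_once())
cat(sprintf("mean: %.3f, sd: %.3f\n", mean(rocs), sd(rocs)))
```

Reporting the mean and standard deviation over several runs gives a far more honest picture of model performance than any single seeded run.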


This is because of the randomness in how xgboost selects the features for each split (your colsample_bytree = 0.8 subsamples columns at random).

Add the following line before your actual training code:

set.seed(100)

You can use any integer in place of 100.

This sets the seed of the pseudorandom number generator, which will then generate the exact same sequence of random numbers each time. So each time the code is run, the results will be the same.
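A quick base-R demonstration of what fixing the seed does: with the same seed, the pseudorandom sequence is identical, so every stochastic step downstream (column subsampling, SMOTE, CV fold assignment) repeats exactly.

```r
# Same seed -> same sequence of pseudorandom numbers.
set.seed(100)
a <- runif(5)

set.seed(100)
b <- runif(5)

identical(a, b)  # TRUE: the sequences match exactly

# Without reseeding, the generator just continues, so a third
# draw differs from the first two.
c <- runif(5)
identical(a, c)  # FALSE
```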