
I have an unbalanced dataset (6% positive), and I have used an xgboost model from the caret package.

This is my code:

gbmGrid <- expand.grid(nrounds = 50,
                       eta = 0.4,
                       max_depth = 2,
                       gamma = 0,
                       colsample_bytree=0.8,
                       min_child_weight=1,
                       subsample=1)

ctrl <- trainControl(method = "cv",
                     number = 10,
                     search = "grid", 
                     fixedWindow = TRUE,
                     verboseIter = TRUE,
                     returnData = TRUE,
                     returnResamp = "final",
                     savePredictions = "all",
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "smote",
                     selectionFunction = "best",
                     trim = FALSE,
                     allowParallel = TRUE)

classifier <- train(x = training_set[, -1],
                    y = training_set[, 1],
                    method = "xgbTree",
                    metric = "ROC",
                    trControl = ctrl,
                    tuneGrid = gbmGrid)

The problem is that every time I run the train line it gives a different ROC, sensitivity and specificity.

  ROC       Sens       Spec     
  0.696084  0.8947368  0.2736111

  ROC        Sens       Spec     
  0.6655806  0.8917293  0.2444444

** The expand.grid is set to the parameters of the best tuned model.

Does anyone understand why the model isn't stable?


2 Answers


As Vivek Kumar mentioned in his answer, boosting algorithms are stochastic. In addition, you are splitting your dataset into folds with trainControl (and resampling with SMOTE), which introduces further sources of randomness. Using set.seed to fix the initial random state lets you always get the same result, but that result may be a lucky (or unlucky) one, so relying on a single seeded run is best avoided.

A better approach is to run your code multiple times, for instance 10 times, until you are confident that the mean performance over multiple random initializations is representative. You can then report this mean (and ideally the standard deviation as well). In this case, do not use set.seed, or you won't get any variation.
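A minimal sketch of the "repeat and report mean ± sd" idea. Here a simple stochastic estimate stands in for the train() call; in your case, each iteration would call train(...) (without set.seed) and record the resampled ROC from classifier$results$ROC.

```r
# Stand-in for one train() run: returns a noisy performance estimate.
# Replace the body with your actual train(...) call and extract
# classifier$results$ROC instead.
run_once <- function() {
  mean(sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.3, 0.7)))
}

# Repeat the stochastic run 10 times and summarize.
rocs <- replicate(10, run_once())
cat(sprintf("mean: %.3f, sd: %.3f\n", mean(rocs), sd(rocs)))
```

Reporting the mean and standard deviation over several runs gives a far more honest picture of model performance than any single seeded run.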


This is because of the randomness in how xgboost selects the features for each split (your colsample_bytree = 0.8 subsamples columns at random).

Add the following line before your actual training code:

set.seed(100)

You can use any integer in place of 100.

This sets the seed of the pseudorandom number generator, which will then generate the exact same sequence of random numbers each time. So each time the code is run, the results will be the same.
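A quick base-R demonstration of what fixing the seed does: with the same seed, the pseudorandom sequence is identical, so every stochastic step downstream (column subsampling, SMOTE, CV fold assignment) repeats exactly.

```r
# Same seed -> same sequence of pseudorandom numbers.
set.seed(100)
a <- runif(5)

set.seed(100)
b <- runif(5)

identical(a, b)  # TRUE: the sequences match exactly

# Without reseeding, the generator just continues, so a third
# draw differs from the first two.
c <- runif(5)
identical(a, c)  # FALSE
```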