I'm using xgboost to build a model. The dataset has only 200 rows and 10,000 columns.

I tried chi-squared feature selection to get 100 columns, but my confusion matrix looks like this:

           1    0
    1    190    0
    0     10    0

I tried using all 10,000 attributes, randomly selecting 100 attributes, and selecting 100 attributes according to chi-squared, but I never get the 0 class predicted. Is it because of the dataset, or because of the way I use xgboost?

My factor(pred.cv) always shows only one level, while factor(y+1) has two levels, 1 and 2.

param <- list("objective" = "binary:logistic",
              "eval_metric" = "error",
              "nthread" = 2,
              "max_depth" = 5,
              "eta" = 0.3,
              "gamma" = 0,
              "subsample" = 0.8,
              "colsample_bytree" = 0.8,
              "min_child_weight" = 1,
              "max_delta_step" = 5,
              "learning_rate" = 0.1,
              "n_estimators" = 1000,
              "seed" = 27,
              "scale_pos_weight" = 1
              )
nfold = 3
nrounds = 200
# reshaping into a single column means max.col can only ever return 1
pred.cv = matrix(bst.cv$pred, nrow=length(bst.cv$pred)/1, ncol=1)
pred.cv = max.col(pred.cv, "last")
factor(y+1)      # this is the target in train, levels 1 and 2
factor(pred.cv)  # this is the issue: it always has only 1 level
What is the proportion of the 1/2 levels in factor(y+1)? If imbalanced, perhaps try changing scale_pos_weight. – missuse
@missuse Only 10%; I will try that first! – user7700501
@missuse I think I have a clue. Online it says to do this: pred.cv = matrix(bst.cv$pred, nrow=length(bst.cv$pred)/num.class, ncol=num.class), but when I do this with num.class = 2, confusionMatrix(factor(y+1), factor(pred.cv)) returns the error "all arguments must have the same length", because factor(y+1) has length 980 while factor(pred.cv) has length 980/2. Do you know how to fix it? – user7700501
Try pred.cv = ifelse(bst.cv$pred < 0.5, 0, 1) and table(pred.cv, y). – missuse
@missuse I actually figured that out 5 minutes ago; I hadn't converted the probabilities to yes/no :/ Thanks! – user7700501
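
For reference, a minimal sketch of the fix from the comments, assuming bst.cv came from xgb.cv(..., prediction = TRUE) and y is the 0/1 training label:

pred.cv <- ifelse(bst.cv$pred < 0.5, 0, 1)  # bst.cv$pred is one P(y = 1) per training row
table(pred.cv, y)                           # confusion matrix: predicted vs. actual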

1 Answer


I found caret to be slow, and it cannot tune all the parameters of xgboost models without building a custom model, which is considerably more complicated than writing one's own evaluation function.

However, if you are doing up/down-sampling or SMOTE/ROSE, caret is the way to go, since it incorporates them correctly into the model evaluation phase (during resampling). See: https://topepo.github.io/caret/subsampling-for-class-imbalances.html
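
For illustration, a minimal sketch of resample-aware subsampling with caret (the data frame df and the two-level factor outcome y are assumptions here; sampling = "smote" additionally requires the DMwR package):

library(caret)

# caret re-applies SMOTE inside each resampling fold rather than once
# up front, so the resampled performance estimate is not optimistically biased
ctrl <- trainControl(method = "cv", number = 5,
                     sampling = "smote",    # or "up", "down", "rose"
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fit <- train(y ~ ., data = df, method = "xgbTree",
             metric = "ROC", trControl = ctrl)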

However, I found these techniques to have a very small impact on the results, and usually for the worse, at least in the models I trained.

scale_pos_weight gives a higher weight to a certain class; if the minority class is at 10% abundance, then setting scale_pos_weight around 5 - 10 should be beneficial.
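
A common heuristic is the ratio of negative to positive cases (the grid-search function below uses the same idea); a minimal sketch assuming y is the 0/1 training label:

# with a 10% minority class this gives roughly 9, inside the 5 - 10 range
param$scale_pos_weight <- sum(y == 0) / sum(y == 1)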

Tuning regularization parameters can be quite beneficial for xgboost. There are several: alpha, lambda and gamma; I found values in the range 0 - 3 to work well. Other useful parameters that add stochastic regularization (by injecting randomness) are subsample, colsample_bytree and colsample_bylevel. I found that playing with colsample_bylevel can also have a positive effect on the model; subsample and colsample_bytree you are already using. A sketch of these settings follows below.
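
As a rough starting point (the values here are assumptions within the 0 - 3 range mentioned above), these parameters could be added to the param list from the question:

# alpha/lambda penalize leaf weights (L1/L2); gamma penalizes splits
param <- modifyList(param, list(
  alpha = 1,                # L1 regularization on leaf weights
  lambda = 1,               # L2 regularization on leaf weights
  gamma = 1,                # minimum loss reduction required to make a split
  colsample_bylevel = 0.8   # column subsampling at each tree depth level
))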

I would test a much smaller eta with more trees to see whether the model benefits; early_stopping_rounds can speed up the process in that case.

Other eval_metric choices will probably be more informative than error. Try logloss or auc, and even map and ndcg; a sketch combining these ideas follows below.
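
A minimal xgb.cv sketch combining the last two suggestions, with dtrain assumed to be an xgb.DMatrix of the training data:

library(xgboost)

cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eval_metric = "auc",   # or "logloss"
                           eta = 0.01,            # much smaller step size
                           max_depth = 5),
             data = dtrain,
             nrounds = 5000,                # more trees to offset the low eta
             nfold = 5,
             early_stopping_rounds = 200,   # stop once test auc stalls
             verbose = 0)
cv$best_iteration                           # number of rounds actually needed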

Here is a function for a grid search over hyper-parameters. It uses auc as the evaluation metric, but one can change that easily.

xgb.par.opt = function(train, seed) {
  require(xgboost)
  ntrees = 2000
  # all parameter combinations to evaluate
  searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
                                  colsample_bytree = c(0.6, 0.8, 1),
                                  gamma = c(0, 1, 2),
                                  eta = c(0.01, 0.03),
                                  max_depth = c(4, 6, 8, 10))
  aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){

    # extract the parameters to test
    currentSubsampleRate <- parameterList[["subsample"]]
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentGamma <- parameterList[["gamma"]]
    currentEta <- parameterList[["eta"]]
    currentMaxDepth <- parameterList[["max_depth"]]
    set.seed(seed)

    xgboostModelCV <- xgb.cv(data = train,
                             nrounds = ntrees,
                             nfold = 5,
                             objective = "binary:logistic",
                             eval_metric = "auc",
                             verbose = 1,
                             print_every_n = 50,
                             early_stopping_rounds = 200,
                             stratified = TRUE,
                             # weight positives by the negative/positive ratio
                             # (train is assumed to be an xgb.DMatrix)
                             scale_pos_weight = sum(getinfo(train, "label") == 0) /
                                                sum(getinfo(train, "label") == 1),
                             max_depth = currentMaxDepth,
                             eta = currentEta,
                             gamma = currentGamma,
                             colsample_bytree = currentColsampleRate,
                             min_child_weight = 1,
                             subsample = currentSubsampleRate,
                             seed = seed)

    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)

    # keep iter plus the test auc mean/std at the early-stopped best iteration
    auc = xvalidationScores[xvalidationScores$iter == xgboostModelCV$best_iteration, c(1, 4, 5)]
    auc = cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
    names(auc) = c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
    print(auc)
    return(auc)
  })
  return(aucErrorsHyperparameters)
}

One can add other parameters to the expand.grid call.
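
Hypothetical usage, assuming X and y hold the training features and 0/1 labels (each combination's best-iteration auc is printed as it completes):

library(xgboost)

dtrain <- xgb.DMatrix(data = as.matrix(X), label = y)
results <- xgb.par.opt(dtrain, seed = 27)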

I usually tune hyper-parameters on one CV repetition and evaluate them on additional repetitions with other seeds, or on the validation set (though using the validation set for this should be done with caution to avoid over-fitting).