How to predict accuracy of xgboost binary choice model?

Question

I am building an XGBoost model for binary choice prediction, but I am having trouble generating predictions. How do I go from the end of this code to actual predictions on the test data? My data has 7 independent variables and one dependent variable, which is a binary choice.

choice <- dataset_training$choiceprobX
set.seed(1234)
ind <- sample(2, nrow(dataset_training), replace=TRUE, prob=c(0.67, 0.33))
training <- as.matrix(dataset_training[ind==1, 1:7])
head(training)
testing <- as.matrix(dataset_training[ind==2, 1:7])
head(testing)
dataset_trainLabel <- dataset_training[ind==1, 8]
head(dataset_trainLabel)
dataset_testLabel <- dataset_training[ind==2, 8]
head(dataset_testLabel)
xgb.train <- xgb.DMatrix(data=training,label=dataset_trainLabel)
xgb.test <- xgb.DMatrix(data=testing,label=dataset_testLabel)
params = list(
  booster="gbtree",
  eta=0.01,
  max_depth=5,
  gamma=3,
  subsample=0.75,
  colsample_bytree=1,
  objective="binary:logistic",
  eval_metric="logloss"
)
xgb.fit=xgb.train(
  params=params,
  data=xgb.train,
  nrounds=10,
  nthread=1,
  early_stopping_rounds=10,
  watchlist=list(val1=xgb.train,val2=xgb.test),
  verbose=0
)
xgb.fit

My goal is to generate a confusion matrix, but when I try, I get an error saying that the data and reference should be factors with the same levels.

StupidWolf · Accepted Answer · 2020-03-27T19:08:18

Let's use the built-in iris dataset as an example, since I do not have your data:

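library(xgboost)

set.seed(100)
data = iris
# Recode the label as binary: 1 for versicolor, 0 otherwise
data$Species = as.numeric(data$Species=="versicolor")
# Use 100 random rows for training and the remaining 50 for testing
idx = sample(nrow(data),100)

dtrain <- xgb.DMatrix(as.matrix(data[idx,-5]), label = data$Species[idx])
dtest <- xgb.DMatrix(as.matrix(data[-idx,-5]), label = data$Species[-idx])

# Data sets to evaluate after each boosting round
watchlist <- list(train = dtrain, test = dtest)
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
             objective = "binary:logistic", eval_metric = "logloss")
xgb.fit  <- xgb.train(param, dtrain, nrounds = 10, watchlist)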

To build the confusion matrix, convert the predicted probabilities into 0 and 1 (1 when the probability is > 0.5), then pass the resulting table to caret's confusionMatrix function:

library(caret)
pred = as.numeric(predict(xgb.fit,dtest) >0.5)
obs = getinfo(dtest, "label")

confusionMatrix(table(pred,obs))
Confusion Matrix and Statistics

    obs
pred  0  1
   0 34  0
   1  1 15

               Accuracy : 0.98            
                 95% CI : (0.8935, 0.9995)
    No Information Rate : 0.7             
    P-Value [Acc > NIR] : 4.034e-07       

                  Kappa : 0.9533          

 Mcnemar's Test P-Value : 1               

            Sensitivity : 0.9714          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9375          
             Prevalence : 0.7000          
         Detection Rate : 0.6800          
   Detection Prevalence : 0.6800          
      Balanced Accuracy : 0.9857          

       'Positive' Class : 0
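
The error in the question typically appears when confusionMatrix is given plain numeric vectors, or factors whose levels differ, instead of a table. As an alternative to table(), here is a minimal sketch of the factor route. The probabilities and labels below are made-up illustration values, not output from the model above; in practice prob would come from predict(xgb.fit, dtest) and obs from getinfo(dtest, "label").

```r
# Hypothetical predicted probabilities and true 0/1 labels (illustration only)
prob <- c(0.92, 0.10, 0.65, 0.30, 0.48, 0.81)
obs  <- c(1, 0, 1, 0, 1, 1)

# Threshold at 0.5, then give both vectors the same explicit factor levels,
# so a class that is never predicted still appears as a level
pred_f <- factor(as.numeric(prob > 0.5), levels = c(0, 1))
obs_f  <- factor(obs, levels = c(0, 1))

# Base-R confusion table and overall accuracy
cm <- table(pred = pred_f, obs = obs_f)
accuracy <- sum(diag(cm)) / sum(cm)

# With caret loaded, the same factors can be passed directly:
# caret::confusionMatrix(data = pred_f, reference = obs_f, positive = "1")
```

Note that the output above reports 'Positive' Class : 0, so sensitivity and specificity there are computed with 0 as the positive class. Passing positive = "1", as in the commented call, should report them for class 1 instead.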
