I am trying to use XGBoost for binary classification and, as a newbie, I ran into a problem.

First, I trained the model “fit”:

fit <- xgboost(
    data = dtrain #as.matrix(dat[,predictors])
    , label = label 
    #, eta = 0.1                        # step size shrinkage 
    #, max_depth = 25                   # maximum depth of tree 
    , nround=100
    #, subsample = 0.5
    #, colsample_bytree = 0.5           # part of data instances to grow tree
    #, seed = 1
    , eval_metric = "merror"        # or "mlogloss" - evaluation metric 
    , objective = "binary:logistic" # we will train a binary classification model using logistic regression; other options: "multi:softprob", "multi:softmax" = multi-class classification
    , num_class = 2                 # Number of classes in the dependent variable.
    #, nthread = 3                  # number of threads to be used 
    #, silent = 1
    #, prediction=T
)

Then I tried to use that model to predict the labels for a new test data.frame:

predictions = predict(fit, as.matrix(test))
print(str(predictions))

As a result, I get twice as many probability values as there are rows in my test data.frame:

num [1:62210] 0.0567 0.0455 0.023 0.0565 0.0642 ...

I read that, since I am doing binary classification, I get 2 probabilities for each row of the test data.frame: one for label1 and one for label2. But how do I join that prediction vector (and what is the type of the “predictions” object?) with my data.frame “test” and keep, for each row, the label with the highest probability? I tried to rbind “predictions” and “test”, but got 62k rows in the merged data.frame (instead of the 31k in the initial “test”). Please show me how to get one prediction for each row.
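If the model really did produce two class probabilities per row (as a 2-class "multi:softprob" model would), the flat vector lists both values for each row consecutively, so one common fix is to reshape it into a two-column matrix and take the column with the higher probability. A sketch with made-up numbers, not your actual output:

```r
# Hypothetical flat prediction vector: two probabilities per test row,
# listed consecutively (row-major), as multi-class xgboost output is.
predictions <- c(0.0455, 0.0621, 0.90, 0.10, 0.30, 0.70)

# One row per test observation, one column per class.
prob_matrix <- matrix(predictions, ncol = 2, byrow = TRUE)

# Column index of the most probable class per row; subtract 1
# to get 0/1 labels.
pred_class <- max.col(prob_matrix) - 1
```

`pred_class` now has exactly one entry per test row, so `test$pred <- pred_class` lines up row-for-row; `rbind` fails because it stacks rows instead of adding a column.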

And the second question: since I get 2 probabilities (for label1 and label2) in “predictions” for each row of “test”, I expected those 2 values to sum to 1. But for one test row I get 2 small values: 0.0455073267221451 and 0.0621210783720016. Their sum is much less than 1... Why is that?
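As a quick sanity check (again with made-up numbers): if the output really were per-class probabilities from "multi:softprob", each row of the reshaped matrix would sum to 1. Row sums far below 1, like yours, suggest the model was not actually fit as a clean 2-class softprob model:

```r
# Made-up flat vector: the first row mimics the two small values
# observed; the second row is what proper per-class probabilities
# look like.
predictions <- c(0.0455, 0.0621, 0.35, 0.65)
prob_matrix <- matrix(predictions, ncol = 2, byrow = TRUE)
rowSums(prob_matrix)  # proper softprob rows sum to (almost) exactly 1
```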

Please explain these 2 things to me. I searched, but did not find any topic with a clear explanation...

1 Answer

First you need to create the test set: a matrix with the same p columns used for training, without the "outcome" variable (the y of the model).

Keep the labels of the test set (the truth) as a numeric vector.

Then it's just a couple of instructions. I suggest the caret package for its confusionMatrix function.

library(caret)
library(xgboost)

test_matrix <- data.matrix(test[, names(test) != "outcome"]) # your test matrix (without the labels)
test_labels <- as.numeric(test$outcome)                      # the test labels
xgb_pred <- predict(fit, test_matrix) # with objective "binary:logistic" this gives just one probability per row (a simple vector)
xgb_pred_class <- as.numeric(xgb_pred > 0.50) # to get your predicted labels
# keep in mind that 0.50 is a threshold that can be modified

confusionMatrix(as.factor(xgb_pred_class), as.factor(test_labels))
# this will print your confusion matrix
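For completeness, a self-contained sketch on toy data (hypothetical names, not your variables) showing that with `objective = "binary:logistic"` and no `num_class` parameter, `predict()` returns exactly one probability per test row — which also resolves the doubled-length vector from the question:

```r
library(xgboost)

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)                   # 100 rows, 2 features
y <- as.numeric(X[, 1] + rnorm(100, sd = 0.5) > 0)  # toy 0/1 labels

# Plain binary model: no num_class, "error" is the binary error rate.
fit <- xgboost(data = X, label = y, nrounds = 20,
               objective = "binary:logistic",
               eval_metric = "error",
               verbose = 0)

p <- predict(fit, X)
length(p)  # 100: one probability of the positive class per row
```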