
I am currently following the slides from the following link. I am on slide 121/128 and I would like to know how to replicate the AUC. The author did not explain how to do so (the same applies to slide 124). Secondly, on slide 125 the following code is shown:

bestRound = which.max(as.matrix(cv.res)[,3]-as.matrix(cv.res)[,4])
bestRound

I receive the following error:

Error in as.matrix(cv.res)[, 2] : subscript out of bounds

The data for the following code can be downloaded from here, and I have reproduced the code below for your reference.

Question: How can I produce the AUC as the author does, and why is the subscript out of bounds?

----- Code ------

# Kaggle Winning Solutions

train <- read.csv('train.csv', header = TRUE)
test <- read.csv('test.csv', header = TRUE)
y <- train[, 1]
train <- as.matrix(train[, -1])
test <- as.matrix(test)

train[1, ]

# We want to determine which of the two users is more influential

new.train <- cbind(train[, 12:22], train[, 1:11])
train = rbind(train, new.train)
y <- c(y, 1 - y)

x <- rbind(train, test)

# ratio helper used below (function wrapper added so the code runs;
# lambda = 1 assumed as the smoothing constant)
calcRatio <- function(dat, i, j, lambda = 1) {
  (dat[, i] + lambda) / (dat[, j] + lambda)
}

A.follow.ratio = calcRatio(x,1,2)
A.mention.ratio = calcRatio(x,4,6)
A.retweet.ratio = calcRatio(x,5,7)
A.follow.post = calcRatio(x,1,8)
A.mention.post = calcRatio(x,4,8)
A.retweet.post = calcRatio(x,5,8)
B.follow.ratio = calcRatio(x,12,13)
B.mention.ratio = calcRatio(x,15,17)
B.retweet.ratio = calcRatio(x,16,18)
B.follow.post = calcRatio(x,12,19)
B.mention.post = calcRatio(x,15,19)
B.retweet.post = calcRatio(x,16,19)

x = cbind(x[,1:11],
          A.follow.ratio,A.mention.ratio,A.retweet.ratio,
          A.follow.post,A.mention.post,A.retweet.post,
          x[,12:22],
          B.follow.ratio,B.mention.ratio,B.retweet.ratio,
          B.follow.post,B.mention.post,B.retweet.post)

AB.diff = x[,1:17]-x[,18:34]
x = cbind(x,AB.diff)
train = x[1:nrow(train),]
test = x[-(1:nrow(train)),]

set.seed(1024)
cv.res <- xgb.cv(data = train, nfold = 3, label = y, nrounds = 100, verbose = FALSE,
                 objective = 'binary:logistic', eval_metric = 'auc')

This is where I would like to plot the AUC graph.

set.seed(1024)
cv.res = xgb.cv(data = train, nfold = 3, label = y, nrounds = 3000,
                objective='binary:logistic', eval_metric = 'auc',
                eta = 0.005, gamma = 1,lambda = 3, nthread = 8,
                max_depth = 4, min_child_weight = 1, verbose = F,
                subsample = 0.8,colsample_bytree = 0.8)

Here is where the code breaks:

#bestRound: -  subscript out of bounds

bestRound <- which.max(as.matrix(cv.res)[,3]-as.matrix(cv.res)[,4])
bestRound
cv.res

cv.res[bestRound,]

set.seed(1024)
bst <- xgboost(data = train, label = y, nrounds = 3000,
                             objective='binary:logistic', eval_metric = 'auc',
                             eta = 0.005, gamma = 1,lambda = 3, nthread = 8,
                             max_depth = 4, min_child_weight = 1,
                             subsample = 0.8,colsample_bytree = 0.8)
preds <- predict(bst,test,ntreelimit = bestRound)

result <- data.frame(Id = 1:nrow(test), Choice = preds)
write.csv(result,'submission.csv',quote=FALSE,row.names=FALSE)

1 Answer


Many parts of the code make little sense to me, but here is a minimal example of building a model with the provided data:

Data:

train <- read.csv('train.csv', header = TRUE)
y <- train[, 1]
train <- as.matrix(train[, -1])

Model:

library(xgboost)
cv.res <- xgb.cv(data = train, nfold = 3, label = y, nrounds = 100, verbose = FALSE,
                 objective = 'binary:logistic', eval_metric = 'auc', prediction = T)

To obtain cross-validation predictions, you must specify prediction = TRUE when calling xgb.cv.
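For instance, the held-out AUC can be reproduced directly from those out-of-fold predictions. A minimal sketch, assuming the `cv.res` object from the call above and the pROC package:

```r
library(pROC)
# AUC over the out-of-fold predictions; this should closely match the
# final test_auc_mean reported in cv.res$evaluation_log
pROC::auc(response = y, predictor = cv.res$pred)
```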

To obtain best iteration:

it = which.max(cv.res$evaluation_log$test_auc_mean)
best.iter = cv.res$evaluation_log$iter[it]
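The per-round AUC can also be plotted from the same log, which is roughly what the slides show. A sketch, assuming the `train_auc_mean`/`test_auc_mean` columns returned by a recent xgboost version:

```r
log <- cv.res$evaluation_log
# test AUC per boosting round, with train AUC overlaid for comparison
plot(log$iter, log$test_auc_mean, type = 'l', col = 'red',
     xlab = 'round', ylab = 'AUC',
     ylim = range(log$train_auc_mean, log$test_auc_mean))
lines(log$iter, log$train_auc_mean, col = 'blue')
legend('bottomright', legend = c('train', 'test'),
       col = c('blue', 'red'), lty = 1)
```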

To plot the ROC curve on the cross-validation results:

library(pROC)
plot(pROC::roc(response = y,
               predictor = cv.res$pred,
               levels=c(0, 1)),
     lwd=1.5) 


To get a confusion matrix (assuming a probability threshold of 0.5):

library(caret)
# newer versions of caret require factor inputs
confusionMatrix(factor(ifelse(cv.res$pred <= 0.5, 0, 1)), factor(y))
#output
          Reference
Prediction    0    1
         0 2020  638
         1  678 2164

               Accuracy : 0.7607         
                 95% CI : (0.7492, 0.772)
    No Information Rate : 0.5095         
    P-Value [Acc > NIR] : <2e-16         

                  Kappa : 0.5212         
 Mcnemar's Test P-Value : 0.2823         

            Sensitivity : 0.7487         
            Specificity : 0.7723         
         Pos Pred Value : 0.7600         
         Neg Pred Value : 0.7614         
             Prevalence : 0.4905         
         Detection Rate : 0.3673         
   Detection Prevalence : 0.4833         
      Balanced Accuracy : 0.7605         

       'Positive' Class : 0 

That being said, you should aim to tune hyper-parameters such as eta, gamma, lambda, subsample, colsample_bytree, colsample_bylevel, etc. with cross-validation.

The easiest way is to construct a grid search: use expand.grid on all combinations of hyper-parameters, then lapply over the grid with xgb.cv as part of a custom function. If you need more detail, please comment.
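A sketch of such a grid search; the parameter values here are arbitrary placeholders, not tuned recommendations:

```r
library(xgboost)

grid <- expand.grid(eta = c(0.005, 0.01, 0.05),
                    max_depth = c(3, 4, 6),
                    subsample = c(0.7, 0.8))

results <- lapply(seq_len(nrow(grid)), function(i) {
  p <- grid[i, ]
  # run 3-fold CV for this hyper-parameter combination
  cv <- xgb.cv(data = train, label = y, nfold = 3, nrounds = 500,
               objective = 'binary:logistic', eval_metric = 'auc',
               eta = p$eta, max_depth = p$max_depth, subsample = p$subsample,
               verbose = FALSE)
  data.frame(p, best_auc = max(cv$evaluation_log$test_auc_mean))
})
do.call(rbind, results)  # pick the row with the highest best_auc
```

Each row of the resulting data frame pairs one hyper-parameter combination with its best cross-validated AUC.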