I am following the slides at the following link and have reached slide 121 of 128. I would like to know how to replicate the AUC graph shown there; the author does not explain how it was produced (the same applies to slide 124). Secondly, slide 125 gives the following code:
bestRound = which.max(as.matrix(cv.res)[,3]-as.matrix(cv.res)[,4])
bestRound
When I run it, I receive the following error:
Error in as.matrix(cv.res)[, 2] : subscript out of bounds
The data can be downloaded from here, and I have reproduced my code below for reference.
Question: How can I reproduce the AUC graph as the author did, and why is the subscript out of bounds?
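One possible explanation, sketched under assumptions: in newer xgboost releases (roughly 0.6 onward, as far as I know) `xgb.cv` returns a list-like object rather than a plain table, with the per-round metrics stored in `cv.res$evaluation_log`. If so, indexing `as.matrix(cv.res)` by column position would fail. The data frame `eval_log` below is invented and only stands in for that log so the snippet runs without xgboost; the column names `test_auc_mean` and `test_auc_std` are assumptions to check against `names(cv.res$evaluation_log)`.

```r
# Hypothetical stand-in for cv.res$evaluation_log (all values are invented).
eval_log <- data.frame(
  iter           = 1:5,
  train_auc_mean = c(0.80, 0.85, 0.88, 0.90, 0.92),
  train_auc_std  = c(0.010, 0.010, 0.010, 0.010, 0.010),
  test_auc_mean  = c(0.78, 0.82, 0.84, 0.85, 0.84),
  test_auc_std   = c(0.020, 0.020, 0.010, 0.030, 0.020)
)

# Select columns by name rather than by position, so the code does not
# depend on how many columns the object happens to have.
score <- eval_log$test_auc_mean - eval_log$test_auc_std
bestRound <- which.max(score)

# Plot train and test AUC against the boosting round.
plot(eval_log$iter, eval_log$train_auc_mean, type = "l", col = "blue",
     ylim = range(eval_log$train_auc_mean, eval_log$test_auc_mean),
     xlab = "Round", ylab = "AUC")
lines(eval_log$iter, eval_log$test_auc_mean, col = "red")
legend("bottomright", legend = c("train", "test"),
       col = c("blue", "red"), lty = 1)
```

With a real run, `eval_log` would be replaced by `cv.res$evaluation_log`, provided the installed version exposes it under that name.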
----- Code ------
# Kaggle Winning Solutions
library(xgboost)
train <- read.csv('train.csv', header = TRUE)
test <- read.csv('test.csv', header = TRUE)
y <- train[, 1]
train <- as.matrix(train[, -1])
test <- as.matrix(test)
train[1, ]
# We want to determine who is more influential than the other
new.train <- cbind(train[, 12:22], train[, 1:11])
train = rbind(train, new.train)
y <- c(y, 1 - y)
x <- rbind(train, test)
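# NOTE: the function header was missing from the post; the default lambda = 1 is an assumption
calcRatio <- function(dat, i, j, lambda = 1) (dat[,i]+lambda)/(dat[,j]+lambda)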
A.follow.ratio = calcRatio(x,1,2)
A.mention.ratio = calcRatio(x,4,6)
A.retweet.ratio = calcRatio(x,5,7)
A.follow.post = calcRatio(x,1,8)
A.mention.post = calcRatio(x,4,8)
A.retweet.post = calcRatio(x,5,8)
B.follow.ratio = calcRatio(x,12,13)
B.mention.ratio = calcRatio(x,15,17)
B.retweet.ratio = calcRatio(x,16,18)
B.follow.post = calcRatio(x,12,19)
B.mention.post = calcRatio(x,15,19)
B.retweet.post = calcRatio(x,16,19)
x = cbind(x[,1:11],
A.follow.ratio,A.mention.ratio,A.retweet.ratio,
A.follow.post,A.mention.post,A.retweet.post,
x[,12:22],
B.follow.ratio,B.mention.ratio,B.retweet.ratio,
B.follow.post,B.mention.post,B.retweet.post)
AB.diff = x[,1:17]-x[,18:34]
x = cbind(x,AB.diff)
train = x[1:nrow(train),]
test = x[-(1:nrow(train)),]
set.seed(1024)
cv.res <- xgb.cv(data = train, nfold = 3, label = y, nrounds = 100, verbose = FALSE,
objective = 'binary:logistic', eval_metric = 'auc')
# Plot the AUC graph here
set.seed(1024)
cv.res = xgb.cv(data = train, nfold = 3, label = y, nrounds = 3000,
objective='binary:logistic', eval_metric = 'auc',
eta = 0.005, gamma = 1,lambda = 3, nthread = 8,
max_depth = 4, min_child_weight = 1, verbose = F,
subsample = 0.8,colsample_bytree = 0.8)
# This is where the code breaks: computing bestRound gives "subscript out of bounds"
bestRound <- which.max(as.matrix(cv.res)[,3]-as.matrix(cv.res)[,4])
bestRound
cv.res
cv.res[bestRound,]
set.seed(1024)
bst <- xgboost(data = train, label = y, nrounds = 3000,
objective='binary:logistic', eval_metric = 'auc',
eta = 0.005, gamma = 1,lambda = 3, nthread = 8,
max_depth = 4, min_child_weight = 1,
subsample = 0.8,colsample_bytree = 0.8)
preds <- predict(bst,test,ntreelimit = bestRound)
result <- data.frame(Id = 1:nrow(test), Choice = preds)
write.csv(result,'submission.csv',quote=FALSE,row.names=FALSE)
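To sanity-check AUC values outside of `xgb.cv`, the metric can also be computed by hand from true labels and predicted scores. The sketch below uses the rank-based (Mann-Whitney) formulation in base R. The helper `auc_by_rank` is hypothetical and not part of xgboost, and the labels and scores are invented for illustration.

```r
# Rank-based AUC: the probability that a randomly chosen positive example
# receives a higher score than a randomly chosen negative one (ties count 1/2).
auc_by_rank <- function(labels, scores) {
  stopifnot(length(labels) == length(scores), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # average ranks, so tied scores are handled
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented toy data: two negatives and two positives.
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc_by_rank(labels, scores)  # 0.75
```

Applied to held-out rows of the training data (where labels are known), this gives a number that can be compared with the test AUC reported by cross-validation.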