
I'm using the caret package in R for classification, and I get the following error message when training:

Error in train.default(train[, predictorsNames], train[, outcomeName], : Class probabilities are needed to score models using the area under the ROC curve. Set classProbs = TRUE in the trainControl() function.

I searched for this problem and found two questions that discuss similar issues: "Error when I try to predict class probabilities in R - caret" and "R caret train Error in evalSummaryFunction: cannnot compute class probabilities for regression". According to their answers, the problem may be caused by the outcome not being a factor or by invalid level names. However, I have already converted the outcome to a factor, tried different level names, and set classProbs=TRUE, and it still doesn't work.

library(caret)
library(gbm)

The data set I used is dat, which has 6 variables. I need to classify on the variable "FlagD60".

> dput(droplevels(head(dat,5)))
structure(list(FICO = c(689L, 689L, 689L, 783L, 783L), Line = c(4000.001686, 
3700.002962, 3600.001866, 14500.00101, 5262.002105), Balance = c(1686L, 
2962L, 1866L, 1014L, 2105L), Payment = c(53L, 79L, 33L, 21L, 
15L), Age = c(6L, 81L, 82L, 235L, 57L), FlagD60 = c(0L, 0L, 0L, 
0L, 0L)), .Names = c("FICO", "Line", "Balance", "Payment", "Age", 
"FlagD60"), row.names = c(NA, 5L), class = "data.frame")

I generated a new factor with levels "yes" and "no" for classification and split the data. Since I don't know whether the error comes from this preparation stage, I have included it for reference too.

### prepare for classification ###
outcomeName <- 'FlagD60'
predictorsNames <- names(dat)[names(dat) != outcomeName]
dat$FlagD60b=ifelse(dat$FlagD60==1,'yes','no')
dat$FlagD60b=as.factor(dat$FlagD60b)
outcomeName='FlagD60b'

trainIndex=createDataPartition(dat[,outcomeName],p=0.75,
                               list=FALSE,times=1)
train=dat[ trainIndex,]
test =dat[-trainIndex,]

Below is the result of levels(train$FlagD60b).

[1] "no"  "yes"

Then I built the model like this.

#### repeated 10-fold CV, grid, gbm ####
ctrl=trainControl(method = "repeatedcv",number = 10,repeats = 10, 
                  summaryFunction = twoClassSummary, 
                  classProbs = TRUE)

set.seed(520)
gbmfit=train(train[,predictorsNames], train[,outcomeName],
             method="gbm",
             trcontrol=ctrl,
             verbose=FALSE, 
             metric="ROC")

This gives the error shown above. Any ideas would be much appreciated.

The output of sessionInfo() is also included for reference.

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936   
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
[5] LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plyr_1.8.4      gbm_2.1.3       survival_2.39-4 caret_6.0-73    ggplot2_2.2.1   lattice_0.20-34

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9        magrittr_1.5       MASS_7.3-45        munsell_0.4.3      colorspace_1.3-2  
 [6] foreach_1.4.3      minqa_1.2.4        stringr_1.2.0      car_2.1-4          tools_3.3.1       
[11] nnet_7.3-12        pbkrtest_0.4-7     grid_3.3.1         gtable_0.2.0       nlme_3.1-128      
[16] mgcv_1.8-12        quantreg_5.29      MatrixModels_0.4-1 iterators_1.0.8    lme4_1.1-12       
[21] lazyeval_0.2.0     assertthat_0.1     tibble_1.2         Matrix_1.2-6       nloptr_1.0.4      
[26] reshape2_1.4.2     ModelMetrics_1.1.0 codetools_0.2-14   stringi_1.1.2      scales_0.4.1      
[31] stats4_3.3.1       SparseM_1.76   
Can you make a small subset of your data and share it? Or, ideally, simulate some data which yields the same result. See here on how to share your data in the least painful manner. - Roman Luštrik
@RomanLuštrik I've added a subset of the train data and the output of sessionInfo(). I hope this helps. - Allen
OK, excellent. Now pare down the code and show only the pieces needed to reproduce the error. - Roman Luštrik
@RomanLuštrik I deleted as much as possible and left only the data preparation, trainControl and train parts, where the error might come from. - Allen
Please copy/paste the code into a fresh R session and confirm that you are getting the same error. For instance, we don't have access to dat to create trainIndex. - Roman Luštrik

1 Answer

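I just had the same problem. I believe the issue is that the argument name in the train function should be trControl instead of trcontrol, with a capital C. R argument names are case-sensitive, so trcontrol=ctrl is not matched to trControl. As far as I can tell, train() then falls back to its default trainControl(), where classProbs is FALSE, and that is why metric="ROC" triggers the error even though ctrl sets classProbs = TRUE.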
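To illustrate why the misspelling fails silently, here is a minimal base-R sketch. toy_train is a hypothetical stand-in, not caret's actual code; it only assumes that, like train(), the real function declares trControl after a `...` argument, so a name that does not match exactly is swallowed by `...` instead of raising an "unused argument" error.

```r
# Hypothetical stand-in for caret::train(); only the argument layout matters.
# Arguments declared after `...` must be matched by their exact name.
toy_train <- function(x, y, method = "rf", ...,
                      metric = "Accuracy",
                      trControl = list(classProbs = FALSE)) {
  list(
    classProbs = trControl$classProbs,  # the control settings actually used
    ignored    = names(list(...))       # anything that fell into `...`
  )
}

# Plays the role of trainControl(classProbs = TRUE, ...)
ctrl <- list(classProbs = TRUE)

# Misspelled: `trcontrol` lands in `...`, so the default control is used.
wrong <- toy_train(1, 2, method = "gbm", trcontrol = ctrl, metric = "ROC")

# Correct spelling: `ctrl` is picked up.
right <- toy_train(1, 2, method = "gbm", trControl = ctrl, metric = "ROC")

print(wrong)
print(right)
```

In the question's code, the corresponding change would be replacing trcontrol=ctrl with trControl=ctrl in the call to train().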