The df is split into train and test dataframes, and the train dataframe is split again into training and testing dataframes. The dependent variable Y is binary (a factor) with values 0 and 1. I'm trying to predict the class probabilities with this code (neural networks, caret package):

library(caret)

model_nn <- train(
  Y ~ ., training,
  method = "nnet",
  metric="ROC",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = TRUE,
    classProbs=TRUE
  )
)

model_nn_v2 <- model_nn
nnprediction <- predict(model_nn, testing, type="prob")
cmnn <-confusionMatrix(nnprediction,testing$Y)
print(cmnn) # The confusion matrix is to assess/compare the model

However, it gives me this error:

    Error: At least one of the class levels is not a valid R variable name;
    This will cause errors when class probabilities are generated because the
    variables names will be converted to  X0, X1 . Please use factor levels
    that can be used as valid R variable names  (see ?make.names for help).

I don't understand what "use factor levels that can be used as valid R variable names" means. The dependent variable Y is already a factor; is it not a valid R variable name?

PS: The code works perfectly if you remove classProbs=TRUE from trainControl() and metric="ROC" from train(). However, ROC is the metric I use to compare models and pick the best one, so I need a model trained with the "ROC" metric.

EDIT: Code example:

# You have to run all of this BEFORE running the model
classes <- c("a","b","b","c","c")
floats <- c(1.5,2.3,6.4,2.3,12)
dummy <- c(1,0,1,1,0)
chr <- c("1","2","2,","3","4")
Y <- c("1","0","1","1","0")
df <- cbind(classes, floats, dummy, chr, Y)
df <- as.data.frame(df)
df$floats <- as.numeric(df$floats)
df$dummy <- as.numeric(df$dummy)

classes <- c("a","a","a","b","c")
floats <- c(5.5,2.6,7.3,54,2.1)
dummy <- c(0,0,0,1,1)
chr <- c("3","3","3,","2","1")
Y <- c("1","1","1","0","0")
df <- cbind(classes, floats, dummy, chr, Y)
df <- as.data.frame(df)
df$floats <- as.numeric(df$floats)
df$dummy <- as.numeric(df$dummy)

1 Answer

There are two separate issues here.

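The first is the error message, which says it all: you have to use something other than "0" and "1" as the levels of your dependent factor variable Y. The message is about the factor's levels, not about the name Y itself: when class probabilities are generated, the levels become variable names, and "0" and "1" are not valid R variable names.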

You can do this in at least two ways, after you have built your dataframe df. The first is hinted at in the error message, i.e. use make.names:

df$Y <- make.names(df$Y)
df$Y
# "X1" "X1" "X1" "X0" "X0"

The second way is to use the levels function, which gives you explicit control over the names themselves (Y must already be a factor for this to work); it is shown here again with the names X0 and X1:

levels(df$Y) <- c("X0", "X1")
df$Y
# [1] X1 X1 X1 X0 X0
# Levels: X0 X1
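
The relabelling can also be sketched in a way that does not depend on whether Y arrives as a character vector or a factor. Since R 4.0, as.data.frame() no longer converts strings to factors by default, so df$Y from the example code may be a character column. The vector y below is a stand-in for df$Y (values taken from the second data block in the question); the names valid and y_factor are only illustrative.

```r
# Stand-in for df$Y from the question (second data block).
y <- c("1", "1", "1", "0", "0")

# A string is a syntactically valid R name only if make.names() leaves it
# unchanged; "0" and "1" are both altered (to "X0" and "X1").
valid <- make.names(unique(y)) == unique(y)

# Build the factor explicitly, mapping level "0" to "X0" and "1" to "X1".
y_factor <- factor(y, levels = c("0", "1"), labels = c("X0", "X1"))

print(valid)     # FALSE FALSE
print(y_factor)  # X1 X1 X1 X0 X0, with levels X0 X1
```

Assigning the result back with df$Y <- y_factor would give train() a factor whose levels are valid names, whichever R version built the dataframe.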

After adding either one of the above lines, the train() code shown in the question will run smoothly (replacing training with df), but it will still not produce any ROC values, giving this warning instead:

Warning messages:
1: In train.default(x, y, weights = w, ...) :
  The metric "ROC" was not in the result set. Accuracy will be used instead.

which brings us to the second issue here: in order to use the ROC metric, you have to add summaryFunction = twoClassSummary in the trControl argument of train():

model_nn <- train(
  Y ~ ., df,
  method = "nnet",
  metric="ROC",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = TRUE,
    classProbs=TRUE,
    summaryFunction = twoClassSummary # ADDED
  )
)

Running the above snippet with the toy data you provided still gives an error (missing ROC values), but this is probably due to the very small dataset combined with the large number of CV folds, and it should not happen with your own, full dataset (it works OK if I reduce the CV folds to number=3).
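
One plausible mechanism for those missing ROC values, offered as a sketch rather than a confirmed diagnosis: with only 5 rows and 10 folds, a held-out fold can contain at most one row, hence only one class, and the area under the ROC curve is undefined in that case. The base-R snippet below illustrates this with a rank-based (Mann-Whitney) AUC. The helper auc_rank is hypothetical and is not part of caret, and the probabilities are made up purely for illustration.

```r
# Rank-based (Mann-Whitney) AUC. Hypothetical helper, not a caret function.
# Returns NA when the observations contain only one class.
auc_rank <- function(obs, prob_pos, positive = "X1") {
  is_pos <- obs == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  if (n_pos == 0 || n_neg == 0) return(NA_real_)  # ROC curve is undefined
  r <- rank(prob_pos)                             # ties get average ranks
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities of class "X1", for illustration only.
obs  <- c("X1", "X1", "X1", "X0", "X0")
prob <- c(0.9, 0.8, 0.4, 0.6, 0.1)

auc_all  <- auc_rank(obs, prob)        # both classes present: 5/6
auc_fold <- auc_rank(obs[1], prob[1])  # one-row held-out fold: NA

print(auc_all)
print(auc_fold)
```

With fewer folds (such as number=3) each held-out set is larger and more likely to contain both classes, which is consistent with the run succeeding in that setting.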