A two-part question. I'm trying to figure out:
(1) how to generate a ROC curve for a linear regression fit with lm() (and whether that is even valid), and
(2) how to do it with k-fold cross-validation so that I can get the mean ROC curve (and AUC).
If the outcome is a continuous variable, it has to be converted into a binary variable, right? Normally I would fit a logistic regression model using glm(..., family = 'binomial') instead, but is that the most appropriate way? (It seems like I'm just fitting a different model.)
I would like something like the plot shown on the cvAUC package's rdrr.io page (the red line is the mean ROC curve, the dotted lines are the k-fold ROC curves), but I'm not sure how to get there with my data.
Example with data(USArrests):
library(dplyr)
library(pROC)
data(USArrests)
# create train and test sets
set.seed(2021)
dat <- mutate(USArrests, index=1:nrow(USArrests))
train.dat <- sample_frac(dat, 0.5) # splits `dat` in half
test.dat <- subset(dat, !dat$index %in% train.dat$index) # uses other half to test
# trying to build predictions with lm()
fit <- lm(Murder ~ Assault, data = train.dat)
predicted <- predict(fit, test.dat, type = "response")
# roc curve
roc(test.dat$Murder ~ predicted, plot = TRUE, print.auc = TRUE) # AUC = 1.000
The code above gets results, but gives a warning:
Warning message:
In roc.default(response, m[[predictors]], ...) :
  'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead
I don't know what to do with that suggestion. I also got AUC = 1.000. Is this approach wrong, and if so, why?
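For context on the warning: as far as the pROC documentation describes it, when the response has more than two levels, roc() keeps only the first two and ignores the rest, so an AUC computed against a continuous outcome is not meaningful. A minimal sketch of the binary-outcome alternative is below. It assumes a median split on Murder; the threshold and the HighMurder column name are illustrative choices, not something dictated by the data.

```r
library(pROC)

data(USArrests)
set.seed(2021)

# Assumption: a state counts as "high murder" if its rate is above the sample median.
dat <- USArrests
dat$HighMurder <- as.integer(dat$Murder > median(dat$Murder))

# 50/50 train/test split by row index
train_idx <- sample(nrow(dat), size = floor(0.5 * nrow(dat)))
train_dat <- dat[train_idx, ]
test_dat  <- dat[-train_idx, ]

# Logistic regression on the binary outcome
fit  <- glm(HighMurder ~ Assault, data = train_dat, family = binomial)
prob <- predict(fit, newdata = test_dat, type = "response")

# ROC with an explicit two-level response (0 = control, 1 = case)
roc_obj <- roc(response = test_dat$HighMurder, predictor = prob,
               levels = c(0, 1), direction = "<", quiet = TRUE)
auc_value <- as.numeric(auc(roc_obj))
print(auc_value)
```

Setting levels and direction explicitly avoids the warning and keeps the orientation of the curve fixed, at the cost of committing to one (arbitrary) cutoff for the outcome.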
Moreover, this only works with a single train/test split. I'm not sure how to train with k folds. I think I have to combine it with caret::train() somehow. I tried the ROC solutions for random forest models from "ROC curve from training data in caret", but they do not work with my code.
Example:
library(caret)
library(MLeval)
train_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
rfFit <- train(Murder ~ Assault, data = USArrests, trControl = train_control, method = "lm")
rfFit$pred$mtry # NULL
res <- MLeval::evalm(rfFit) # error with error message below
MLeval: Machine Learning Model Evaluation
Input: caret train function object
Not averaging probs.
Group 1 type: cv
Error in `[.data.frame`(preds, c(G1, G2, "obs")) : undefined columns selected
lm() to glm() and create a binary outcome variable, how should I do its ROC with k-fold cross-validation? – LC-datascientist
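One way the k-fold version could be sketched without caret or MLeval is a manual loop with glm() and pROC. The sketch below is only illustrative: it assumes a median split on Murder (the HighMurder column is a made-up name), uses 5 stratified folds so every test fold contains both classes, and averages the per-fold curves vertically, that is, it averages sensitivity over a common grid of specificities.

```r
library(pROC)

data(USArrests)
set.seed(2021)

# Assumption: binary outcome from a median split on Murder.
dat <- USArrests
dat$HighMurder <- as.integer(dat$Murder > median(dat$Murder))

# Stratified fold assignment: shuffle fold labels within each class.
k <- 5
dat$fold <- NA_integer_
for (cls in c(0, 1)) {
  idx <- which(dat$HighMurder == cls)
  dat$fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

spec_grid <- seq(0, 1, by = 0.01)   # common specificity grid
fold_auc  <- numeric(k)
sens_mat  <- matrix(NA_real_, nrow = length(spec_grid), ncol = k)

for (i in seq_len(k)) {
  train_dat <- dat[dat$fold != i, ]
  test_dat  <- dat[dat$fold == i, ]

  fit  <- glm(HighMurder ~ Assault, data = train_dat, family = binomial)
  prob <- predict(fit, newdata = test_dat, type = "response")

  r <- roc(test_dat$HighMurder, prob,
           levels = c(0, 1), direction = "<", quiet = TRUE)
  fold_auc[i] <- as.numeric(auc(r))

  # Sensitivity interpolated at each grid specificity for this fold
  sens_mat[, i] <- as.numeric(unlist(
    coords(r, x = spec_grid, input = "specificity", ret = "sensitivity")))
}

mean_sens <- rowMeans(sens_mat)   # vertically averaged ROC curve
mean_auc  <- mean(fold_auc)       # mean of the per-fold AUCs
print(mean_auc)

# Dotted lines: per-fold curves; red line: mean curve
plot(1 - spec_grid, mean_sens, type = "n",
     xlab = "False positive rate", ylab = "True positive rate")
for (i in seq_len(k)) lines(1 - spec_grid, sens_mat[, i], lty = 3)
lines(1 - spec_grid, mean_sens, col = "red", lwd = 2)
abline(0, 1, col = "grey")
```

With only 50 rows, each test fold holds 10 states, so the per-fold curves are coarse step functions and the mean AUC will vary noticeably with the seed and with the cutoff chosen for the outcome.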