
A two-part question. I'm trying to figure out:
(1) how to generate a ROC curve for a linear regression fit with lm() (and whether that's even a proper thing to do), and
(2) how to do it with k-fold cross-validation so that I can get the mean ROC curve (and AUC).

If the outcome is a continuous variable, it has to be converted into a binary variable, right? Normally I would fit a logistic regression model with glm(..., family = 'binomial') instead, but is that the most appropriate way? (It seems like I'm just fitting a different model at that point.)

I would like something like this plot below from the cvAUC package's rdrr.io website (red line is the mean ROC curve, dotted lines are k-fold ROC curves), but I'm not sure how to get there with my data.

[Example plot: 10-fold CV ROC curves]

Example with data(USArrests):

library(dplyr)
library(pROC)
data(USArrests)

# create train and test sets
set.seed(2021)
dat <- mutate(USArrests, index=1:nrow(USArrests))
train.dat <- sample_frac(dat, 0.5) # splits `dat` in half
test.dat <- subset(dat, !dat$index %in% train.dat$index) # uses other half to test

# trying to build predictions with lm()
fit <- lm(Murder ~ Assault, data = train.dat)
predicted <- predict(fit, test.dat, type = "response")

# roc curve
roc(test.dat$Murder ~ predicted, plot = TRUE, print.auc = TRUE) # AUC = 1.000

The code above runs and produces output, but with a warning:

Warning message: In roc.default(response, m[[predictors]], ...) : 'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead

I don't know what to do with its suggestion. It also gives AUC = 1.000 -- is this approach wrong, and if so, why?
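For what it's worth, if I dichotomize the outcome the way I described above, the warning goes away, but then I'm fitting a logistic regression rather than the lm() I started with. A quick sketch (cutting Murder at its mean is an arbitrary choice, only for illustration):

# dichotomize the outcome at its mean (arbitrary cutoff, illustration only)
train.bin <- mutate(train.dat, Murder.high = as.numeric(Murder > mean(dat$Murder)))
test.bin  <- mutate(test.dat,  Murder.high = as.numeric(Murder > mean(dat$Murder)))

# logistic regression instead of lm()
fit.bin <- glm(Murder.high ~ Assault, data = train.bin, family = binomial)
pred.bin <- predict(fit.bin, test.bin, type = "response")

# ROC on the binary outcome (no "more than two levels" warning)
roc(test.bin$Murder.high, pred.bin, plot = TRUE, print.auc = TRUE)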

Moreover, this only works with a single train/test split. I'm not sure how to do the training with k-fold sets; I think I have to combine it with caret::train() somehow. I tried adapting the ROC solutions for random forest models from "ROC curve from training data in caret", but they don't work with my code.

Example:

library(caret)
library(MLeval)

train_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
rfFit <- train(Murder ~ Assault, data = USArrests, trControl = train_control, method = "lm")

rfFit$pred$mtry # NULL
res <- MLeval::evalm(rfFit) # error with error message below

MLeval: Machine Learning Model Evaluation
Input: caret train function object
Not averaging probs.
Group 1 type: cv
Error in [.data.frame(preds, c(G1, G2, "obs")) : undefined columns selected
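My guess is that evalm() wants a classification model with class probabilities saved (which a regression fit from method = "lm" doesn't produce), so maybe something along these lines is closer to what it expects -- but again that requires a binary outcome, and I'm not sure it answers the lm() part of my question (quick sketch; names like Murder.cat are just for illustration):

# guess: make it a two-class problem so caret can save class probabilities
USArrests$Murder.cat <- factor(ifelse(USArrests$Murder > mean(USArrests$Murder),
                                      "high", "low"))

train_control <- trainControl(method = "cv", number = 10,
                              savePredictions = TRUE, classProbs = TRUE)
glmFit <- train(Murder.cat ~ Assault, data = USArrests,
                trControl = train_control, method = "glm")

res <- MLeval::evalm(glmFit)  # should now find the class-probability columns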

The approach you're taking is wrong IMO - the AUC is generated from trying to classify a binary (or multiple-category) variable with a continuous predictor (e.g., predictions from a GLM). If you're interested in the predictive accuracy of a linear model, why not use something like the R-squared? – DaveArmstrong
@DaveArmstrong, yeah, I thought it was wrong until I saw this RPubs tutorial; section 4.2 introduces doing "ROC for linear regression." I was also asked for ROC curves for my linear regression results, which is why I'm trying to figure this out. – LC-datascientist
If I change the lm() to glm() and create a binary outcome variable, how should I do its ROC with k-fold cross-validation? – LC-datascientist
I proposed a solution below. – DaveArmstrong

1 Answer


You could do the cross-validation like this if you switched it to a 0/1 variable:

library(rsample)
library(pROC)
library(tidyverse)  # dplyr, purrr, ggplot2, ...

# dichotomize the outcome at its mean
USArrests <- USArrests %>% 
  mutate(Murder01 = as.numeric(Murder > mean(Murder, na.rm=TRUE)))

set.seed(2021)

# fit on the analysis set, predict on the assessment set,
# return the predicted probabilities and the observed 0/1 outcome
cvfun <- function(split, ...){
  mod <- glm(Murder01 ~ Assault, data=analysis(split), family=binomial)
  fit <- predict(mod, newdata=assessment(split), type="response")
  data.frame(fit = fit, y = model.response(model.frame(formula(mod), data=assessment(split))))
}

# 10-fold CV repeated 5 times; one AUC per repeat
cv_out <- vfold_cv(USArrests, v=10, repeats = 5) %>% 
    mutate(fit = map(splits, cvfun)) %>% 
    unnest(fit) %>% 
    group_by(id) %>% 
    summarise(auc = roc(y, fit, plot=FALSE)$auc[1])

cv_out
# # A tibble: 5 x 2
#   id        auc
# * <chr>   <dbl>
# 1 Repeat1 0.936
# 2 Repeat2 0.928
# 3 Repeat3 0.937
# 4 Repeat4 0.918
# 5 Repeat5 0.942

That said, I'm not sure this is better than using something like the R-squared or MSE on the linear model. And I'm not super confident that the algorithm in the tutorial is actually doing something that makes sense statistically. I could definitely be wrong and would defer to someone with more expertise, but I can't see how it makes a lot of sense, and it certainly doesn't produce something meaningful in this case: you would expect an AUC of 1 only under perfect prediction.

Further, I'm not sure what probative value these numbers have. Generally you would want to use this sort of analysis to tune the model specification - often by finding nearly optimal values of hyper-parameters. You could imagine doing this with a different model specification. For example, you could evaluate the relative predictive power of a model with a second-degree polynomial in Assault versus one that was linear, as below.

cvfun2 <- function(split, ...){
  mod <- glm(Murder01 ~ poly(Assault, 2),  data=analysis(split), family=binomial)
  fit <- predict(mod, newdata=assessment(split), type="response")
  data.frame(fit = fit, y = model.response(model.frame(formula(mod), data=assessment(split))))
}

cv_out2 <- vfold_cv(USArrests, v=10, repeats = 5) %>% 
    mutate(fit = map(splits, cvfun2)) %>% 
    unnest(fit) %>% 
    group_by(id) %>% 
    summarise(auc = roc(y, fit, plot=FALSE)$auc[1])

mean(cv_out2$auc)
# [1] 0.9123994
mean(cv_out$auc)
# [1] 0.9320451
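
And if what you really want is a cross-validated accuracy summary for the original lm() rather than an AUC, the same rsample machinery gives you that directly. This is just a sketch along the lines of the code above (cvfun_lm is an illustrative helper, and RMSE / squared correlation are only two of several reasonable choices):

# illustrative helper: fit lm() on the analysis set, predict on the assessment set
cvfun_lm <- function(split, ...){
  mod <- lm(Murder ~ Assault, data = analysis(split))
  data.frame(pred = predict(mod, newdata = assessment(split)),
             obs  = assessment(split)$Murder)
}

cv_lm <- vfold_cv(USArrests, v = 10, repeats = 5) %>% 
  mutate(fit = map(splits, cvfun_lm)) %>% 
  unnest(fit) %>% 
  group_by(id) %>% 
  summarise(rmse = sqrt(mean((obs - pred)^2)),  # cross-validated RMSE per repeat
            rsq  = cor(obs, pred)^2)            # squared correlation as an R-squared analogue

cv_lm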

Edit - Making the ROC plot

# per repeat, pull out the full ROC curve (one row per threshold);
# summarise() returning multi-row results per group needs dplyr >= 1.0.0
cv_out_plot <- vfold_cv(USArrests, v=10, repeats = 5) %>% 
  mutate(fit = map(splits, cvfun)) %>% 
  unnest(fit) %>% 
  group_by(id) %>% 
  summarise(sens = roc(y, fit, plot=FALSE)$sensitivities, 
            spec = roc(y, fit, plot=FALSE)$specificities, 
            obs = 1:length(sens))

# average the curves point-wise across repeats
ave <- cv_out_plot %>% 
  ungroup %>% 
  group_by(obs) %>% 
  summarise(sens = mean(sens), 
            spec = mean(spec), 
            id = "Average")

cv_out_plot <- bind_rows(cv_out_plot, ave) %>% 
  mutate(col = factor(ifelse(id == "Average", "Average", "Individual"), 
                      levels=c("Individual", "Average")))



ggplot(cv_out_plot , aes(x=1-sens, y=spec, group=id, colour=col)) + 
  geom_line(aes(size=col, alpha=col)) + 
  scale_colour_manual(values=c("black", "red")) + 
  scale_size_manual(values=c(.5,1.25)) + 
  scale_alpha_manual(values=c(.3, 1)) + 
  theme_classic() + 
  theme(legend.position=c(.75, .15)) + 
  labs(x="1-Sensitivity", y="Specificity", colour="", alpha="", size="")

[Plot: per-repeat ROC curves in black with the point-wise average curve in red]