Tidymodels: Unnest the RMSE and RSQ Values for the Best Fitted Model from a TIbble after conducting a 10-fold cross validation in R

Question

Overview

I have produced four models using the tidymodels package with the data frame FID (see R-code below):

General Linear Model (glm)
Bagged Tree
Random Forest
Boosted Trees

The data frame contains three predictors:

Year (numeric)
Month (Factor)
Days (numeric)

The dependent variable is Frequency (numeric)

Aim

My aim is to unnest the best-fitted models (i.e. glm, bagged tree, random forest, boosted trees) to display the metrics RMSE and RSQ after conducting a 10 fold cross-validation from a tibble object produced using the function fit_samples().

Example of the tibble

# Resampling results
# 10-fold cross-validation 
# A tibble: 10 x 5
   splits         id     .metrics         .notes           .predictions    
   <list>         <chr>  <list>           <list>           <list>          
 1 <split [24/3]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 2 <split [24/3]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 3 <split [24/3]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 4 <split [24/3]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 5 <split [24/3]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 6 <split [24/3]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 7 <split [24/3]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
 8 <split [25/2]> Fold08 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [2 × 3]>
 9 <split [25/2]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [2 × 3]>
10 <split [25/2]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [2 × 3]>

I would like to visualise the best model (i.e. glm, bagged tree, random forest, boosted trees) by producing plots where the true value is on the x-axis and the predicted value is on the y-axis, as shown in the tutorial and plot figure below.

Tutorial

https://www.tmwr.org/performance.html

When I attempt to predict the fitted model on the test data using the function predict(), I keep on experiencing these error messages in attempt 1 and attempt 2:-

Error Message - Attempt 1

 Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "c('resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"

Error Message - Attempt 2

Error: `...` is not empty.

We detected these problematic arguments:
* `..1`

These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?

Issue

I am under the impression that I have to unnest the RMSE and RSQ metrics for all fitted models (i.e. glm, bagged trees, random forest, boosted trees) first before I can conduct model predictions with the fitted model on the test data in order to evaluate model effectiveness, or unnest the best model from the range of models examined during the 10 fold cross-validation from the function created for fitting the models.

If anyone is able to please help me fix the issue of predicting the test data on the fitted models using the function predict(), I would be deeply appreciative. I cannot visualise the RMSE and RSQ metrics in a single plot without binding the true and observed values into one data frame to plot using ggplot().

Many thanks in advance.

Plot figure

R-code

Attempt 1

    ##################################################
    ##Model Prediction
    ###################################################
    ##Open the tidymodels package
    library(tidymodels)
    library(tidyverse)
    library(glmnet)
    library(parsnip)
    library(rpart)
    library(tidyverse) # manipulating data
    library(skimr) # data visualization
    library(baguette) # bagged trees
    library(future) # parallel processing & decrease computation time
    library(xgboost) # boosted trees
    library(ranger)
    library(yardstick)
    library(purrr)
    library(forcats)    

###########################################################
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)

###########################################################
##Produce the recipe

rec <- recipe(Frequency ~ ., data = FID) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###########################################################
##Create Models
###########################################################

##########################################################
##General Linear Models
#########################################################

##glm
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

##Fit the model
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE,
                              extract = function(x) extract_model(x)))

##########################################################
##Bagged Trees
##########################################################

#####Bagged Trees
mod_bag <- bag_tree() %>%
            set_mode("regression") %>%
              set_engine("rpart", times = 10) #10 bootstrap resamples
                

##Create workflow
wflow_bag <- workflow() %>% 
                   add_recipe(rec) %>%
                       add_model(mod_bag)

##Fit the model
plan(multisession)

fit_bag <- fit_resamples(
                      wflow_bag,
                      cv,
                      metrics = metric_set(rmse, rsq),
                      control = control_resamples(save_pred = TRUE,
                              extract = function(x) extract_model(x)))
###################################################
##Random forests
###################################################

mod_rf <-rand_forest(trees = 1e3) %>%
                              set_engine("ranger",
                              num.threads = parallel::detectCores(), 
                              importance = "permutation", 
                              verbose = TRUE) %>% 
                              set_mode("regression") 
                              
##Create Workflow

wflow_rf <- workflow() %>% 
               add_model(mod_rf) %>% 
                     add_recipe(rec)

##Fit the model

plan(multisession)

fit_rf<-fit_resamples(
             wflow_rf,
             cv,
             metrics = metric_set(rmse, rsq),
             control = control_resamples(save_pred = TRUE,
                                         extract = function(x) extract_model(x)))

############################################################
##Boosted Trees
############################################################

mod_boost <- boost_tree() %>% 
                 set_engine("xgboost", nthreads = parallel::detectCores()) %>% 
                      set_mode("regression")

##Create Workflow

wflow_boost <- workflow() %>% 
                  add_recipe(rec) %>% 
                    add_model(mod_boost)

##Fit model

plan(multisession)

fit_boost <-fit_resamples(
                       wflow_boost,
                       cv,
                       metrics = metric_set(rmse, rsq),
                       control = control_resamples(save_pred = TRUE,
                                         extract = function(x) extract_model(x)))

Model Predictions

###################################
##Model Prediction
####################################

  ##glm model

  test_res <- predict(fit_glm, new_data = test_data %>% select(-Frequency))

  ##Error Message

      Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "c('resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"

##Predicted numeric outcome from the regression model is named .pred. Let’s match 
#the predicted values with their corresponding observed outcome values:

 bind_test_res <- bind_cols(test_res, test_data %>% select(Frequency))

#Note that both the predicted and observed outcomes are in log10 units. 
#It is best practice to analyze the predictions on the transformed scale 
#(if one were used) even if the predictions are reported using the original units.

Plot the data using ggplot():

 ggplot(bind_test_res, aes(x = Frequency, y = .pred)) + 
         # Create a diagonal line:
  geom_abline(lty = 2) + 
  geom_point(alpha = 0.5) + 
  labs(y = "Predicted Frequency (log10)", x = "Frequency (log10)") +
  # Scale and size the x- and y-axis uniformly:
  coord_obs_pred()

Attempt 2

##split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

##Produce the recipe

rec <- recipe(Frequency ~ ., data = FID) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)

Run our models
# Extract our prepped training data 
# and "bake" our testing data

prep<-prep(rec)

training_baked<-juice(prep)

testing_baked <- prep %>% bake(test_data) 

##glm model
glm_model<-linear_reg(mode="regression",
                      penalty = 0.1, 
                      mixture = 1) %>% 
                      set_engine("glmnet") 
                     

##Create workflow
wflow_glm <- workflow() %>% 
                 add_recipe(prep) %>%
                         add_model(glm_model)
                             
##fit the model                            
fit_glm<- wflow_glm %>% fit(Frequency~Year+Month+Days, data=FID)

##Error Message

Error: `...` is not empty.

We detected these problematic arguments:
* `..1`

These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?

Data frame - FID

structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March", 
"April", "May", "June", "July", "August", "September", "October", 
"November", "December"), class = "factor"), Frequency = c(36, 
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9, 
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 
43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31, 
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA, 
-36L), class = "data.frame")

Looks like you're fitting a variable called fit_glm, then trying to predict on glm_fit. Or am I misreading? — Hobo
I am attempting to predict the fitted model on the test data based on the code in the tutorial — Alice Hobbs
Understood; but (and I could be wrong), it looks like you're doing test_res <- predict(glm_fit, new_data = test_data %>% select(-Frequency)) when you should be doing test_res <- predict(fit_glm, new_data = test_data %>% select(-Frequency)) (based on what you trained - notice the difference in the first argument) — Hobo
That was a copy and past error when I was writing the question. Thank you for bringing my attention to that error. When I corrected the error, I am getting a new error message, which I do not understand. Would you be able to enlighten me? — Alice Hobbs
I'm not familiar enough with tidymodels to be sure, but I think it's because you're fitting with fit_resamples, not just fit. Have a look at Julia's answer here: community.rstudio.com/t/… (I haven't looked at the tutorial you're working through) — Hobo

Alice Hobbs Alice Hobbs · Accepted Answer · 2020-12-14T11:02:05

This answer is inspired by Max Khun

#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)

###########################################################
##Produce the recipe

rec <- recipe(Frequency ~ ., data = FID) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

##########################################################
##Produce Models
##########################################################
##General Linear Models
##########################################################

##Produce the glm model
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

#######################################################################
##MODEL EVALUATION
#######################################################################
##Estimate how well that model performs, let’s fit many times, 
##once to each of these resampled folds, and then evaluate on the heldout 
##part of each resampled fold.
##########################################################################
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )

##Collect model predictions for each fold for the predictor frequency

Predictions<-fit_glm %>% 
                    collect_predictions()

##Produce a data frame of the Predictions model

Prediction<-as.data.frame(Predictions)

##Open a new plotting window
dev.new()

##Visualise the data by plotting the predicted vs true values
ggplot(Prediction, aes(x = Frequency, y = .pred)) + 
  # Create a diagonal line:
  geom_abline(lty = 2) + 
  geom_point(alpha = 0.5) + 
  labs(y = "Predicted Frequency (log10)", x = "Frequency (log10)") +
  # Scale and size the x- and y-axis uniformly:
  coord_obs_pred()

Plot

Tidymodels: Unnest the RMSE and RSQ Values for the Best Fitted Model from a TIbble after conducting a 10-fold cross validation in R

1 Answers