Overview
I have produced four models using the tidymodels package with the data frame FID (see R-code below):
- General Linear Model (glm)
- Bagged Tree
- Random Forest
- Boosted Trees
The data frame contains three predictors:
- Year (numeric)
- Month (Factor)
- Days (numeric)
The dependent variable is Frequency (numeric)
Aim
My aim is to unnest the best-fitted models (i.e. glm, bagged tree, random forest, boosted trees) to display the metrics RMSE and RSQ after conducting a 10 fold cross-validation from a tibble object produced using the function fit_samples().
Example of the tibble
# Resampling results
# 10-fold cross-validation
# A tibble: 10 x 5
splits id .metrics .notes .predictions
<list> <chr> <list> <list> <list>
1 <split [24/3]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
2 <split [24/3]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
3 <split [24/3]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
4 <split [24/3]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
5 <split [24/3]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
6 <split [24/3]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
7 <split [24/3]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [3 × 3]>
8 <split [25/2]> Fold08 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [2 × 3]>
9 <split [25/2]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [2 × 3]>
10 <split [25/2]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]> <tibble [2 × 3]>
I would like to visualise the best model (i.e. glm, bagged tree, random forest, boosted trees) by producing plots where the true value is on the x-axis and the predicted value is on the y-axis, as shown in the tutorial and plot figure below.
Tutorial
https://www.tmwr.org/performance.html
When I attempt to predict the fitted model on the test data using the function predict(), I keep on experiencing these error messages in attempt 1 and attempt 2:-
Error Message - Attempt 1
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "c('resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"
Error Message - Attempt 2
Error: `...` is not empty.
We detected these problematic arguments:
* `..1`
These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?
Issue
I am under the impression that I have to unnest the RMSE and RSQ metrics for all fitted models (i.e. glm, bagged trees, random forest, boosted trees) first before I can conduct model predictions with the fitted model on the test data in order to evaluate model effectiveness, or unnest the best model from the range of models examined during the 10 fold cross-validation from the function created for fitting the models.
If anyone is able to please help me fix the issue of predicting the test data on the fitted models using the function predict(), I would be deeply appreciative. I cannot visualise the RMSE and RSQ metrics in a single plot without binding the true and observed values into one data frame to plot using ggplot().
Many thanks in advance.
Plot figure
R-code
Attempt 1
##################################################
##Model Prediction
###################################################
##Open the tidymodels package
library(tidymodels)
library(tidyverse)
library(glmnet)
library(parsnip)
library(rpart)
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees
library(ranger)
library(yardstick)
library(purrr)
library(forcats)
###########################################################
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)
###########################################################
##Produce the recipe
rec <- recipe(Frequency ~ ., data = FID) %>%
step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median
step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables
###########################################################
##Create Models
###########################################################
##########################################################
##General Linear Models
#########################################################
##glm
mod_glm<-linear_reg(mode="regression",
penalty = 0.1,
mixture = 1) %>%
set_engine("glmnet")
##Create workflow
wflow_glm <- workflow() %>%
add_recipe(rec) %>%
add_model(mod_glm)
##Fit the model
plan(multisession)
fit_glm <- fit_resamples(
wflow_glm,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE,
extract = function(x) extract_model(x)))
##########################################################
##Bagged Trees
##########################################################
#####Bagged Trees
mod_bag <- bag_tree() %>%
set_mode("regression") %>%
set_engine("rpart", times = 10) #10 bootstrap resamples
##Create workflow
wflow_bag <- workflow() %>%
add_recipe(rec) %>%
add_model(mod_bag)
##Fit the model
plan(multisession)
fit_bag <- fit_resamples(
wflow_bag,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE,
extract = function(x) extract_model(x)))
###################################################
##Random forests
###################################################
mod_rf <-rand_forest(trees = 1e3) %>%
set_engine("ranger",
num.threads = parallel::detectCores(),
importance = "permutation",
verbose = TRUE) %>%
set_mode("regression")
##Create Workflow
wflow_rf <- workflow() %>%
add_model(mod_rf) %>%
add_recipe(rec)
##Fit the model
plan(multisession)
fit_rf<-fit_resamples(
wflow_rf,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE,
extract = function(x) extract_model(x)))
############################################################
##Boosted Trees
############################################################
mod_boost <- boost_tree() %>%
set_engine("xgboost", nthreads = parallel::detectCores()) %>%
set_mode("regression")
##Create Workflow
wflow_boost <- workflow() %>%
add_recipe(rec) %>%
add_model(mod_boost)
##Fit model
plan(multisession)
fit_boost <-fit_resamples(
wflow_boost,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE,
extract = function(x) extract_model(x)))
Model Predictions
###################################
##Model Prediction
####################################
##glm model
test_res <- predict(fit_glm, new_data = test_data %>% select(-Frequency))
##Error Message
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "c('resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"
##Predicted numeric outcome from the regression model is named .pred. Let’s match
#the predicted values with their corresponding observed outcome values:
bind_test_res <- bind_cols(test_res, test_data %>% select(Frequency))
#Note that both the predicted and observed outcomes are in log10 units.
#It is best practice to analyze the predictions on the transformed scale
#(if one were used) even if the predictions are reported using the original units.
Plot the data using ggplot():
ggplot(bind_test_res, aes(x = Frequency, y = .pred)) +
# Create a diagonal line:
geom_abline(lty = 2) +
geom_point(alpha = 0.5) +
labs(y = "Predicted Frequency (log10)", x = "Frequency (log10)") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred()
Attempt 2
##split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
##Produce the recipe
rec <- recipe(Frequency ~ ., data = FID) %>%
step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median
step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables
# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)
Run our models
# Extract our prepped training data
# and "bake" our testing data
prep<-prep(rec)
training_baked<-juice(prep)
testing_baked <- prep %>% bake(test_data)
##glm model
glm_model<-linear_reg(mode="regression",
penalty = 0.1,
mixture = 1) %>%
set_engine("glmnet")
##Create workflow
wflow_glm <- workflow() %>%
add_recipe(prep) %>%
add_model(glm_model)
##fit the model
fit_glm<- wflow_glm %>% fit(Frequency~Year+Month+Days, data=FID)
##Error Message
Error: `...` is not empty.
We detected these problematic arguments:
* `..1`
These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?
Data frame - FID
structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"), Frequency = c(36,
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9,
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27,
43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15,
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31,
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA,
-36L), class = "data.frame")
fit_glm
, then trying to predict onglm_fit
. Or am I misreading? – Hobotest_res <- predict(glm_fit, new_data = test_data %>% select(-Frequency))
when you should be doingtest_res <- predict(fit_glm, new_data = test_data %>% select(-Frequency))
(based on what you trained - notice the difference in the first argument) – Hobofit_resamples
, not justfit
. Have a look at Julia's answer here: community.rstudio.com/t/… (I haven't looked at the tutorial you're working through) – Hobo