4
votes

I want to perform penalty selection for the LASSO algorithm and predict outcomes using tidymodels. I will use the Boston housing dataset to illustrate the problem.

library(tidymodels)
library(tidyverse)
library(mlbench)

data("BostonHousing")

dt <- BostonHousing

I first split the dataset into train/test subsets.

dt_split <- initial_split(dt)
dt_train <- training(dt_split)
dt_test <- testing(dt_split)

Defining preprocessing using the recipe package.

rec <- recipe(medv ~ ., data = dt_train) %>%
  step_center(all_predictors(), -all_nominal()) %>% 
  step_dummy(all_nominal()) %>% 
  prep()

Initialization of the model and the workflow. I use the glmnet engine. mixture = 1 means that I choose the LASSO penalty and penalty = tune() means that I will later use cross-validation to choose the best penalty parameter lambda.

lasso_mod <- linear_reg(mode = "regression",
                        penalty = tune(),
                        mixture = 1) %>% 
  set_engine("glmnet")

wf <- workflow() %>%
  add_model(lasso_mod) %>%
  add_recipe(rec)

Preparing stratified 5-fold cross-validation and the penalty grid:

folds <- rsample::vfold_cv(dt_train, v = 5, strata = medv, nbreaks = 5)
my_grid <- tibble(penalty = 10^seq(-2, -1, length.out = 10))

Let us run the cross-validation:

my_res <- wf %>% 
  tune_grid(resamples = folds,
            grid = my_grid,
            control = control_grid(verbose = FALSE, save_pred = TRUE),
            metrics = metric_set(rmse))

I'm now able to get the best penalty from the grid and update my workflow for this optimal penalty:

best_mod <- my_res %>% select_best("rmse")
print(best_mod)

final_wf <- finalize_workflow(wf, best_mod)
print(final_wf)

== Workflow ===================================================================================================================
Preprocessor: Recipe
Model: linear_reg()

-- Preprocessor ---------------------------------------------------------------------------------------------------------------
2 Recipe Steps

* step_center()
* step_dummy()

-- Model ----------------------------------------------------------------------------------------------------------------------
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.0278255940220712
  mixture = 1

Computational engine: glmnet 

So far so good. Now I want to apply the workflow to the training data to get my final model:

final_mod <- fit(final_wf, data = dt_train) %>%
  pull_workflow_fit()

Now here is the issue.

final_mod$fit is a elnet and glmnet object. It contains the full regularization path on a grid of 75 values of the penalty parameter. Hence the previous penalty tuning step is pretty much useless. So the prediction step fails:

predict(final_mod, new_data = dt) returns an error:

Error in cbind2(1, newx) %*% nbeta : 
  invalid class 'NA' to dup_mMatrix_as_dgeMatrix

Of course I could use glmnet::cv.glmnet to get the best penalty and then use the method predict.cv.glmnet but I need a generic workflow able to work with multiple machine learning models using the same interface. In the documentation of parsnip::linear_reg there is this note regarding the glmnet engine:

For glmnet models, the full regularization path is always fit regardless of the value given to penalty. Also, there is the option to pass multiple values (or no values) to the penalty argument. When using the predict() method in these cases, the return value depends on the value of penalty. When using predict(), only a single value of the penalty can be used. When predicting on multiple penalties, the multi_predict() function can be used. It returns a tibble with a list column called .pred that contains a tibble with all of the penalty results.

However, I do not understand how I should proceed to get the predictions of the tuned LASSO model using the tidymodels framework. The multi_predict function throws the same error as predict.

1
Please update your question to include the exact error returned by predict(final_mod, new_data = dt).desertnaut

1 Answers

4
votes

You are really close here to having everything working right.

Let's read in the data, split it into training/testing and create resampling folds.

library(tidymodels)
library(tidyverse)
library(mlbench)

data("BostonHousing")

dt <- BostonHousing

dt_split <- initial_split(dt)
dt_train <- training(dt_split)
dt_test <- testing(dt_split)
folds <- vfold_cv(dt_train, v = 5, strata = medv, nbreaks = 5)

Now let's create a preprocessing recipe. (Notice that you don't need to prep() it if you are using a workflow(); that can get slow if your data is big so might as well not do it until the workflow() takes care of it for you later.)

rec <- recipe(medv ~ ., data = dt_train) %>%
    step_center(all_predictors(), -all_nominal()) %>% 
    step_dummy(all_nominal())

Now let's make our model, put it together with our recipe in a workflow(), and tune the workflow with a grid.

lasso_mod <- linear_reg(mode = "regression",
                        penalty = tune(),
                        mixture = 1) %>% 
    set_engine("glmnet")

wf <- workflow() %>%
    add_model(lasso_mod) %>%
    add_recipe(rec)

my_grid <- tibble(penalty = 10^seq(-2, -1, length.out = 10))

my_res <- wf %>% 
    tune_grid(resamples = folds,
              grid = my_grid,
              control = control_grid(verbose = FALSE, save_pred = TRUE),
              metrics = metric_set(rmse))

This is the best penalty we got:

best_mod <- my_res %>% select_best("rmse")
best_mod
#> # A tibble: 1 x 2
#>   penalty .config              
#>     <dbl> <chr>                
#> 1  0.0215 Preprocessor1_Model04

Here is where we do things a little bit differently than what you did. I am going to finalize my workflow with the best penalty and then fit that finalized workflow to the training data. The output here is a fitted workflow. I do not want to pull the underlying model out of this, because the model needs the preprocessing to work correctly; it was trained expecting the preprocessing to happen.

Instead, I can just predict() straight on that trained workflow:

final_fitted <- finalize_workflow(wf, best_mod) %>%
    fit(data = dt_train)

predict(final_fitted, dt_train)
#> # A tibble: 379 x 1
#>    .pred
#>    <dbl>
#>  1  18.5
#>  2  24.2
#>  3  23.3
#>  4  21.6
#>  5  37.6
#>  6  21.5
#>  7  16.7
#>  8  15.6
#>  9  21.3
#> 10  21.3
#> # … with 369 more rows
predict(final_fitted, dt_test)
#> # A tibble: 127 x 1
#>    .pred
#>    <dbl>
#>  1  30.2
#>  2  25.1
#>  3  19.6
#>  4  17.0
#>  5  13.9
#>  6  15.4
#>  7  13.7
#>  8  20.8
#>  9  31.1
#> 10  21.3
#> # … with 117 more rows

Created on 2021-03-16 by the reprex package (v1.0.0)

If you tune a workflow, then you generally want to finalize, fit, and predict a workflow. Exceptions might be if you use a really simple preprocessor in your workflow like a formula that you can pass to fit(); I show an example that you could do that with here.