
Goal

I'd like to implement a LASSO model and check its viability on a training set according to the schematic shown here. (Schematic description: all data is split into testing and training sets. The training set is split via 5-fold cross-validation (CV) into resamples, and 10-fold CV is performed on each resample to find the optimal lambda.) The testing set is not available yet.

I'd like to fit a LASSO model and check its performance using nested CV: an inner CV (analysis and assessment sets) to obtain the optimal lambda via a grid search, and an outer CV to compare resamples 1, 2, 3, etc.

Caret with 'repeatedcv'

The trainControl() function from caret, with method 'repeatedcv', lets you specify number and repeats.

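library(caret)

# Grid of 20 candidate lambdas, log-spaced between 1e-3 and 1e-1
lambdas = 10^seq(-3, -1, length.out = 20)

# 10-fold CV, repeated 5 times
trControl = trainControl(
  method = 'repeatedcv',
  number = 10,
  repeats = 5,
  search = 'grid'
)

# alpha = 1 selects the LASSO penalty in glmnet
tuneGrid = expand.grid(alpha = 1, lambda = lambdas)

lasso = train(
  PD ~ ., data = selection,
  method = 'glmnet',
  trControl = trControl,
  tuneGrid = tuneGrid
)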

lasso$results

With the code above, lasso$results is a data frame with 20 rows, presumably one row for each point on the defined grid. However, I'd like caret to find one optimal lambda per grid search using 10-fold CV (number = 10) and then compare the optimal lambdas, since this process is performed multiple times (repeats = 5).
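
For context: with 'repeatedcv', caret averages the performance of each grid point over all 50 resamples and selects a single lambda (available as lasso$bestTune), so it does not report one lambda per repeat. One possible way to get one lambda per outer resample with caret alone is to write the outer loop by hand. The following is only a sketch under assumptions: it uses mtcars with mpg as a stand-in for selection and PD, and the names outerFolds, innerControl and nested are made up for illustration.

```r
library(caret)

set.seed(42)

lambdas  <- 10^seq(-3, -1, length.out = 20)
tuneGrid <- expand.grid(alpha = 1, lambda = lambdas)

# Inner resampling: 10-fold CV used only to pick lambda
innerControl <- trainControl(method = "cv", number = 10)

# Outer resampling: 5 folds; each list element holds the held-out row indices
outerFolds <- createFolds(mtcars$mpg, k = 5)

nested <- lapply(outerFolds, function(holdout) {
  analysisSet   <- mtcars[-holdout, ]
  assessmentSet <- mtcars[holdout, ]

  # Grid search over lambda using the inner 10-fold CV
  fit <- train(
    mpg ~ ., data = analysisSet,
    method = "glmnet",
    trControl = innerControl,
    tuneGrid = tuneGrid
  )

  # Evaluate the tuned model on the outer held-out fold
  pred <- predict(fit, newdata = assessmentSet)
  data.frame(
    lambda = fit$bestTune$lambda,
    RMSE   = RMSE(pred, assessmentSet$mpg)
  )
})

# One row per outer fold: the selected lambda and its held-out RMSE
nested <- do.call(rbind, nested)
nested
```

With a data set as small as mtcars, caret may warn about missing values in the resampled performance measures; the sketch is meant to show the structure of the loop, not to give meaningful estimates.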

1 Answer

You can implement nested resampling in tidymodels using nested_cv():

library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.6      ✓ recipes   0.1.12
#> ✓ dials     0.0.7      ✓ rsample   0.0.7 
#> ✓ dplyr     1.0.0      ✓ tibble    3.0.1 
#> ✓ ggplot2   3.3.1      ✓ tune      0.1.0 
#> ✓ infer     0.5.1      ✓ workflows 0.1.1 
#> ✓ parsnip   0.1.1      ✓ yardstick 0.0.6 
#> ✓ purrr     0.3.4
#> ── Conflicts ──────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()

results <- nested_cv(mtcars, 
                     outside = vfold_cv(repeats = 5), 
                     inside = bootstraps(times = 25))
results
#> # Nested resampling:
#> #  outer: 10-fold cross-validation repeated 5 times
#> #  inner: Bootstrap sampling
#> # A tibble: 50 x 4
#>    splits         id      id2    inner_resamples  
#>    <list>         <chr>   <chr>  <list>           
#>  1 <split [28/4]> Repeat1 Fold01 <tibble [25 × 2]>
#>  2 <split [28/4]> Repeat1 Fold02 <tibble [25 × 2]>
#>  3 <split [29/3]> Repeat1 Fold03 <tibble [25 × 2]>
#>  4 <split [29/3]> Repeat1 Fold04 <tibble [25 × 2]>
#>  5 <split [29/3]> Repeat1 Fold05 <tibble [25 × 2]>
#>  6 <split [29/3]> Repeat1 Fold06 <tibble [25 × 2]>
#>  7 <split [29/3]> Repeat1 Fold07 <tibble [25 × 2]>
#>  8 <split [29/3]> Repeat1 Fold08 <tibble [25 × 2]>
#>  9 <split [29/3]> Repeat1 Fold09 <tibble [25 × 2]>
#> 10 <split [29/3]> Repeat1 Fold10 <tibble [25 × 2]>
#> # … with 40 more rows

Created on 2020-06-11 by the reprex package (v0.3.0.9001)

For an outline of how to train and assess models on nested resamples like this, check out this article on tidymodels.org.
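
As a rough illustration of that workflow applied to the LASSO setup in the question, here is a minimal sketch that tunes lambda on the inner resamples and assesses on the outer ones. It is not the code from the article. It assumes mtcars with mpg as a stand-in outcome, 5-fold outer and 10-fold inner CV as in the question's schematic, and calls glmnet directly. The helper names rmse_by_lambda and best_lambda_index are hypothetical.

```r
library(rsample)
library(purrr)
library(glmnet)

set.seed(123)

lambdas <- 10^seq(-3, -1, length.out = 20)

# Outer 5-fold CV; each outer analysis set gets its own inner 10-fold CV
folds <- nested_cv(mtcars,
                   outside = vfold_cv(v = 5),
                   inside  = vfold_cv(v = 10))

# Fit a LASSO path on the analysis set of a split and return the RMSE
# on its assessment set for every lambda in the grid
rmse_by_lambda <- function(split) {
  fit_data  <- analysis(split)
  eval_data <- assessment(split)
  x_fit  <- as.matrix(fit_data[, names(fit_data) != "mpg"])
  x_eval <- as.matrix(eval_data[, names(eval_data) != "mpg"])
  fit  <- glmnet(x_fit, fit_data$mpg, alpha = 1, lambda = lambdas)
  pred <- predict(fit, newx = x_eval, s = lambdas)
  sqrt(colMeans((pred - eval_data$mpg)^2))
}

# Average the inner-fold RMSE curves and return the index of the best lambda
best_lambda_index <- function(inner) {
  rmse      <- map(inner$splits, rmse_by_lambda)
  mean_rmse <- Reduce(`+`, rmse) / length(rmse)
  unname(which.min(mean_rmse))
}

best_idx <- map_int(folds$inner_resamples, best_lambda_index)

# Refit on each outer analysis set and score the chosen lambda
# on the corresponding outer assessment set
outer_rmse <- map2_dbl(folds$splits, best_idx,
                       function(split, i) rmse_by_lambda(split)[[i]])

nested <- data.frame(id     = folds$id,
                     lambda = lambdas[best_idx],
                     RMSE   = outer_rmse)
nested
```

The resulting data frame has one row per outer fold, so the selected lambdas and their held-out errors can be compared across resamples. For a classification outcome, the RMSE would need to be swapped for a suitable metric and the glmnet family changed accordingly.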