I really like tidymodels, but I'm unclear how to fit a tidymodels workflow separately to each group of a nested, grouped data frame. As an example, the tidyr documentation nests mtcars by cylinder and then fits a separate linear regression model to each cylinder group. I'm trying to fit hundreds of separate models (likely random forests), one per group of a variable like cylinder, but using the tidymodels workflow (data split, recipe, predict).

Here's the simple nest-and-fit linear regression outlined on the tidyr page:

mtcars_nested <- mtcars %>%
  group_by(cyl) %>%
  nest()

mtcars_nested <- mtcars_nested %>%
  mutate(model = map(data, function(df) lm(mpg ~ wt, data = df)))
mtcars_nested

Is there a way to do something like the code below, but once per group defined by a group_by or nested column? The predictions and/or accuracy for each group would then need to be combined and stored in one data frame, if possible. I tried passing a nested data frame to the data split, and it didn't work. I feel like this is a purrr map question, but I'm unclear whether it's something tidymodels already supports:

library(tidymodels)
library(tidyverse)

#add dataset
mtcars <- mtcars

#create data splits
split <- initial_split(mtcars)
mtcars_train <- training(split)
mtcars_test <- testing(split)

#create recipe
mtcars_recipe <-
  recipe(mpg ~., data = mtcars_train) %>%
  step_normalize(all_predictors())

#define model
lm_mod <-
  linear_reg(mode = "regression") %>%
  set_engine("lm")

#create workflow that combines recipe & model
mtcars_workflow <-
  workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(mtcars_recipe)

#fit workflow on train data
mtcars_fit <-
  fit(mtcars_workflow, data = mtcars_train)

#predict on test data
predictions <-
  predict(mtcars_fit, mtcars_test)

I'd appreciate any help, advice, or direction.

1 Answer

You can definitely do this if you want! I would set up a function to do all the tidymodels fitting and predicting that you need, and then map() through your nested dataframes.

First, define anything you want to reuse outside your function (here, the model specification and the base workflow), and then create the function.

library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom     0.7.0      ✓ recipes   0.1.13
#> ✓ dials     0.0.8      ✓ rsample   0.0.7 
#> ✓ dplyr     1.0.0      ✓ tibble    3.0.3 
#> ✓ ggplot2   3.3.2      ✓ tidyr     1.1.0 
#> ✓ infer     0.5.3      ✓ tune      0.1.1 
#> ✓ modeldata 0.0.2      ✓ workflows 0.1.2 
#> ✓ parsnip   0.1.2      ✓ yardstick 0.0.7 
#> ✓ purrr     0.3.4
#> ── Conflicts ────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()

## some example data to use
data("hpc_data")

hpc_data <- hpc_data %>%
  select(-protocol, -class)

lm_mod <-
  linear_reg(mode = "regression") %>%
  set_engine("lm")

wf <-
  workflow() %>%
  add_model(lm_mod)

## big function of model fitting and predicting
predict_hpc <- function(df) {
  split <- initial_split(df)
  train_df <- training(split)
  test_df <- testing(split)
  
  #create recipe
  recipe_train <-
    recipe(compounds ~., data = train_df) %>%
    step_normalize(all_predictors())
  
  #fit workflow on train data
  fit_wf <-
    wf %>%
    add_recipe(recipe_train) %>%
    fit(data = train_df)
  
  #predict on test data
  predict(fit_wf, test_df) 
  
}

Now you can nest your data, and then map() over the nested data frames with your function. It's a good idea to wrap the function in an adverb like possibly(), so that a failure in one group returns a placeholder value instead of stopping the whole run.

hpc_nested <- hpc_data %>%
  group_by(day) %>%
  nest()

hpc_nested %>%
  mutate(predictions = map(data, possibly(predict_hpc, otherwise = NA)))
#> Timing stopped at: 0.001 0 0.001
#> # A tibble: 7 x 3
#> # Groups:   day [7]
#>   day   data               predictions       
#>   <fct> <list>             <list>            
#> 1 Tue   <tibble [900 × 5]> <tibble [225 × 1]>
#> 2 Thu   <tibble [720 × 5]> <tibble [180 × 1]>
#> 3 Fri   <tibble [923 × 5]> <tibble [230 × 1]>
#> 4 Wed   <tibble [903 × 5]> <tibble [225 × 1]>
#> 5 Mon   <tibble [692 × 5]> <tibble [173 × 1]>
#> 6 Sat   <tibble [32 × 5]>  <lgl [1]>         
#> 7 Sun   <tibble [161 × 5]> <tibble [40 × 1]>

Created on 2020-07-18 by the reprex package (v0.3.0)
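When a group comes back as a bare logical like the Sat row above, possibly() has already discarded the error message. One way to see what went wrong is purrr's safely(), which returns both a result and an error for every group. The following is only a sketch on mtcars: `fit_group()` is a hypothetical stand-in for a per-group fitting function, and its row-count check exists purely to force a failure in one group.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical per-group function (not from the answer above);
# it deliberately refuses groups with fewer than 10 rows
fit_group <- function(df) {
  if (nrow(df) < 10) stop("too few rows: ", nrow(df))
  lm(mpg ~ wt, data = df)
}

checked <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    # safely() returns list(result = ..., error = ...) for every group
    res = map(data, safely(fit_group)),
    model = map(res, "result"),
    # keep the error message as text, NA where the fit succeeded
    error = map_chr(res, function(r) {
      if (is.null(r$error)) NA_character_ else conditionMessage(r$error)
    })
  )

checked %>% select(cyl, error)
```

The `error` column then shows the message for each failed group next to its key, which is usually enough to decide whether to drop the group, pool it with others, or change the recipe.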

In this case the fit failed for Saturday (its predictions entry is a logical NA rather than a tibble), probably because there was so little data for that day to start with.
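The question also asks for the predictions and accuracy of every group in one data frame. A minimal sketch of one way to get there, under a few assumptions of mine: `predict_with_truth()` is a hypothetical variant of `predict_hpc()` that keeps the observed `compounds` value next to each prediction, failed groups return `NULL` instead of `NA` so that unnest() drops them, and RMSE stands in for whatever metric you actually care about.

```r
library(tidymodels)

data("hpc_data", package = "modeldata")
hpc_small <- hpc_data %>%
  select(-protocol, -class)

lm_mod <- linear_reg() %>%
  set_engine("lm")

# Hypothetical variant of predict_hpc(): returns test-set predictions
# together with the observed outcome so metrics can be computed later
predict_with_truth <- function(df) {
  split <- initial_split(df)
  train_df <- training(split)
  test_df <- testing(split)

  rec <- recipe(compounds ~ ., data = train_df) %>%
    step_normalize(all_predictors())

  fit_wf <- workflow() %>%
    add_model(lm_mod) %>%
    add_recipe(rec) %>%
    fit(data = train_df)

  predict(fit_wf, test_df) %>%
    bind_cols(select(test_df, compounds))
}

set.seed(123)  # the splits are random; the seed only makes reruns repeatable

# One row per test observation, labelled by day; groups that failed are dropped
all_preds <- hpc_small %>%
  group_by(day) %>%
  nest() %>%
  mutate(predictions = map(data, possibly(predict_with_truth, otherwise = NULL))) %>%
  select(day, predictions) %>%
  unnest(predictions) %>%
  ungroup()

# One row per day with that day's test-set RMSE
rmse_by_day <- all_preds %>%
  group_by(day) %>%
  rmse(truth = compounds, estimate = .pred)
```

Because yardstick metric functions respect dplyr groups, the same pattern should carry over to other metrics, or to a random forest specification in place of `lm_mod`.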