My goal is to fit a Poisson glmnet using the tidymodels packages. For this purpose, I use recipes to preprocess the data, parsnip to fit the model, workflows to bundle the model with the preprocessor, and poissonreg to make Poisson regression available in parsnip. This works fine if my training dataset contains only numeric predictors, but I'm not able to fit the model when there are factor (categorical) predictors. In the code below, you may think that using tidymodels is overkill. It is for this minimal example, but eventually I will want to tune my hyperparameters, validate my models, etc., and then tidymodels will be useful.
First, let's load the packages we need.
library(tibble)
library(recipes)
library(poissonreg)
library(parsnip)
library(workflows)
library(glmnet)
Let's also simulate a dataset with 1000 rows, 1 outcome (y), 1 categorical predictor with 2 levels (x_fac) and 3 numeric predictors (x_num_01, x_num_02 and x_num_03).
n <- 1000
dat <- tibble::tibble(
  y = rpois(n, lambda = 0.15),
  x_fac = factor(sample(c("M", "F"), size = n, replace = TRUE)),
  x_num_01 = rnorm(n),
  x_num_02 = rnorm(n),
  x_num_03 = rnorm(n)
)
Then, we define and prepare the recipe. The preprocessing is very simple: any categorical predictors are converted to dummy variables.
rec <-
  recipes::recipe(y ~ ., data = dat) %>%
  recipes::step_dummy(all_nominal()) %>%
  recipes::prep()
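To see what this recipe actually hands to the model, the prepped recipe can be applied back to the training data. The following is only a minimal, self-contained sketch: it rebuilds the simulated data with a seed (the seed value is an arbitrary choice, not part of the original setup) and assumes that recipes::bake() accepts the training data through new_data as shown.

```r
library(recipes)  # also attaches dplyr, which provides %>%

set.seed(123)  # assumption: arbitrary seed, used only for reproducibility
n <- 1000
dat <- tibble::tibble(
  y = rpois(n, lambda = 0.15),
  x_fac = factor(sample(c("M", "F"), size = n, replace = TRUE)),
  x_num_01 = rnorm(n),
  x_num_02 = rnorm(n),
  x_num_03 = rnorm(n)
)

# Same recipe as above: dummy-code every nominal column, then estimate it.
rec <-
  recipes::recipe(y ~ ., data = dat) %>%
  recipes::step_dummy(recipes::all_nominal()) %>%
  recipes::prep()

# Apply the prepped recipe to the training data and inspect the result.
baked <- recipes::bake(rec, new_data = dat)
str(baked)
```

If every column of baked is numeric and x_fac has been replaced by x_fac_M, the preprocessing step on its own is doing what I expect.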
Then we define our model,
glmnet_mod <-
  poissonreg::poisson_reg(penalty = 0.01, mixture = 1) %>%
  parsnip::set_engine("glmnet")
bundle the model and the preprocessor together with the workflows package,
glmnet_wf <-
  workflows::workflow() %>%
  workflows::add_recipe(rec) %>%
  workflows::add_model(glmnet_mod)
and finally, we train the model with parsnip:
glmnet_fit <-
  glmnet_wf %>%
  parsnip::fit(data = dat)
This parsnip::fit call throws the following error:
Error in fishnet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
NA/NaN/Inf in foreign function call (arg 4)
In addition: Warning message:
In fishnet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
NAs introduced by coercion
Timing stopped at: 0.005 0 0.006
and I have absolutely no idea why! If I remove the predictor x_fac from the simulated dataset dat, it works fine. It also works if I preprocess the data myself before fitting a glmnet with the glmnet package:
x <- dat %>%
  dplyr::mutate(x_fac_M = x_fac == "M") %>%
  dplyr::select(contains("x"), -x_fac) %>%
  as.matrix()
y <- dat$y
glmnet::glmnet(x = x, y = y, family = "poisson", lambda = 0.01, alpha = 1)
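For comparison, the same kind of dummy coding can also be produced with base R's model.matrix(), which avoids hand-writing one column per factor level. This is a minimal sketch under a few assumptions of my own: a plain data.frame stands in for dat so that the block needs no extra packages, and the seed is arbitrary.

```r
set.seed(123)  # assumption: arbitrary seed, used only for reproducibility
n <- 1000
dat <- data.frame(
  y = rpois(n, lambda = 0.15),
  x_fac = factor(sample(c("M", "F"), size = n, replace = TRUE)),
  x_num_01 = rnorm(n),
  x_num_02 = rnorm(n),
  x_num_03 = rnorm(n)
)

# model.matrix() expands factors into treatment-coded dummy columns.
# The first column is the intercept; drop it because glmnet fits its own.
x <- stats::model.matrix(y ~ ., data = dat)[, -1]
y <- dat$y

colnames(x)
# "x_facM" "x_num_01" "x_num_02" "x_num_03"
```

The resulting x and y could then be passed to glmnet::glmnet() in the same way as in the call above.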
Thanks for your help!
Session info:
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] workflows_0.1.1 poissonreg_0.0.1 parsnip_0.1.0 recipes_0.1.12
[5] dplyr_0.8.5 tibble_3.0.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 pillar_1.4.4 compiler_4.0.0 gower_0.2.1
[5] iterators_1.0.12 class_7.3-16 tools_4.0.0 rpart_4.1-15
[9] ipred_0.9-9 packrat_0.5.0 lubridate_1.7.8 lifecycle_0.2.0
[13] lattice_0.20-41 pkgconfig_2.0.3 rlang_0.4.6 foreach_1.5.0
[17] Matrix_1.2-18 cli_2.0.2 rstudioapi_0.11 prodlim_2019.11.13
[21] withr_2.2.0 generics_0.0.2 vctrs_0.2.4 glmnet_3.0-2
[25] grid_4.0.0 nnet_7.3-13 tidyselect_1.0.0 glue_1.4.0
[29] R6_2.4.1 fansi_0.4.1 survival_3.1-12 lava_1.6.7
[33] purrr_0.3.4 tidyr_1.0.2 magrittr_1.5 codetools_0.2-16
[37] ellipsis_0.3.0 MASS_7.3-51.5 splines_4.0.0 hardhat_0.1.2
[41] assertthat_0.2.1 shape_1.4.4 timeDate_3043.102 utf8_1.1.4
[45] crayon_1.3.4