
I have a few questions regarding the use of mlr3pipelines. My goal is to create a pipeline that combines three graphs:

1 - A graph to process categorical variables: new-level imputation => encoding

imp_cat = po("imputenewlvl", param_vals = list(affect_columns = selector_name(my_cat_variables)))
encode  = po("encode",       param_vals = list(affect_columns = selector_name(my_cat_variables)))
cat = imp_cat %>>% encode

2 - A graph to process a subset of numeric variables: mean imputation => standardization

imp_mean = po("imputemean", param_vals = list(affect_columns = selector_name(my_first_set_of_numeric_variables)))
scale    = po("scale",      param_vals = list(affect_columns = selector_name(my_first_set_of_numeric_variables)))
num_mean = imp_mean %>>% scale

3 - A third graph to process another subset of numeric variables: median imputation => min-max scaling

imp_median = po("imputemedian", param_vals = list(affect_columns = selector_name(my_second_set_of_numeric_variables)))
min_max    = po("scalerange",   param_vals = list(affect_columns = selector_name(my_second_set_of_numeric_variables)))
num_median = imp_median %>>% min_max

4 - Combine these graphs with the featureunion op:

graph = po("copy", 3) %>>%
   gunion(list(cat, num_mean, num_median )) %>>%
   po("featureunion")

and finally add a learner and wrap everything in a GraphLearner:

g1 = GraphLearner$new(graph %>>% po(lrn("classif.ranger")))

I have some missing values in my data, hence the use of imputers in each graph, and I have a binary classification task.

my_task = TaskClassif$new(id="classif", backend = data, target = "my_target")

Theoretically, I shouldn't get missing-value errors when I start training:

g1$train(my_task)

but I get several errors depending on the learner I choose. If I use, for example, ranger as the learner, I get this error:

Error: Missing data in columns: ....

If I use svm, glmnet, or xgboost, I get a problem due to the existence of categorical variables: Error: ... has the following unsupported feature types: factor...

With my pipeline, I shouldn't have any categorical variables left and I shouldn't have missing values, so I do not see how to overcome this problem.

1 - I used an imputer in each graph; why do some algorithms tell me that there are still missing values?

2 - How do I remove the categorical variables once they are encoded? Some algorithms do not support this type of variable.

Update

I suspect the modifications made during the pipeline are not persisted. In other words, the algorithms (svm, ranger, ...) train on the original task, and not on the one updated by the pipeline.


1 Answer


Answer to the first question

I will try to explain why there are still missing values in your workflow.

Let's load a bunch of packages:

library(mlr3) 
library(mlr3pipelines)
library(mlr3learners)
library(mlr3tuning)
library(paradox)

Let's take the pima task, which has missing values:

task <- tsk("pima")
task$missings()
diabetes      age  glucose  insulin     mass pedigree pregnant pressure  triceps 
       0        0        5      374       11        0        0       35      227 

Since there are no categorical columns, I will convert triceps to one:

hb <- po("histbin",
         param_vals = list(affect_columns = selector_name("triceps")))
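
A quick check (this uses the PipeOp's list-based $train() interface and assumes histbin bins triceps into a factor) confirms the conversion:

hb$train(list(task))[[1]]$feature_types
# triceps should now be listed with type "factor"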

Now impute a new level and encode:

imp_cat <- po("imputenewlvl",
              param_vals = list(affect_columns = selector_name("triceps")))
encode <- po("encode",
             param_vals = list(affect_columns = selector_name("triceps")))

cat <- hb %>>% 
  imp_cat %>>%
  encode

When you use cat on the task:

cat$train(task)[[1]]$data()
#big output

Not just the columns you selected to transform are returned, but also all the others.
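
You can confirm this by looking at the missing-value counts of the output task (a quick check using the objects from above):

cat$train(task)[[1]]$missings()
# triceps was imputed, but glucose, insulin, mass and pressure
# still report their original missing values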

This also happens for num_mean and num_median.

Let's create them:

imp_mean <- po("imputemean", param_vals = list(affect_columns = selector_name(c("glucose", "mass"))))
scale <- po("scale", param_vals = list(affect_columns = selector_name(c("glucose", "mass"))))
num_mean <- imp_mean %>>% scale


imp_median <- po("imputemedian", param_vals = list(affect_columns = selector_name(c("insulin", "pressure"))))
min_max <- po("scalerange", param_vals = list(affect_columns = selector_name(c("insulin", "pressure"))))
num_median <- imp_median %>>% min_max

Check what num_median does:

num_median$train(task)[[1]]$data()
#output
     diabetes    insulin  pressure age glucose mass pedigree pregnant triceps
  1:      pos 0.13341346 0.4897959  50     148 33.6    0.627        6      35
  2:      neg 0.13341346 0.4285714  31      85 26.6    0.351        1      29
  3:      pos 0.13341346 0.4081633  32     183 23.3    0.672        8      NA
  4:      neg 0.09615385 0.4285714  21      89 28.1    0.167        1      23
  5:      pos 0.18509615 0.1632653  33     137 43.1    2.288        0      35
 ---                                                                         
764:      neg 0.19951923 0.5306122  63     101 32.9    0.171       10      48
765:      neg 0.13341346 0.4693878  27     122 36.8    0.340        2      27
766:      neg 0.11778846 0.4897959  30     121 26.2    0.245        5      23
767:      pos 0.13341346 0.3673469  47     126 30.1    0.349        1      NA
768:      neg 0.13341346 0.4693878  23      93 30.4    0.315        1      31

So it did what it was supposed to on the "insulin" and "pressure" columns, but it also returned the rest unchanged.

By copying the data three times and applying one of these three preprocessing branches to each copy, you return the transformed columns but also all the rest, three times. After the featureunion, the untransformed copies, which still contain missing values and unencoded factors, end up in the resulting task, and that is exactly what the learners complain about.
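
You can see the overlap directly (a sketch using the branches from above; Graph$train() distributes a single task to every input channel by default):

branches <- gunion(list(cat, num_mean, num_median))$train(task)
# each branch returns the full feature set, not just its selected columns,
# so the three outputs overlap in all untransformed columns
lapply(branches, function(t) t$feature_names)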

What you should do is:

graph <- cat %>>%
  num_mean %>>%
  num_median

cat transforms the selected columns and returns all of them, then num_mean transforms its selected columns and again returns everything, and so on.

graph$train(task)[[1]]$data()

looks good to me
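
In particular, no missing values should remain, since every column with NAs is covered by one of the imputers:

graph$train(task)[[1]]$missings()
# all counts should now be 0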

And more importantly

g1 <- GraphLearner$new(graph %>>% po(lrn("classif.ranger")))
g1$train(task)

works

Answer to the second question

Use selector functions, specifically in your case selector_type():

selector_invert(selector_type("factor"))

should do the trick if called prior to piping into the learner.
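
For example (a sketch, assuming the sequential graph from above and the e1071-backed classif.svm from mlr3learners; po("select") applies the selector and drops everything it does not match):

drop_factors <- po("select",
                   param_vals = list(selector = selector_invert(selector_type("factor"))))

g2 <- GraphLearner$new(graph %>>% drop_factors %>>% po(lrn("classif.svm")))
g2$train(task)
# the learner should no longer complain about unsupported factor features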