To handle new and unseen labels in a Spark ML pipeline I want to use most-frequent imputation. The pipeline consists of the following steps:
- preprocessing
- learn the most frequent item
- StringIndexer for each categorical column
- VectorAssembler
- an estimator, e.g. a random forest
Assuming (1), (2,3) and (4,5) constitute separate pipelines (see the sketch right below), then:
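For concreteness, a minimal sketch of how I wire the three sub-pipelines up. The column names and the empty preprocessing/imputer stages are placeholders for my own code, not Spark APIs:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val categoricalColumns = Seq("colA", "colB") // placeholder column names

// (1) preprocessing: my own cleaning transformers go here
val pipeline1 = new Pipeline().setStages(Array.empty[PipelineStage])

// (2,3) the custom most-frequent imputer followed by one StringIndexer per column
val indexers: Array[PipelineStage] = categoricalColumns.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
}.toArray
val pipeline23 = new Pipeline().setStages(/* imputer stage +: */ indexers)

// (4,5) assemble the indexed columns and fit the estimator
val assembler = new VectorAssembler()
  .setInputCols(categoricalColumns.map(c => s"${c}_idx").toArray)
  .setOutputCol("features")
val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
val pipeline45 = new Pipeline().setStages(Array(assembler, rf))
```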
- I can fit and transform (1) on the train and test data. This means all NaN values were handled, i.e. imputed
- (2,3) as well as (4,5) will fit fine
- then I can use the following to replace new/unseen labels with null:

  ```scala
  import org.apache.spark.ml.feature.StringIndexerModel
  import org.apache.spark.sql.functions.{lit, when}

  // grab the fitted StringIndexerModels; note that pipeline23 here must be
  // the *fitted* PipelineModel, i.e. the result of calling .fit on (2,3)
  val fittedLabels = pipeline23.stages.collect { case a: StringIndexerModel => a }

  // for each categorical column keep only labels seen during fitting and set
  // everything else to null; assumes the indexers were added to the pipeline
  // in the same order as categoricalColumns
  val result = categoricalColumns.zipWithIndex.foldLeft(validationData) {
    case (currentDF, (colName, idx)) =>
      currentDF.withColumn(
        colName,
        when(currentDF(colName).isin(fittedLabels(idx).labels: _*), currentDF(colName))
          .otherwise(lit(null)))
  }.drop("replace")
  ```
- these deliberately introduced nulls are then imputed by the most-frequent imputer (sketched below)
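For completeness, the "learn most frequent item" step is nothing Spark provides for string columns out of the box (as far as I know `Imputer` only handles numeric columns), so here is a rough sketch of what I do; `fitMostFrequent` and `imputeMostFrequent` are my own helpers, not Spark APIs:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// learn the most frequent (non-null) value per categorical column
// on the training data
def fitMostFrequent(train: DataFrame, cols: Seq[String]): Map[String, String] =
  cols.map { c =>
    val mode = train
      .na.drop(Seq(c))          // ignore nulls when counting
      .groupBy(c).count()
      .orderBy(desc("count"))
      .first()
      .getString(0)
    c -> mode
  }.toMap

// fill nulls (including the deliberately introduced ones) with the learned modes
def imputeMostFrequent(df: DataFrame, modes: Map[String, String]): DataFrame =
  df.na.fill(modes)
```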
However, this setup is very ugly, as tools like CrossValidator no longer work (since I can't supply a single pipeline).
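To make the problem concrete: CrossValidator is built around a single estimator, so with three disjoint pipelines plus my manual null-replacement step in between there is nothing I can hand it:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// CrossValidator accepts exactly one Estimator; the manual step between
// the sub-pipelines cannot be expressed here
val cv = new CrossValidator()
  .setEstimator(pipeline45)   // would need to be ONE pipeline covering (1)-(5)
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setEstimatorParamMaps(new ParamGridBuilder().build())
  .setNumFolds(3)
```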
How can I access the fitted labels within the pipeline to build a custom Transformer which handles setting new values to null?
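What I imagine is something like the following skeleton; `UnseenToNull` is a name I made up, and the open question is exactly how to obtain `seenLabels` from the fitted StringIndexerModel while staying inside a single pipeline:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{lit, when}
import org.apache.spark.sql.types.StructType

// Sketch of a custom Transformer that nulls out labels not seen at fit time.
// seenLabels would have to come from the fitted StringIndexerModel somehow,
// which is the part I don't know how to wire up within one pipeline.
class UnseenToNull(override val uid: String, col: String, seenLabels: Seq[String])
    extends Transformer {

  def this(col: String, seenLabels: Seq[String]) =
    this(Identifiable.randomUID("unseenToNull"), col, seenLabels)

  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn(col,
      when(ds(col).isin(seenLabels: _*), ds(col)).otherwise(lit(null)))

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): UnseenToNull =
    new UnseenToNull(uid, col, seenLabels)
}
```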
Do you see a better approach for handling new values? I assume most-frequent imputation is OK, i.e. for a dataset with around 90 columns only very few columns will contain an unseen label.