Context: I have a data frame where all categorical values have been indexed using StringIndexer.
val categoricalColumns = df.schema.collect { case StructField(name, StringType, nullable, meta) => name }
val categoryIndexers = categoricalColumns.map {
col => new StringIndexer().setInputCol(col).setOutputCol(s"${col}Indexed")
}
Then I used VectorAssembler to vectorize all feature columns (including the indexed categorical ones).
val assembler = new VectorAssembler()
.setInputCols(dfIndexed.columns.diff(List("label") ++ categoricalColumns))
.setOutputCol("features")
After applying the classifier and a few additional steps I end up with a data frame that has label, features, and prediction. I would like expand my features vector to separate columns in order to convert the indexed values back to their original String form.
val categoryConverters = categoricalColumns.zip(categoryIndexers).map {
colAndIndexer => new IndexToString().setInputCol(s"${colAndIndexer._1}Indexed").setOutputCol(colAndIndexer._1).setLabels(colAndIndexer._2.fit(df).labels)
}
Question: Is there a simple way of doing this, or is the best approach to somehow attach the prediction column to the test data frame?
What I have tried:
val featureSlicers = categoricalColumns.map {
col => new VectorSlicer().setInputCol("features").setOutputCol(s"${col}Indexed").setNames(Array(s"${col}Indexed"))
}
Applying this gives me the columns that I want, but they are in Vector form (as it is meant to do) and not type Double.
Edit: The desired output is the original data frame (i.e. categorical features as String not index) with an additional column indicating the predicted label (which in my case is 0 or 1).
For example, say the output of my classifier looked something like this:
+-----+---------+----------+
|label| features|prediction|
+-----+---------+----------+
| 1.0|[0.0,3.0]| 1.0|
+-----+---------+----------+
By applying VectorSlicer on each feature I would get:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
| 1.0|[0.0,3.0]| 1.0| [0.0]| [3.0]|
+-----+---------+----------+-------------+-------------+
Which is great, but I need:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
| 1.0|[0.0,3.0]| 1.0| 0.0 | 3.0 |
+-----+---------+----------+-------------+-------------+
To then be able to use IndexToString and convert it to:
+-----+---------+----------+-------------+-------------+
|label| features|prediction| status | artist |
+-----+---------+----------+-------------+-------------+
| 1.0|[0.0,3.0]| 1.0| good | Pink Floyd |
+-----+---------+----------+-------------+-------------+
or even:
+-----+----------+-------------+-------------+
|label|prediction| status | artist |
+-----+----------+-------------+-------------+
| 1.0| 1.0| good | Pink Floyd |
+-----+----------+-------------+-------------+
StringIndexer
andOneHotEncoder
. The problem comes when I try to realize what represents every feature of the combined vector. – Alberto Bonsanto