Is it possible to do that without StringIndexing the whole data again including the new data?
Yes, it is possible. You just have to use an indexer fitted on the training data. If you use ML pipelines, this is handled for you; otherwise, use the fitted StringIndexerModel directly:
import org.apache.spark.ml.feature.StringIndexer
val train = sc.parallelize(Seq((1, "a"), (2, "a"), (3, "b"))).toDF("x", "y")
val test = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "b"))).toDF("x", "y")
val indexer = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("y_index")
  .fit(train)
indexer.transform(train).show
// +---+---+-------+
// |  x|  y|y_index|
// +---+---+-------+
// |  1|  a|    0.0|
// |  2|  a|    0.0|
// |  3|  b|    1.0|
// +---+---+-------+
indexer.transform(test).show
// +---+---+-------+
// |  x|  y|y_index|
// +---+---+-------+
// |  1|  a|    0.0|
// |  2|  b|    1.0|
// |  3|  b|    1.0|
// +---+---+-------+
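For the pipeline case, here is a minimal sketch of the same idea. It assumes Spark 2.x or later and a locally created `SparkSession` (in spark-shell `spark` already exists); the app name and the variable names `pipeline`, `model` and `testIndexed` are only illustrative. The label-to-index mapping is learned once in `fit` and then reused by `transform`:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this example.
val spark = SparkSession.builder().master("local[*]").appName("indexer-pipeline").getOrCreate()
import spark.implicits._

val train = Seq((1, "a"), (2, "a"), (3, "b")).toDF("x", "y")
val test = Seq((1, "a"), (2, "b"), (3, "b")).toDF("x", "y")

val pipeline = new Pipeline().setStages(Array(
  new StringIndexer().setInputCol("y").setOutputCol("y_index"),
  new VectorAssembler().setInputCols(Array("x", "y_index")).setOutputCol("features")
))

// fit() learns the label-to-index mapping from the training data only.
val model = pipeline.fit(train)

// transform() reuses that mapping; nothing is re-fitted on the test data.
val testIndexed = model.transform(test)
testIndexed.select("x", "y", "y_index").show()
```

The fitted `PipelineModel` holds the fitted `StringIndexerModel` as one of its stages, so applying it to new data never re-indexes the training set.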
One possible caveat is that, by default, the fitted indexer fails on labels it did not see during fitting, so you have to drop those rows before transforming.
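Depending on your Spark version, you may not have to filter by hand. As a sketch, assuming Spark 2.x or later (the `handleInvalid` parameter was added to `StringIndexer` in 1.6), the indexer can be told to skip rows with unseen labels. The `newData` frame and the variable names are made up for illustration:

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this example.
val spark = SparkSession.builder().master("local[*]").appName("indexer-unseen").getOrCreate()
import spark.implicits._

val train = Seq((1, "a"), (2, "a"), (3, "b")).toDF("x", "y")
// The label "c" never appears in the training data.
val newData = Seq((1, "a"), (2, "c")).toDF("x", "y")

val skipping = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("y_index")
  .setHandleInvalid("skip") // drop rows with unseen labels instead of failing
  .fit(train)

// Only the row with the known label "a" survives.
val kept = skipping.transform(newData)
kept.show()
```

Newer releases (2.2 and later) also accept `"keep"`, which puts unseen labels into one extra index instead of dropping the rows.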
How does the RandomForestModel know which columns are categorical?
Different ML transformers add special metadata to the transformed columns that indicates the type of the column, the number of classes, etc.
import org.apache.spark.ml.attribute._
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y_index"))
  .setOutputCol("features")
val transformed = assembler.transform(indexer.transform(train))
val meta = AttributeGroup.fromStructField(transformed.schema("features"))
meta.attributes.get
// Array[org.apache.spark.ml.attribute.Attribute] = Array(
// {"type":"numeric","idx":0,"name":"x"},
// {"vals":["a","b"],"type":"nominal","idx":1,"name":"y_index"})
or
transformed.select($"features").schema.fields.last.metadata
// {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"x"}],
// "nominal":[{"vals":["a","b"],"idx":1,"name":"y_index"}]},"num_attrs":2}}
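To make the metadata more concrete, the sketch below turns it into a map from feature index to number of categories, which is roughly the information a tree-based learner needs. It assumes Spark 2.x or later and a local `SparkSession`; the name `categorical` is hypothetical, and this illustrates reading the attributes rather than reproducing Spark's internal code:

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NominalAttribute}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this example.
val spark = SparkSession.builder().master("local[*]").appName("feature-metadata").getOrCreate()
import spark.implicits._

val train = Seq((1, "a"), (2, "a"), (3, "b")).toDF("x", "y")

val indexed = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("y_index")
  .fit(train)
  .transform(train)

val assembled = new VectorAssembler()
  .setInputCols(Array("x", "y_index"))
  .setOutputCol("features")
  .transform(indexed)

val group = AttributeGroup.fromStructField(assembled.schema("features"))

// Keep only nominal attributes and record how many distinct values each has.
val categorical: Map[Int, Int] = group.attributes
  .getOrElse(Array.empty[Attribute])
  .zipWithIndex
  .collect {
    case (attr: NominalAttribute, idx) if attr.getNumValues.isDefined =>
      idx -> attr.getNumValues.get
  }
  .toMap

println(categorical)
```

Here feature 0 (`x`) is numeric and is left out of the map, while feature 1 (`y_index`) is nominal with the two values `a` and `b`.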