Tagging columns as Categorical in Spark

Question

I am currently using StringIndexer to convert a lot of columns into unique integers for classification with RandomForestModel. I am also using a pipeline for the ML process.

My questions are:

How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add some sort of metadata to indicate that a column is categorical? In mllib.tree.RF there was a parameter called categoricalFeaturesInfo which indicated which columns are categorical. How does ml.tree.RF know, since that parameter is not present?
Also, StringIndexer maps categories to integers based on frequency of occurrence. When new data comes in, how do I make sure that it is encoded consistently with the training data? Is it possible to do that without StringIndexing the whole data again, including the new data?

I am quite confused about how to implement this.

zero323 · Accepted Answer · 2015-12-03T16:53:12

Is it possible to do that without StringIndexing the whole data again, including the new data?

Yes, it is possible. You just have to use an indexer fitted on the training data. If you use ML pipelines, this is handled for you; otherwise, use the fitted StringIndexerModel directly:

import org.apache.spark.ml.feature.StringIndexer

val train = sc.parallelize(Seq((1, "a"), (2, "a"), (3, "b"))).toDF("x", "y")
val test  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "b"))).toDF("x", "y")

val indexer = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("y_index")
  .fit(train)

indexer.transform(train).show

// +---+---+-------+
// |  x|  y|y_index|
// +---+---+-------+
// |  1|  a|    0.0|
// |  2|  a|    0.0|
// |  3|  b|    1.0|
// +---+---+-------+

indexer.transform(test).show

// +---+---+-------+
// |  x|  y|y_index|
// +---+---+-------+
// |  1|  a|    0.0|
// |  2|  b|    1.0|
// |  3|  b|    1.0|
// +---+---+-------+
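For the pipeline case, a minimal sketch might look like the following. It assumes Spark 2.x or later (SparkSession as the entry point); the toy data, the label column, and the number of trees are made up for illustration. The point is only that the indexer inside the pipeline is fitted once, on the training data, and the resulting PipelineModel reuses that mapping on anything it transforms later. The VectorAssembler stage packs the columns into the single feature vector the classifier expects.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+, running locally.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pipeline-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy data: x is numeric, y is a string category, label is the target.
val train = Seq((1.0, "a", 0.0), (2.0, "a", 0.0), (3.0, "b", 1.0), (4.0, "b", 1.0))
  .toDF("x", "y", "label")
val test = Seq((1.5, "a"), (3.5, "b")).toDF("x", "y")

// Unfitted stages; Pipeline.fit fits the indexer on `train` only.
val indexer = new StringIndexer().setInputCol("y").setOutputCol("y_index")
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y_index"))
  .setOutputCol("features")
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(5)

val model = new Pipeline().setStages(Array(indexer, assembler, rf)).fit(train)

// The fitted pipeline applies the training-time label mapping to the new data.
model.transform(test).select("x", "y", "y_index", "prediction").show()
```

No re-indexing of the combined data is needed: `model` carries the fitted StringIndexerModel with it.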

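One possible caveat is that a fitted StringIndexerModel doesn't handle unseen labels gracefully (by default, transform fails on a label that was not present when the indexer was fitted), so you have to drop these before transforming.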
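One way to drop unseen labels is to filter on the labels the fitted model already knows. This is a sketch rather than the only option; it assumes Spark 2.x or later, and the `incoming` DataFrame with the extra label "c" is invented for the example.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+, running locally.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("unseen-labels-sketch")
  .getOrCreate()
import spark.implicits._

val train = Seq((1, "a"), (2, "a"), (3, "b")).toDF("x", "y")
// Hypothetical new data: "c" never appeared in the training set.
val incoming = Seq((1, "a"), (2, "c")).toDF("x", "y")

val indexer = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("y_index")
  .fit(train)

// indexer.labels holds the labels seen at fit time; keep only rows that use them.
val known = incoming.filter($"y".isin(indexer.labels: _*))

// Only the row with y = "a" is left, so transform no longer meets an unknown label.
indexer.transform(known).show()
```

Later Spark releases also expose a handleInvalid parameter on StringIndexer (for example `setHandleInvalid("skip")`), which filters such rows out during transform; whether it is available depends on the Spark version you run.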

How does the RandomForestModel know which columns are categorical?

Different ML transformers add special metadata to the transformed columns, which indicates the type of the column, the number of classes, etc.

import org.apache.spark.ml.attribute._
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y_index"))
  .setOutputCol("features")

val transformed = assembler.transform(indexer.transform(train))
val meta = AttributeGroup.fromStructField(transformed.schema("features"))
meta.attributes.get

// Array[org.apache.spark.ml.attribute.Attribute] = Array(
//   {"type":"numeric","idx":0,"name":"x"},
//   {"vals":["a","b"],"type":"nominal","idx":1,"name":"y_index"})

or

transformed.select($"features").schema.fields.last.metadata
// {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"x"}],
//  "nominal":[{"vals":["a","b"],"idx":1,"name":"y_index"}]},"num_attrs":2}}
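Since the categorical information lives in column metadata, it should also be possible to attach it by hand to a column that already contains category indices, without going through StringIndexer. The sketch below is an illustration under assumptions: Spark 2.x or later, a made-up column `code` holding the indices 0, 1 and 2, and a cast to double so the column looks like StringIndexer output.

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+, running locally.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nominal-metadata-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: "code" already holds category indices 0, 1, 2.
val df = Seq((1, 0), (2, 2), (3, 1)).toDF("x", "code")

// Describe the column as nominal with three categories.
val meta = NominalAttribute.defaultAttr
  .withName("code")
  .withNumValues(3)
  .toMetadata()

// Re-alias the column with the metadata attached.
// The cast to double mirrors the type StringIndexer produces.
val tagged = df.select($"x", $"code".cast("double").as("code", meta))

// Read the attribute back from the schema to check what downstream stages will see.
val attr = Attribute.fromStructField(tagged.schema("code"))
println(attr.isNominal)
```

If the check reports a nominal attribute, a VectorAssembler applied to `tagged` can carry that information into the features column in the same way as shown above for `y_index`.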
