Spark ML StringIndexer Different Labels Training/Testing

Question

I'm using Scala with StringIndexer to assign an index to each category in my training set. It assigns indices based on the frequency of each category.

The problem is that in my testing data the category frequencies are different, so StringIndexer assigns different indices to the categories, which prevents me from evaluating the model (Random Forest) correctly.

I am processing the training/testing data in the exact same way, and don't save the model.

I have tried creating the labels manually (by getting the index of the category), but I get this error:

java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.

It seems that I must use StringIndexer, so how do I ensure that future datasets that I use for testing index the categories the same way as the training set?

EDIT adding code of my attempted workaround

This is what the dataframe looks like, call it mydata

+--------+-----+---------+---------+
|category|label|        x|        y|
+--------+-----+---------+---------+
|       a|  0.0|-0.166992|-0.256348|
|       b|  1.0|-0.179199| -0.22998|
|       c|  2.0|-0.172119|-0.105713|
|       d|  3.0|-0.064209| 0.050293|

I use vector assembler to prepare features

val assembler = new VectorAssembler().setInputCols(Array("x", "y")).setOutputCol("features")

Transform mydata using the assembler above, which adds the features column

val predValues = assembler.transform(mydata)

The model expects two columns, features and label, and I want to supply my own label. I select features from predValues

val features = sqlContext.sql("SELECT features from predValues")

And select label from my df

val labelDF = sqlContext.sql("SELECT label FROM filterFeaturesOnly")

And then join the two together so I'll have features and label to pass to model

val featuresAndLabels = features.join(labelDF)

This is what I am passing to the model, and I get the error mentioned above.

val label = predValues.join(labelDF)

Matthew Graves · Accepted Answer · 2016-04-12T17:54:48

If you want to label things consistently, then you need to keep the fitted StringIndexer model and reuse it.

Consider this sample code from the docs:

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val indexed = indexer.fit(df).transform(df)

The indexer.fit(df) call returns a StringIndexerModel, which is what actually runs the transform function. So instead, keep a reference to the fitted model:

val indexerModel = indexer.fit(trainDF)
val indexed = indexerModel.transform(trainDF)

This later allows you to call indexerModel.transform(testDF) and get the same labels for the same inputs.
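To make the effect concrete, here is a minimal sketch in spark-shell style. It assumes Spark 2.x or later (it uses SparkSession rather than the sqlContext in the question), and the two tiny DataFrames are made-up data chosen so that the category frequencies differ between training and testing.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// Assumption: a local session, for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("indexer-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: "a" is the most frequent category in training,
// while "c" is the most frequent category in testing.
val trainDF = Seq("a", "a", "a", "b", "b", "c").toDF("category")
val testDF  = Seq("c", "c", "c", "b", "a").toDF("category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

// Fit once, on the training data only. The category-to-index mapping
// is fixed at this point: a -> 0.0, b -> 1.0, c -> 2.0.
val indexerModel = indexer.fit(trainDF)

// Reuse the same fitted model for both sets instead of refitting on test.
val trainIndexed = indexerModel.transform(trainDF)
val testIndexed  = indexerModel.transform(testDF)

// "c" still maps to 2.0 in the test set, even though it is the most
// frequent category there.
testIndexed.distinct().show()
```

By default, transform throws an error if the test set contains a category that was not seen during fit. If that can happen in your data, setHandleInvalid("skip") on the indexer drops those rows instead. Putting the indexer and the classifier into a single Pipeline should give the same consistency, since the fitted PipelineModel holds the fitted indexer as one of its stages.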
