I'm using Scala and StringIndexer to assign an index to each category in my training set. It assigns indices based on the frequency of each category.
The problem is that in my testing data the category frequencies are different, so StringIndexer assigns different indices to the categories, which prevents me from evaluating the model (Random Forest) correctly.
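To illustrate the mismatch without Spark, here is a plain-Scala sketch that mimics frequency-based indexing. The helper `fitIndex`, the object name, and the sample data are made up for illustration, and breaking ties alphabetically is an assumption of this sketch rather than a statement about StringIndexer.

```scala
// Hypothetical stand-in for frequency-based indexing (not the Spark API):
// the most frequent category gets index 0.0, the next one 1.0, and so on.
object FrequencyIndexDemo {
  def fitIndex(categories: Seq[String]): Map[String, Double] =
    categories
      .groupBy(identity)
      .map { case (cat, occurrences) => (cat, occurrences.size) }
      .toSeq
      .sortBy { case (cat, count) => (-count, cat) } // most frequent first; ties alphabetical (assumption)
      .zipWithIndex
      .map { case ((cat, _), idx) => cat -> idx.toDouble }
      .toMap

  // Made-up samples in which the category frequencies are reversed.
  val train = Seq("a", "a", "a", "b", "b", "c")
  val test  = Seq("c", "c", "c", "b", "b", "a")

  val trainIndex = fitIndex(train) // a -> 0.0, b -> 1.0, c -> 2.0
  val testIndex  = fitIndex(test)  // c -> 0.0, b -> 1.0, a -> 2.0

  // Labelling the test data with the mapping learned from the training data
  // keeps every category on the index the model was trained with.
  val testLabels = test.map(trainIndex) // 2.0, 2.0, 2.0, 1.0, 1.0, 0.0
}
```

Fitting the mapping separately on each dataset swaps the indices of a and c, while reusing the training mapping keeps them stable.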
I process the training and testing data in exactly the same way, and I don't save the model.
I have tried manually creating labels (by getting the index of the category), but I get this error:
java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
It seems that I must use StringIndexer, so how do I ensure that future datasets that I use for testing index the categories the same way as the training set?
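A minimal sketch of one possible approach, assuming Spark ML is on the classpath: fit StringIndexer once on the training data and reuse the fitted StringIndexerModel on any later dataset. The object and function names are hypothetical, and the sketch assumes both DataFrames have a `category` column and no existing `label` column.

```scala
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.DataFrame

object ConsistentIndexing {
  // Hypothetical helper: learns the category -> index mapping from `train` only
  // and applies that same mapping to both DataFrames.
  def indexConsistently(train: DataFrame, test: DataFrame): (DataFrame, DataFrame) = {
    val indexerModel: StringIndexerModel = new StringIndexer()
      .setInputCol("category") // assumed name of the string column
      .setOutputCol("label")   // assumed not to exist yet in either DataFrame
      .fit(train)              // the mapping is learned here, from training data only

    // transform() uses the stored mapping and does not recompute frequencies.
    (indexerModel.transform(train), indexerModel.transform(test))
  }
}
```

With the default settings, a category that appears in the test data but not in the training data causes an error at transform time; `setHandleInvalid("skip")` on the indexer drops such rows instead.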
EDIT: adding the code of my attempted workaround.
This is what the DataFrame looks like; call it mydata:
+--------+-----+---------+---------+
|category|label|        x|        y|
+--------+-----+---------+---------+
|       a|  0.0|-0.166992|-0.256348|
|       b|  1.0|-0.179199| -0.22998|
|       c|  2.0|-0.172119|-0.105713|
|       d|  3.0|-0.064209| 0.050293|
I use a VectorAssembler to prepare the features:
val assembler = new VectorAssembler().setInputCols(Array("x", "y")).setOutputCol("features")
I transform mydata with the assembler above, which adds the features column:
val predValues = assembler.transform(mydata)
The model expects two columns, features and label, and I want to use my own label. I select features from predValues:
val features = sqlContext.sql("SELECT features from predValues")
And select label from my DataFrame:
val labelDF = sqlContext.sql("SELECT label FROM filterFeaturesOnly")
Then I join the two together so that I have features and label to pass to the model:
val featuresAndLabels = features.join(labelDF)
This is what I am passing to the model, and I get the error mentioned above.
val label = predValues.join(labelDF)