Can I use a StringIndexer without One-Hot-Encoding it in PMML (exporting from Spark)?

Question

I'm trying to take a functional, fitted SparkML pipeline (Scala, Spark 2.1.1 for compatibility reasons) and turn it into PMML for interoperability and storage purposes.

At the moment, the pipeline has the following form: Array(StringIndexer,StringIndexer,VectorAssembler,VectorIndexer). I've tried the standard org.jpmml.sparkml.PMMLBuilder which works perfectly fine in situations where I'd already indexed the strings on the database. (I know how many distinct strings there are in these columns, and I'm completely certain that they'll stay categorical.) I'm planning on using them in a decision tree and a few other tree-based methods, and SparkML has lovely treatment of categorical variables in trees that make one-hot-encoding less than ideal.

val strCols = Array("stringCol1","stringCol2")

val strIndexers = strCols.map(c => new StringIndexer().setInputCol(c).setOutputCol(c+"_Indexed"))

val collist = df.columns.diff(strCols) ++ strCols.map(c => c+"_Indexed")

val vectorAssembler = new VectorAssembler() 
    .setInputCols(collist)
    .setOutputCol("rawFeatures")

val vectorIndexer = new VectorIndexer().setInputCol("rawFeatures").setOutputCol("features").setMaxCategories(35)

val pipeintro = new Pipeline().setStages(strIndexers :+ vectorAssembler :+ vectorIndexer)
val pipeIntro = pipeintro.fit(df)

val pmmlBuilder = new org.jpmml.sparkml.PMMLBuilder(df.schema, pipeIntro).buildFile(new File("out.pmml"))

I expected the code to complete running and output the appropriate PMML, but what I get instead is:

java.lang.IllegalArgumentException: Field stringCol1 has valid values [MT, IP, OB, GA, ED, OP]
  at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:209)
  at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(VectorIndexerModelConverter.java:80)
  at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:47)
  at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:114)
  at org.jpmml.sparkml.PMMLBuilder.buildFile(PMMLBuilder.java:292)

I've checked for null values; there are none, nor are there other values that are invalid. There's some indication somewhere that StringIndexers are supposed to be one-hot-encoded before being put into a VectorAssembler, but that's suboptimal for this particular pipeline since it's intended to feed into a SparkML-defined tree, which deals well with multi-value categorical columns. Is that guidance hard-coded into PMML or the Spark-PMML encoder? Is there some other error that I'm missing?

user1808924 user1808924 · Accepted Answer · 2019-05-10T06:23:10

This particular exception is about conflicting "stringCol1" definition - its value space has been defined by StringIndexer ([MT, IP, OB, GA, ED, OP]), and now VectorIndexer is trying to re-define it with a different value space. One of those attempts is wrong.

Could be a bug of the JPMML-SparkML library, or your script. Perhaps the output of StringIndexerModel shouldn't be VectorIndexed at all?

Can I use a StringIndexer without One-Hot-Encoding it in PMML (exporting from Spark)?

1 Answers