I'm trying to take a functional, fitted SparkML pipeline (Scala, Spark 2.1.1 for compatibility reasons) and turn it into PMML for interoperability and storage purposes.
At the moment, the pipeline has the following form: Array(StringIndexer, StringIndexer, VectorAssembler, VectorIndexer). I've tried the standard org.jpmml.sparkml.PMMLBuilder, which works perfectly fine when I've already indexed the strings in the database. (I know how many distinct strings there are in these columns, and I'm completely certain that they'll stay categorical.) I'm planning to use them in a decision tree and a few other tree-based methods, and SparkML's native treatment of categorical variables in trees makes one-hot encoding less than ideal.
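import java.io.File
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, VectorIndexer}

// Index each string column into a numeric "<name>_Indexed" column.
val strCols = Array("stringCol1","stringCol2")
val strIndexers = strCols.map(c => new StringIndexer().setInputCol(c).setOutputCol(c+"_Indexed"))
// Assemble every non-string column plus the indexed string columns.
val collist = df.columns.diff(strCols) ++ strCols.map(c => c+"_Indexed")
val vectorAssembler = new VectorAssembler()
  .setInputCols(collist)
  .setOutputCol("rawFeatures")
val vectorIndexer = new VectorIndexer().setInputCol("rawFeatures").setOutputCol("features").setMaxCategories(35)
val pipeintro = new Pipeline().setStages(strIndexers :+ vectorAssembler :+ vectorIndexer)
val pipeIntro = pipeintro.fit(df)
// buildFile writes the PMML document and returns the File it wrote.
val pmmlBuilder = new org.jpmml.sparkml.PMMLBuilder(df.schema, pipeIntro).buildFile(new File("out.pmml"))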
I expected the code to complete running and output the appropriate PMML, but what I get instead is:
java.lang.IllegalArgumentException: Field stringCol1 has valid values [MT, IP, OB, GA, ED, OP]
    at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:209)
    at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(VectorIndexerModelConverter.java:80)
    at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:47)
    at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:114)
    at org.jpmml.sparkml.PMMLBuilder.buildFile(PMMLBuilder.java:292)
I've checked for null values; there are none, nor are there any other invalid values. I've seen suggestions that StringIndexer output is supposed to be one-hot-encoded before it goes into a VectorAssembler, but that's suboptimal for this particular pipeline, since it's intended to feed into a SparkML-defined tree, which deals well with multi-valued categorical columns. Is that guidance hard-coded into PMML or into the Spark-to-PMML converter? Or is there some other error that I'm missing?
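For anyone who wants to reproduce the checks, here is a minimal sketch of how the null counts and the fitted stages could be inspected. The object name PipelineDiagnostics and its helper functions are hypothetical names made up for illustration; the Spark calls themselves (StringIndexerModel.labels, VectorIndexerModel.categoryMaps, PipelineModel.stages) are standard Spark ML API. It does not explain the exception by itself; it only makes it possible to compare what each stage learned.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.{StringIndexerModel, VectorIndexerModel}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper object, for illustration only.
object PipelineDiagnostics {
  // Number of null rows in each of the given columns.
  def nullCounts(df: DataFrame, cols: Seq[String]): Map[String, Long] =
    cols.map(c => c -> df.filter(col(c).isNull).count()).toMap

  // Labels learned by each StringIndexerModel, keyed by its output column.
  def indexerLabels(model: PipelineModel): Map[String, Seq[String]] =
    model.stages.collect {
      case s: StringIndexerModel => s.getOutputCol -> s.labels.toSeq
    }.toMap

  // Category maps learned by each VectorIndexerModel:
  // position in the assembled vector -> (raw value -> category index).
  def categoryMaps(model: PipelineModel): Seq[Map[Int, Map[Double, Int]]] =
    model.stages.collect { case v: VectorIndexerModel => v.categoryMaps }.toSeq
}
```

With the pipeline from the question, PipelineDiagnostics.indexerLabels(pipeIntro) should list the labels for stringCol1_Indexed and stringCol2_Indexed, and PipelineDiagnostics.categoryMaps(pipeIntro) shows which vector positions the VectorIndexer decided to treat as categorical. Note that with setMaxCategories(35), any assembled column with at most 35 distinct values is treated as categorical, including numeric columns that were never string-indexed, while a column with more than 35 distinct values is left continuous and does not appear in the map.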