I can't load a RandomForestClassificationModel saved by Spark.
Environment: Apache Spark 2.0.1, standalone mode running on a small (4-machine) cluster. No HDFS - everything is saved to local disks.
Build and save model:
from pyspark.ml.classification import RandomForestClassifier
classifier = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
model = classifier.fit(train)
result = model.transform(test)
model.write().save("/tmp/models/20161030-RF-topics-cats.model")
Later, in a separate program:
from pyspark.ml.classification import RandomForestClassificationModel
model = RandomForestClassificationModel.load("/tmp/models/20161029-RF-topics-cats.model")
gives:
Py4JJavaError: An error occurred while calling o81.load.
: org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/models/20161029-RF-topics-cats.model/treesMetadata. It must be specified manually;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:411)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:411)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:410)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:439)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:423)
at org.apache.spark.ml.tree.EnsembleModelReadWrite$.loadImpl(treeModels.scala:441)
at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:301)
I'd note that the same code works when I use a Naive Bayes classifier.
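Since everything is written to local disks, one thing that can be checked on each machine is whether the directory named in the error actually contains any output files there. Below is a minimal sketch using only the standard library. The helper name part_files is hypothetical, and it assumes that Spark's output files have names starting with "part-"; it only reports what is on the local disk and does not explain or fix the error.

```python
import os


def part_files(model_dir, subdir):
    """Return the sorted names of part files in model_dir/subdir on this machine.

    Assumption: Spark output files have names starting with "part-".
    Returns an empty list if the directory is missing or holds no such files.
    """
    path = os.path.join(model_dir, subdir)
    if not os.path.isdir(path):
        return []
    return sorted(name for name in os.listdir(path) if name.startswith("part-"))


if __name__ == "__main__":
    # Path taken from the load call; "treesMetadata" is the subdirectory
    # named in the AnalysisException.
    model_dir = "/tmp/models/20161029-RF-topics-cats.model"
    print("treesMetadata:", part_files(model_dir, "treesMetadata"))
```

Running this on the driver and on each worker would show which machines, if any, hold files under treesMetadata.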
serialVersionUID that is used for this kind of thing. But as I said this is speculative, I don't know how to fix it, and it is some years since I looked at it. - Nick Lothian