My goal is to export an H2O model trained on Spark with Scala (using Sparkling Water), so that I can import it into an application that does not use Spark.

Thus:

  • use Scala (the documentation only shows examples for R and Python)
  • export a model that was built with Sparkling Water (H2O on Spark)
  • import the model in Scala (without Spark or an H2O cluster, using only the hex-genmodel package)

I'm therefore using ModelSerializationSupport to export and MojoModel.load to import:

val gbmParams = new GBMParameters()
gbmParams._train = train
gbmParams._response_column = "target"
gbmParams._ntrees = 5
gbmParams._valid = valid
gbmParams._nfolds = 3 
gbmParams._min_rows = 1
gbmParams._distribution = DistributionFamily.multinomial
val gbm = new GBM(gbmParams)
val gbmModel = gbm.trainModel.get
val mojoPath = "./model.zip"
ModelSerializationSupport.exportMOJOModel(gbmModel, new File(mojoPath).toURI, force = true)
val simpleModel = new EasyPredictModelWrapper(MojoModel.load(mojoPath))

This fails with:

error in opening zip file
java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:220)
at java.util.zip.ZipFile.<init>(ZipFile.java:150)
at java.util.zip.ZipFile.<init>(ZipFile.java:121)
at hex.genmodel.ZipfileMojoReaderBackend.<init>(ZipfileMojoReaderBackend.java:13)
at hex.genmodel.MojoModel.load(MojoModel.java:33)
...

It seems that the MOJO exporter doesn't write the format that hex.genmodel expects (apparently a zip file).
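A minimal sketch for checking that suspicion with only the JDK, no H2O dependency. The names `MojoPathCheck` and `describePath` are hypothetical, and looking only at the leading "PK" signature is a rough heuristic rather than a full zip validation:

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper (not part of H2O or Sparkling Water):
// reports what actually exists at the given path.
object MojoPathCheck {
  def describePath(path: Path): String =
    if (!Files.exists(path)) "missing"
    else if (Files.isDirectory(path)) "directory"
    else {
      val in = Files.newInputStream(path)
      try {
        // Zip archives start with the two signature bytes 'P' and 'K'.
        val header = new Array[Byte](2)
        val read = in.read(header)
        if (read == 2 && header(0) == 'P'.toByte && header(1) == 'K'.toByte) "zip"
        else "other"
      } finally in.close()
    }
}
```

Calling `MojoPathCheck.describePath(java.nio.file.Paths.get(mojoPath))` right after the export shows whether the exporter produced a zip archive, a directory, or something else entirely.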

Running on Sparkling Water 2.1.23 (2.1.24 fails when building the cluster, as reported at https://0xdata.atlassian.net/browse/SW-776) and Spark 2.1.

-- Update:

Using the ModelSerializationSupport class to load its own export also fails with the same exception:

ModelSerializationSupport.loadMOJOModel(new File(mojoPath).toURI)

H2OModel export and load
Loading back as H2OModel (thus with sparkling-water) does work:

val h2oModelPath = "./model_h2o"
ModelSerializationSupport.exportH2OModel(gbmModel, new File(h2oModelPath).toURI, force = true)
val loadedModel: GBMModel = ModelSerializationSupport.loadH2OModel(new File(h2oModelPath).toURI)

H2OMOJOModel export and load
Loading it back with H2OMOJOModel does work (copied from implementation of H2OGBM):

val mojoModel = new H2OMOJOModel(ModelSerializationSupport.getMojoData(gbmModel))
mojoModel.write.overwrite.save(mojoPath)
H2OMOJOModel.load(mojoPath) 

H2OGBM export with MojoModel import
Attempting to import it with the regular MojoModel fails, though:

val gbm = new H2OGBM(gbmParams)(h2oContext, myspark.sqlContext)
val gbmModel = gbm.trainModel(gbmParams)
val mojoPath = "./models.zip"
gbmModel.write.overwrite.save(mojoPath)
MojoModel.load(mojoPath)

with the following exception:

./models.zip/model.ini (No such file or directory)
java.io.FileNotFoundException: ./models.zip/model.ini (No such file or directory)

1 Answer

The solution can be found in the getMojoModel method on ModelSerializationSupport, which accepts either a Model[_,_,_] or an Array[Byte].

The implementation of getMojoModel(Model[_,_,_]) writes the result of getMojoData(Model[_,_,_]) to a byte array and then reads the model back from that byte array.

A quick test along these lines works:

val config = new EasyPredictModelWrapper.Config()
config.setModel(ModelSerializationSupport.getMojoModel(gbmModel))
config.setConvertUnknownCategoricalLevelsToNa(true)
val easyPredictModelWrapper = new EasyPredictModelWrapper(config)

We can therefore reproduce this ourselves without using the ModelSerializationSupport class (which is part of Sparkling Water).

First, store the MOJO data in a file:

import java.io.FileOutputStream

val path = java.nio.file.Files.createTempFile("model", ".mojo")
path.toFile.deleteOnExit()
path.toString // shows where the file was written when run in a REPL
val outputStream = new FileOutputStream(path.toFile)
try {
  // Stream the MOJO of the trained model into the file
  gbmModel.getMojo.writeTo(outputStream)
} finally {
  outputStream.close()
}

Then read the bytes back (in another Scala application):

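import java.io.FileInputStream
import hex.genmodel.{ModelMojoReader, MojoReaderBackendFactory}
import hex.genmodel.easy.EasyPredictModelWrapper

// `path` is the location of the file written in the previous step
val is = new FileInputStream(path.toFile)
val reader = MojoReaderBackendFactory.createReaderBackend(is, MojoReaderBackendFactory.CachingStrategy.MEMORY)
val mojoModel = ModelMojoReader.readFrom(reader)
val config = new EasyPredictModelWrapper.Config()
config.setModel(mojoModel)
config.setConvertUnknownCategoricalLevelsToNa(true)
val easyPredictModelWrapper = new EasyPredictModelWrapper(config)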