2
votes

I have a saved H2O model in MOJO format, and I am now trying to load it and use it to make predictions on a new dataset (df) as part of a Spark app written in Scala. Ideally, I want to append a new column to the existing DataFrame containing the class probability predicted by this model.

I can see how to apply a mojo to an individual row already in a RowData format (as per answer here), but I am not sure how to map over an existing DataFrame so that it is in the right format to make predictions using the mojo model. I have worked with DataFrames a fair bit, but never with the underlying RDDs.

Also, should this model be serialised / broadcast so that predictions can be done in parallel on a cluster, or will it be available to all executors as part of the map?

I have gotten this far:

// load the MOJO model and create the easy predict model wrapper
val mojo = MojoModel.load("loca/path/to/mojo/mojo.zip")
val easyModel = new EasyPredictModelWrapper(mojo)

// map over the Spark DataFrame, convert it to an RDD, and make predictions on each row:
df.rdd.map { row =>
  val prediction = easyModel.predictBinomial(row).classProbabilities
  println(prediction)
}

But my row variable is not in the right format for this to work. Any suggestions on what to try next?

EDIT: my DataFrame consists of 70 predictive feature columns which are a mixture of integers and category/factor columns. A very simple sample DataFrame:

val df = Seq(
  (0, 3, "cat1"),
  (1, 2, "cat2"),
  (2, 6, "cat1")
).toDF("id", "age", "category")
2
Please provide sample data for the DataFrame. - Pratyush Sharma
Not sure how that helps - it should be a matter of mapping over the DF to extract each row in RowData format, no? - renegademonkey

2 Answers

1
votes

Use this function to build the RowData object that H2O needs from a Spark Row:

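import hex.genmodel.easy.RowData
import org.apache.spark.sql.{DataFrame, Row}

// Convert a Spark Row into an H2O RowData, keyed by column name.
def rowToRowData(df: DataFrame, row: Row): RowData = {
  val rowAsMap = row.getValuesMap[Any](df.schema.fieldNames)
  // Skip null values (v.toString would throw on them); pass everything else as a String.
  rowAsMap.foldLeft(new RowData()) { case (rd, (k, v)) =>
    if (v != null) { rd.put(k, v.toString) }
    rd
  }
}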
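
To apply a conversion like this across the whole DataFrame, and to sidestep the serialise/broadcast question, one option is to load the model once per partition on the executors. What follows is only a minimal sketch under several assumptions: the model is binomial, the MOJO file can be shipped with SparkContext.addFile, and index 1 of classProbabilities is the class of interest. The names MojoScoring, toRowData, score, and the output column p1 are hypothetical and not part of H2O or Spark.

```scala
import hex.genmodel.MojoModel
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import org.apache.spark.SparkFiles
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object MojoScoring {
  // Same idea as rowToRowData, but the column names come from the row's own schema,
  // so no DataFrame has to be referenced inside the Spark closure.
  def toRowData(row: Row): RowData = {
    val rd = new RowData()
    row.schema.fieldNames.zipWithIndex.foreach { case (name, i) =>
      if (!row.isNullAt(i)) rd.put(name, row.get(i).toString)
    }
    rd
  }

  // Returns df with one extra Double column "p1" (hypothetical name).
  def score(spark: SparkSession, df: DataFrame, mojoPath: String): DataFrame = {
    // Ship the MOJO file to every executor.
    spark.sparkContext.addFile(mojoPath)
    val fileName = new java.io.File(mojoPath).getName
    val outSchema =
      StructType(df.schema.fields :+ StructField("p1", DoubleType, nullable = false))

    val scored = df.rdd.mapPartitions { rows =>
      // Loaded once per partition, on the executor, from its local copy of the file.
      val model = new EasyPredictModelWrapper(MojoModel.load(SparkFiles.get(fileName)))
      rows.map { row =>
        val probs = model.predictBinomial(toRowData(row)).classProbabilities
        // Assumption: index 1 holds the probability of the second class level.
        Row.fromSeq(row.toSeq :+ probs(1))
      }
    }
    spark.createDataFrame(scored, outSchema)
  }
}
```

With the question's setup this would be called as MojoScoring.score(spark, df, "loca/path/to/mojo/mojo.zip"). Loading per partition costs one model load per task, but nothing model-related needs to be serialised from the driver.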
0
votes

You can call map on df directly instead of on df.rdd. I have a complete answer here: https://stackoverflow.com/a/47898040/9120484
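
For readers who cannot follow the link, here is a rough sketch of what calling map on the DataFrame itself might look like. It is not a reproduction of the linked answer. It assumes a binomial model, an Int column named id as in the question's sample data, a MOJO path readable on the driver, and that the wrapper serialises cleanly when captured by the closure. MapOnDataFrame, toRowData, probabilities, and the column name p1 are hypothetical.

```scala
import hex.genmodel.MojoModel
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object MapOnDataFrame {
  // Build a RowData from positional values and the matching column names.
  def toRowData(names: Array[String], row: Row): RowData = {
    val rd = new RowData()
    names.indices.foreach { i =>
      if (!row.isNullAt(i)) rd.put(names(i), row.get(i).toString)
    }
    rd
  }

  // Returns a two-column DataFrame: id and the probability of the second class level.
  def probabilities(spark: SparkSession, df: DataFrame, mojoPath: String): DataFrame = {
    import spark.implicits._ // supplies the Encoder for (Int, Double)
    // Assumption: the wrapper can be serialised and sent to the executors with the closure.
    val model = new EasyPredictModelWrapper(MojoModel.load(mojoPath))
    val names = df.columns
    df.map { row =>
      val probs = model.predictBinomial(toRowData(names, row)).classProbabilities
      (row.getAs[Int]("id"), probs(1))
    }.toDF("id", "p1")
  }
}
```

The result can then be joined back onto the original data with df.join(MapOnDataFrame.probabilities(spark, df, path), "id"), where path is the location of the MOJO file.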