I have a few questions about Sparkling Water and why it is needed.

Let's assume that I have a trained H2O model, exported both as a binary model and as a POJO.

Now I want to deploy the model to production, and I can use either the POJO or the binary model (with Sparkling Water).

  1. Which one should I use: plain Spark with the POJO, or Sparkling Water with the binary model?
  2. What exactly is Sparkling Water for, when we can easily deploy a model using the POJO and Spark alone?
  3. Is Sparkling Water needed only when you have to train a model on huge amounts of data, or can it also be used for production deployments of models?

Example: https://github.com/h2oai/h2o-droplets/blob/master/h2o-pojo-on-spark-droplet/src/main/scala/examples/PojoExample.scala

Uses Spark to run a POJO model.

Example: https://github.com/h2oai/h2o-droplets/blob/master/sparkling-water-droplet/src/main/scala/water/droplets/SparklingWaterDroplet.scala

Trains and runs a model in Sparkling Water.

What advantages does Sparkling Water provide over plain Spark?

1 Answer

  1. Which one should I use: plain Spark with the POJO, or Sparkling Water with the binary model?

    • There is no 'right' answer; it depends on your use case. It sounds like what you want is the POJO/MOJO in Spark, so you can score without the added dependency of keeping an H2O cluster running.
  2. What exactly is Sparkling Water for, when we can easily deploy a model using the POJO and Spark alone?

    • Sparkling Water exists to make an H2O cluster available within a Spark context. This is particularly useful for training: you can leverage Spark's many data connectors, munging capabilities, and so on. POJO/MOJO + Spark is sufficient for scoring.
  3. Is Sparkling Water needed only when you have to train a model on huge amounts of data, or can it also be used for production deployments of models?

    • Sparkling Water is needed when you want to use H2O's algorithms in a context that plays nicely with the Spark ecosystem.

If putting a model in "production" means having "always on" scoring exposed as a REST endpoint or similar, the POJO/MOJO is the way to go (H2O clusters are not highly available). You'll need to make sure you handle incoming data correctly yourself, though.
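
To give a feel for what "handling incoming data yourself" can involve, here is a minimal sketch in plain Java. Everything in it is an assumption made for illustration: the class name RowPreparer, the columns age, income and plan, and the allowed levels are hypothetical and do not come from any H2O API. The idea is only that a standalone scorer needs rows shaped like the training data, so unparseable numbers and unseen categorical levels are turned into missing values before scoring. The actual scoring call (for example through the h2o-genmodel library) is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical input checker; the schema below is assumed, not taken from a real model. */
final class RowPreparer {
    // Assumed numeric training columns (illustrative names only).
    static final List<String> NUMERIC_COLUMNS = List.of("age", "income");
    // Assumed categorical column and the levels seen during training.
    static final Map<String, List<String>> CATEGORICAL_LEVELS =
            Map.of("plan", List.of("basic", "premium"));

    /** Builds a row with only the expected columns; bad or unseen values become null (missing). */
    static Map<String, Object> prepare(Map<String, String> raw) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (String column : NUMERIC_COLUMNS) {
            row.put(column, parseNumber(raw.get(column)));
        }
        for (Map.Entry<String, List<String>> entry : CATEGORICAL_LEVELS.entrySet()) {
            String value = raw.get(entry.getKey());
            String normalized = value == null ? null : value.trim().toLowerCase(Locale.ROOT);
            // Guard against null first: immutable lists reject contains(null).
            boolean known = normalized != null && entry.getValue().contains(normalized);
            row.put(entry.getKey(), known ? normalized : null);
        }
        return row;
    }

    /** Parses a number, returning null for absent, empty, or malformed input. */
    private static Double parseNumber(String text) {
        if (text == null || text.trim().isEmpty()) {
            return null;
        }
        try {
            return Double.valueOf(text.trim());
        } catch (NumberFormatException ex) {
            return null;
        }
    }
}
```

Whether a missing value is the right fallback depends on the model and the use case; rejecting the request outright is an equally reasonable choice for a REST endpoint.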

If you are doing batch scoring, nightly or otherwise, then it may make sense to use the binary model with Sparkling Water, because parsing incoming data becomes trivial (asH2OFrame(..)) and scoring is as easy as predict().