Why does H2O integrate TensorFlow via Spark instead of directly?

Question

I really like H2O, especially because you can easily deploy the built models into any Java / JVM application. This is also my goal for TensorFlow: build models and then run them in Java applications.

H2O uses Spark (Sparkling Water) "in the middle" when using TensorFlow, by running TensorFlow on the distributed Spark nodes. I learned this (hopefully correctly) in an H2O demo video.

Why do you not integrate TensorFlow (and others like MXNet) directly with H2O, but instead go through Apache Spark?

Frameworks like TensorFlow support distributed training themselves, so why use Spark "in the middle"? Doesn't this add a lot of complexity that many scenarios don't need?

For example, Google built Scikit Flow (Scikit-learn + TensorFlow) to make it easy to build models with TensorFlow.

Especially for smaller data sets and / or simpler use cases, this seems to be an easier option than using Spark in the middle. If I understand correctly, you could then also use this model in Java via TensorFlow4Java.

I want to leverage H2O much more in future projects and scenarios (as in the past, see e.g. here, where I applied an H2O model to real-time applications using Apache Kafka and its Streams API). However, I am not sure why I need the "overhead" of Spark for building models with H2O and TensorFlow, especially for smaller data sets and / or simple scenarios where a "small neural network" might be good enough.

Mateusz Dymczyk · Accepted Answer · 2017-06-08T11:32:55

That is an old tutorial, made before distributed TensorFlow was released. At that time this was the easiest way to "integrate"/"distribute" it (by simply running TF on each worker node and then averaging the coefficients on the driver node).
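To illustrate the "train on each worker, average on the driver" idea, here is a minimal sketch in plain Python. It is not the code from the tutorial: a tiny linear model stands in for the TensorFlow network, the "workers" are just list slices rather than Spark executors, and averaging happens once at the end. All names (`fit_shard`, `average_coefficients`, `make_data`) are hypothetical.

```python
import random


def fit_shard(shard, epochs=500, lr=0.1):
    """Worker step (assumed): fit y ~ w*x + b on one shard with gradient descent."""
    w, b = 0.0, 0.0
    n = len(shard)
    for _ in range(epochs):
        # Gradients of the mean squared error (up to a constant factor).
        grad_w = sum((w * x + b - y) * x for x, y in shard) / n
        grad_b = sum((w * x + b - y) for x, y in shard) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def average_coefficients(worker_coeffs):
    """Driver step (assumed): element-wise mean of the workers' coefficient vectors."""
    n_workers = len(worker_coeffs)
    return [sum(vals) / n_workers for vals in zip(*worker_coeffs)]


def make_data(n, seed=0):
    """Synthetic data around y = 3x + 1 with a little Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in xs]


data = make_data(300)
shards = [data[i::3] for i in range(3)]          # three simulated worker nodes
worker_coeffs = [fit_shard(s) for s in shards]   # each worker trains independently
w, b = average_coefficients(worker_coeffs)       # driver combines the results
print("averaged coefficients:", w, b)
```

For a linear model like this one, averaging the per-worker coefficients lands close to the true slope and intercept. For a real neural network the loss is non-convex, so naive averaging is a much rougher approximation, which is part of why it was only ever a stopgap.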

Since then we have released the DeepWater project, which integrates several DL frameworks (TF, MXNet, Caffe) with H2O. It doesn't require Spark: if you're running a Linux machine, all you need to do is download H2O from here (H2O with GPU-Enabled Machine Learning) and run it. If you're running macOS or Windows, you'd have to build both DeepWater and H2O yourself, as we don't provide builds for those platforms yet.

As a disclaimer: DeepWater currently runs only on a single node, as none of the mentioned frameworks have Java APIs for distributed training (for example, TF only has a Python API for that). I'm working on a distributed version, but for now it's not really a #1 priority.
