I'm currently experimenting with Sparkling-Water. There are a few possible use cases, including data munging in H2O/Spark, model building and offline training, and online stream prediction. Is it also possible to use Sparkling-Water for online training with a Kafka streaming source?
1 Answer
The Deep Learning model in particular can continuously train forever if you keep presenting new data. So you could do online training with that.
Models like DRF and GBM can “add another tree” from new data using a checkpoint, although you really don’t want to end up with an ever-growing number of trees.
You could keep around a window of data and periodically train a new complete model. Swapping in a new model instance at runtime is pretty straightforward, so you could keep training in the background and periodically update the model that predicts on the streaming data, say every hour or every few minutes.
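To make the window-and-swap idea concrete, here is a minimal sketch in plain Python. It does not use the H2O or Sparkling-Water API: `MeanModel` and `WindowedRetrainer` are hypothetical names, the "model" is a toy that just predicts the mean label of its window, and retraining is triggered synchronously every few rows rather than by a background job on a timer.

```python
from collections import deque
from threading import Lock


class MeanModel:
    """Toy stand-in for a real model: predicts the mean label of its training rows."""

    def __init__(self, rows):
        labels = [label for _, label in rows]
        self.mean = sum(labels) / len(labels)

    def predict(self, features):
        return self.mean


class WindowedRetrainer:
    """Keeps a sliding window of rows and swaps in a freshly trained model periodically."""

    def __init__(self, window_size, retrain_every, train_fn=MeanModel):
        self.window = deque(maxlen=window_size)  # oldest rows fall off automatically
        self.retrain_every = retrain_every
        self.train_fn = train_fn
        self.model = None
        self._seen = 0
        self._lock = Lock()

    def observe(self, features, label):
        self.window.append((features, label))
        self._seen += 1
        if self._seen % self.retrain_every == 0:
            # Train a complete new model on the current window only.
            new_model = self.train_fn(list(self.window))
            with self._lock:
                self.model = new_model  # swap the reference used for prediction

    def predict(self, features):
        with self._lock:
            model = self.model
        return None if model is None else model.predict(features)


retrainer = WindowedRetrainer(window_size=4, retrain_every=2)
for i, label in enumerate([1.0, 3.0, 5.0, 7.0, 9.0, 11.0]):
    retrainer.observe({"i": i}, label)

print(retrainer.predict({"i": 6}))
```

In a real setup, `train_fn` would be whatever trains your actual model on the windowed data, and the call to it would run in a background thread or job so that prediction is never blocked by training.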
Or you could do your own ensembling by averaging the predictions of many models, periodically throwing away older models and adding newer ones in a conveyor-belt strategy, similar to a moving average.
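A rough sketch of that conveyor-belt ensemble, again in plain Python rather than the H2O API. `ConveyorEnsemble` is a hypothetical name, and each "model" is assumed to be any callable that maps features to a numeric prediction.

```python
from collections import deque


class ConveyorEnsemble:
    """Averages the predictions of the most recent `max_models` models."""

    def __init__(self, max_models):
        # A bounded deque drops the oldest model when a new one is appended.
        self.models = deque(maxlen=max_models)

    def add(self, model):
        self.models.append(model)

    def predict(self, features):
        if not self.models:
            raise ValueError("no models in the ensemble yet")
        preds = [model(features) for model in self.models]
        return sum(preds) / len(preds)


ensemble = ConveyorEnsemble(max_models=3)
# Four toy models, each adding a different offset; the first is evicted.
for offset in [0.0, 1.0, 2.0, 3.0]:
    ensemble.add(lambda x, b=offset: x + b)

print(ensemble.predict(10.0))
```

The plain average treats all surviving models equally. If newer models should count more, you could weight them by age, which would be closer to an exponentially weighted moving average.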