4
votes

Is there a way I can use H2O to iterate over data that is larger than the cumulative memory size of the cluster? I have a big dataset which I need to iterate through in batches and feed into TensorFlow for gradient descent. At any given time, I only need to load one batch (or a handful) in memory. Is there a way I can set up H2O to perform this kind of iteration without loading the entire dataset into memory?

Here's a related question that was answered over a year ago, but doesn't solve my problem: Loading data bigger than the memory size in h2o

1
"feed into TensorFlow" implies you are using Deep Water, rather than the built-in H2O algorithms? It might be useful to specify what your cluster looks like: how many nodes, how many CPU cores and GPUs in each node, etc. – Darren Cook
Darren, thanks for your response. I'm using plain TensorFlow. I have a separate data transformation pipeline using Spark, Hive and HDFS which generates ORC files for training. Next, I need to stream batches of this data into TensorFlow / Python, and that's where this question arises. We wrote our own ORC Python streamer using py4j, but that's too slow, hence I'm looking for a faster streaming-IO Python lib for ORC/HDFS. I tried pyarrow and one more lib besides Spark's localIterator, but ran into bugs in all cases. H2O works but requires massive RAM, and even then runs into memory issues sometimes. – BoltzmannMachine

1 Answer

2
votes

The short answer is that this isn't what H2O was designed to do, so unfortunately the answer today is no.


The longer answer... (Assuming that the intent of the question is regarding model training in H2O-3.x...)

I can think of at least two ways one might want to use H2O in this way: one-pass streaming, and swapping.

Think of one-pass streaming as having a continuous data stream feeding in, and the data constantly being acted on and then thrown away (or passed along).

Think of swapping as the computer science equivalent of swapping, where there is fast storage (memory) and slow storage (disk) and the algorithms are continuously sweeping over the data and faulting (swapping) data from disk to memory.

Swapping just gets worse and worse from a performance perspective the bigger the data gets. H2O isn't ever tested this way, so you're on your own. Maybe you can figure out how to enable an unsupported swapping mode from clues/hints in the other referenced Stack Overflow question (or the source code), but nobody ever runs that way. H2O was architected to be fast for machine learning by holding data in memory. Machine learning algorithms iteratively sweep over the data again and again. If every data touch hits the disk, that's just not the experience the in-memory H2O-3 platform was designed to provide.

The streaming use case, especially for some algorithms like Deep Learning and DRF, definitely makes more sense for H2O. H2O algorithms support checkpoints, and you can imagine a scenario where you read some data, train a model, then purge that data and read in new data, and continue training from the checkpoint. In the deep learning case, you'd be updating the neural network weights with the new data. In the DRF case, you'd be adding new trees based on the new data.
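To make the checkpoint-style loop above concrete, here is a minimal plain-Python sketch (not H2O's actual API): a generator yields one batch at a time so only that batch is ever in memory, and a tiny stand-in "model" (a running mean) is updated from each batch and carried forward, the way checkpointed weights or trees would be. All function and parameter names here are illustrative.

```python
def batches(stream, batch_size):
    """Yield fixed-size lists from an iterable without materializing it all."""
    batch = []
    for row in stream:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

def train_incrementally(stream, batch_size):
    """Checkpoint-style loop: each batch updates the model state, then is discarded."""
    count, mean = 0, 0.0  # the "checkpointed" model state carried between batches
    for batch in batches(stream, batch_size):
        for x in batch:
            count += 1
            mean += (x - mean) / count  # streaming (incremental) mean update
    return mean

print(train_incrementally(iter(range(10)), batch_size=3))  # 4.5
```

In the real H2O scenario, the model state would be the deep learning weights (or the forest of trees) and each `train` call would pass the previous model as the checkpoint; the memory profile is the same in both cases: one batch resident at a time.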