
Let's say I have a function getData() that returns data (think of it as a data stream). I need to build an H2O data frame from this data, inserting each arriving record as a new row only if it is not already present in the data frame.

One obvious way to do this:

  1. Keep a global H2O data frame.
  2. Create a 1-row H2O data frame from the newly arrived data (I am using as.h2o()).
  3. Check whether that row is already present in the global data frame (using h2o.which() or any other function).
  4. If it is not present, append it to the global data frame (using h2o.rbind()).

This approach is too slow. Creating an H2O data frame every time data arrives (step 2) takes too much time. (I have only tested this on a small dataset.)

I was also thinking of storing the rows in an R data frame and then calling h2o.rbind() at intervals.

What is the best way to do this, with speed as the priority?

This post needs a code example and some benchmarks. How are you creating the H2O Frame, given that you report step 2 as the bottleneck? I am doubtful that adding another step of creating an R data.frame would speed things up, but that's why code and benchmarks are the only way to really answer your question. – Erin LeDell
@ErinLeDell I have edited the question. On a small dataset (around 500 rows added incrementally), creating the H2O data frame seems to take more time than searching and binding. – Vijay Giri
OK thanks, it makes more sense now. What does getData() return, a 1-row data.frame in R? – Erin LeDell
@ErinLeDell Yes. – Vijay Giri

1 Answer


You definitely want to minimize calls to as.h2o(), since that function writes the data from R memory to disk and then reads it into the H2O cluster from disk. It's meant to be used sparingly. However, one way to speed up the as.h2o() call is to use data.table on the backend. If you have data.table installed, you can add the following lines to the top of your code, and as.h2o() will use data.table::fwrite() instead of utils::write.csv() internally.

library(data.table)
options("h2o.use.data.table" = TRUE)

Because each as.h2o() call is expensive, it will probably be faster to buffer a few hundred or a few thousand rows in an R data.frame and convert that buffer to an H2OFrame periodically with as.h2o() (using the data.table backend). You can then scan the rows of that H2OFrame to see which ones are new and append them to your "global" H2OFrame with h2o.rbind().
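As a rough illustration of the buffering idea, here is a minimal sketch, not a tested implementation against your data. It makes one substitution worth naming: instead of scanning the H2OFrame for duplicates, it tracks already-seen rows in R using a string key, so the duplicate check never touches the cluster. The names make_row_buffer, make_h2o_flusher, flush_fn, batch_size and state are hypothetical, and the sketch assumes getData() returns a 1-row data.frame with the same columns every time.

```r
# Hypothetical helper: keeps new rows in R and hands them to `flush_fn` in batches.
make_row_buffer <- function(flush_fn, batch_size = 500L) {
  seen    <- new.env(hash = TRUE)  # keys of every row accepted so far
  pending <- list()                # accepted rows that have not been flushed yet

  # Build a string key from all cell values of a 1-row data.frame.
  row_key <- function(row) {
    paste0("k:", paste(vapply(row, as.character, character(1)), collapse = "\x1f"))
  }

  # Bind the pending rows into one data.frame and pass it to `flush_fn`.
  flush <- function() {
    if (length(pending) == 0L) return(invisible(0L))
    batch <- do.call(rbind, pending)
    pending <<- list()
    flush_fn(batch)
    invisible(nrow(batch))
  }

  # Returns TRUE if the row was new and buffered, FALSE if it was a duplicate.
  add <- function(row) {
    key <- row_key(row)
    if (exists(key, envir = seen, inherits = FALSE)) return(invisible(FALSE))
    assign(key, TRUE, envir = seen)
    pending[[length(pending) + 1L]] <<- row
    if (length(pending) >= batch_size) flush()
    invisible(TRUE)
  }

  list(add = add, flush = flush)
}

# Flush target for H2O: one as.h2o() call per batch, appended with h2o.rbind().
# Assumes the h2o package is installed and h2o.init() has already been called.
# The accumulated H2OFrame is stored in `state$frame`.
make_h2o_flusher <- function(state) {
  state$frame <- NULL
  function(batch) {
    hf <- h2o::as.h2o(batch)
    state$frame <- if (is.null(state$frame)) hf else h2o::h2o.rbind(state$frame, hf)
  }
}

# Intended use (not run here):
#   state <- new.env()
#   buf   <- make_row_buffer(make_h2o_flusher(state), batch_size = 500L)
#   buf$add(getData())   # call once per arriving row
#   buf$flush()          # push any remainder before using state$frame
```

The trade-off is that the set of keys lives in R memory, and rows that differ only beyond the precision of as.character() would be treated as duplicates. Column types also need to stay consistent from batch to batch for h2o.rbind() to accept them.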

The only way to know for sure which method will be faster is to test both methods on your data and your machine.
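A simple timing harness along these lines could be a starting point for that comparison. It is only a sketch: compare_conversion and convert are hypothetical names, and the 500 synthetic rows stand in for repeated getData() calls. To measure the real cost, pass h2o::as.h2o as convert after calling h2o.init().

```r
# Time one conversion per row against a single conversion of the bound batch.
# `convert` is whatever function turns a data.frame into the target frame.
compare_conversion <- function(rows, convert) {
  per_row <- system.time(for (r in rows) convert(r))[["elapsed"]]
  batched <- system.time(convert(do.call(rbind, rows)))[["elapsed"]]
  c(per_row = per_row, batched = batched)
}

# Synthetic stand-in for the data stream: 500 one-row data.frames.
rows <- lapply(seq_len(500), function(i) data.frame(id = i, value = i * 0.5))

# Intended use (not run here, requires a running H2O cluster):
#   compare_conversion(rows, h2o::as.h2o)
```

Running it with and without options("h2o.use.data.table" = TRUE) would also show how much the data.table backend helps on your machine.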