0
votes

I am working with a very large dataset and I would like to keep the data in H2O as much as possible without bringing it into R.

I noticed whenever I pass an H2O Frame to a function, any modification I make to the Frame is not reflected outside of the function. Is there a way to pass the Frame by Reference?

If not, what's the best way to modify the original frame inside a function without copying all of the Frame?

Another related question: does passing a Frame to other functions (read only) make extra copies on the H2O side? My datasets are 30GB - 100GB, so I want to make sure passing them around does not cause memory issues.

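library(h2o)
h2o.init()

# Assigns to the function's local copy of the frame; nothing is returned to the caller
mod = function(fdx) {
  fdx[,"x"] = -1
}

d = data.frame(x = rnorm(100),y=rnorm(100))
dx = as.h2o(d)
dx[1,]
mod(dx)
dx[1,]  # does not change the original value of x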


 > dx[1,]
           x         y
 1 0.3114706 0.9523058

 > dx[1,]
           x         y
 1 0.3114706 0.9523058

Thanks!

1
data.table has a similar by-reference mechanism, but I am not sure it can be used in your case. You can take a look here. – Patric

1 Answer

2
votes

H2O does a classic copy-on-write optimization. Thus:

  • No true copy is made unless you mutate the dataset.
  • Only changed or added columns are truly copied; all others are shared by reference.
  • Frames in R are pass-by-value, which H2O mimics.
  • Frames in Python are pass-by-reference, which H2O mimics.

In short, do as you would in R, and you're fine.

As for read-only use: passing a Frame to other functions makes no extra copies.
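For the example in the question, "do as you would in R" would mean returning the modified frame from the function and reassigning it at the call site. The following is only a minimal sketch of that pattern, assuming a running local H2O cluster; the function name `mod` and the column names come from the question, and under the copy-on-write behaviour described above only the changed column `x` should be newly materialized.

```r
library(h2o)
h2o.init()  # assumes a local H2O cluster can be started or reached

# Return the modified frame instead of relying on side effects
mod <- function(fdx) {
  fdx[, "x"] <- -1   # modifies the function's local handle only
  fdx                # hand the modified frame back to the caller
}

d  <- data.frame(x = rnorm(100), y = rnorm(100))
dx <- as.h2o(d)

dx <- mod(dx)        # reassign so the caller sees the change
dx[1, ]              # x should now be -1; y is untouched
```

The reassignment `dx <- mod(dx)` is the same idiom you would use with an ordinary R data.frame, so no H2O-specific by-reference mechanism is needed.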