7 votes

I'm trying to use H2O via R to build multiple models using subsets of one large-ish data set (~10 GB). The data is one year's worth of data, and I'm trying to build 51 models (i.e., train on week 1, predict on week 2, etc.), with each week being about 1.5-2.5 million rows with 8 variables.

I've done this inside a loop, which I know is not always the best approach in R. Another issue I found was that the H2O entity would accumulate prior objects, so I wrote a function to remove all of them except the main data set.

h2o.clean <- function(clust = localH2O, verbose = TRUE, vte = c()) {
  # vte: vector of keys to exclude from removal
  # Find all objects on the server
  keysToKill <- h2o.ls(clust)$Key
  # Drop the keys to be excluded, if any
  keysToKill <- setdiff(keysToKill, vte)
  # Loop through and remove the remaining objects
  for (i in keysToKill) {
    h2o.rm(object = clust, keys = i)
    if (verbose) {
      print(i)
      flush.console()
    }
  }
  # Print the objects remaining in the cluster
  h2o.ls(clust)
}

The script runs fine for a while and then crashes - often with a complaint about running out of memory and swapping to disk.

Here's some pseudocode describing the process:

# load h2o library
library(h2o)
# create h2o entity
localH2O = h2o.init(nthreads = 4, max_mem_size = "6g")
# load data
dat1.hex = h2o.importFile(localH2O, inFile, key = "dat1.hex")

# Start loop
for (i in 1:51) {
  # create train/test hex objects
  train1.hex <- dat1.hex[dat1.hex$week_num == i, ]
  test1.hex <- dat1.hex[dat1.hex$week_num == i + 1, ]
  # train gbm
  dat1.gbm <- h2o.gbm(y = 'click_target2', x = xVars, data = train1.hex
                      , nfolds = 3
                      , importance = TRUE
                      , distribution = 'bernoulli'
                      , n.trees = 100
                      , interaction.depth = 10
                      , shrinkage = 0.01
  )
  # calculate out-of-sample performance
  test2.hex <- cbind.H2OParsedData(test1.hex, h2o.predict(dat1.gbm, test1.hex))
  colnames(test2.hex) <- names(head(test2.hex))
  gbmAuc <- h2o.performance(test2.hex$X1, test2.hex$click_target2)@model$auc

  # clean h2o entity
  h2o.clean(clust = localH2O, verbose = FALSE, vte = c('dat1.hex'))
} # end loop

My question is: what, if any, is the correct way to manage data and memory in a standalone H2O instance for this type of process? (This is NOT running on Hadoop or a cluster, just a large EC2 instance with ~64 GB of RAM and 12 CPUs.) Should I be killing and recreating the H2O entity after each loop iteration? (This was the original process, but reading the data from file every time adds ~10 minutes per iteration.) Is there a proper way to garbage collect or release memory after each loop?

Any suggestions would be appreciated.

Comment (Fedorenko Kristina): You can delete whatever you want by key: `h2o.rm(localH2O, "keyDataWhichIWantDelete")`

4 Answers

9 votes

This answer is for the original H2O project (releases 2.x.y.z).

In the original H2O project, the H2O R package creates lots of temporary H2O objects in the H2O cluster DKV (Distributed Key/Value store) with a "Last.value" prefix.

These are visible both in the Store View from the Web UI and by calling h2o.ls() from R.

What I recommend doing is:

  • at the bottom of each loop iteration, use h2o.assign() to do a deep copy of anything you want to save to a known key name
  • use h2o.rm() to remove anything you don't want to keep, in particular the "Last.value" temps
  • call gc() explicitly in R somewhere in the loop

Here is a function which removes the Last.value temp objects for you. Pass in the H2O connection object as the argument:

removeLastValues <- function(conn) {
    df <- h2o.ls(conn)
    keys_to_remove <- grep("^Last\\.value\\.", perl = TRUE, x = df$Key, value = TRUE)
    unique_keys_to_remove <- unique(keys_to_remove)
    if (length(unique_keys_to_remove) > 0) {
        h2o.rm(conn, unique_keys_to_remove)
    }
}
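
Putting those three bullets together, the bottom of each loop iteration might look like the sketch below (2.x-style API; the "scored_week_" key name is illustrative, while test2.hex and localH2O are reused from the question):

# Deep-copy the result frame to a stable, known key so it survives cleanup.
test2.hex <- h2o.assign(test2.hex, key = paste0("scored_week_", i))
# Remove the "Last.value" temporaries accumulated during this iteration.
removeLastValues(localH2O)
# Run R's GC so dead handles are released and H2O can free the backing memory.
gc()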

Here is a link to an R test in the H2O GitHub repository that uses this technique and can run indefinitely without running out of memory:

https://github.com/h2oai/h2o/blob/master/R/tests/testdir_misc/runit_looping_slice_quantile.R

4 votes

New suggestion as of 12/15/2015: update to the latest stable release (Tibshirani 3.6.0.8 or later). We've completely reworked how R and H2O handle internal temp variables, and the memory management is much smoother.

Next: H2O temps can be held "alive" by dead R variables, so run an R gc() every loop iteration. Once R's GC removes the dead variables, H2O will reclaim that memory.

After that, your cluster should only be holding on to specifically named things, like loaded datasets and models. You'll need to delete these roughly as fast as you make them to avoid accumulating large data in the K/V store.
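
A minimal sketch of that advice, reusing the variable names from the question (3.x-style API, so training_frame rather than the older data argument):

for (i in 1:51) {
  train1.hex <- dat1.hex[dat1.hex$week_num == i, ]
  dat1.gbm <- h2o.gbm(y = 'click_target2', x = xVars, training_frame = train1.hex)
  # ... score the model and record performance as before ...

  # Drop the R handles so the H2O temporaries they pin become dead variables.
  rm(train1.hex, dat1.gbm)
  # Run R's GC; once the dead variables are collected, H2O reclaims that memory.
  gc()
}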

Please let us know if you have any more problems by posting to the h2ostream Google Group: https://groups.google.com/forum/#!forum/h2ostream

Cliff

2 votes

The most current answer to this question is that you should probably just use the h2o.grid() function rather than writing a loop.
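
For reference, here is a hedged sketch of a grid search with the 3.x API (the hyperparameter values are illustrative; note that h2o.grid() replaces a loop over model settings, not the per-week train/test split itself):

grid <- h2o.grid(algorithm      = "gbm",
                 x              = xVars,
                 y              = "click_target2",
                 training_frame = train1.hex,
                 hyper_params   = list(max_depth  = c(5, 10),
                                       learn_rate = c(0.01, 0.1)))
# Retrieve the grid's models, sorted by AUC.
sorted_grid <- h2o.getGrid(grid@grid_id, sort_by = "auc", decreasing = TRUE)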

0 votes

With the new H2O version (currently 3.24.0.3), they suggest the following recommendations:

my for loop {
 # perform loop

 rm(R object that isn't needed anymore)
 rm(R object of h2o thing that isn't needed anymore)

 # trigger removal of h2o back-end objects that got rm'd above, since the rm can be lazy.
 gc()
 # optional extra one to be paranoid.  this is usually very fast.
 gc()

 # optionally sanity check that you see only what you expect to see here, and not more.
 h2o.ls()

 # tell back-end cluster nodes to do three back-to-back JVM full GCs.
 h2o:::.h2o.garbageCollect()
 h2o:::.h2o.garbageCollect()
 h2o:::.h2o.garbageCollect()
}

Here is the source: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html