R H2O - Memory management

Question

I'm trying to use H2O via R to build multiple models using subsets of one large-ish data set (~ 10GB). The data is one years worth of data and I'm trying to build 51 models (ie train on week 1, predict on week 2, etc.) with each week being about 1.5-2.5 million rows with 8 variables.

I've done this inside of a loop which I know is not always the best way in R. One other issue I found was that the H2O entity would accumulate prior objects, so I created a function to remove all of them except the main data set.

h2o.clean <- function(clust = localH2O, verbose = TRUE, vte = c()){
  # Find all objects on server
  keysToKill <- h2o.ls(clust)$Key
  # Remove items to be excluded, if any
  keysToKill <- setdiff(keysToKill, vte)
  # Loop thru and remove items to be removed
  for(i in keysToKill){
    h2o.rm(object = clust, keys = i)

    if(verbose == TRUE){
      print(i);flush.console()

    }    
  }
  # Print remaining objects in cluster.
  h2o.ls(clust)
}

The script runs fine for a while and then crashes - often with a complaint about running out of memory and swapping to disk.

Here's some pseudo code to describe the process

# load h2o library
library(h2o)
# create h2o entity
localH2O = h2o.init(nthreads = 4, max_mem_size = "6g")
# load data
dat1.hex = h2o.importFile(localH2O, inFile, key = "dat1.hex")

# Start loop
for(i in 1:51){
# create test/train hex objects
train1.hex <- dat1.hex[dat1.hex$week_num == i,]
test1.hex <- dat1.hex[dat1.hex$week_num == i + 1,]
# train gbm
dat1.gbm <- h2o.gbm(y = 'click_target2', x = xVars, data = train1.hex
                      , nfolds = 3
                      , importance = T
                      , distribution = 'bernoulli' 
                      , n.trees = 100
                      , interaction.depth = 10,
                      , shrinkage = 0.01
  )
# calculate out of sample performance
test2.hex <- cbind.H2OParsedData(test1.hex,h2o.predict(dat1.gbm, test1.hex))
colnames(test2.hex) <- names(head(test2.hex))
gbmAuc <- h2o.performance(test2.hex$X1, test2.hex$click_target2)@model$auc

# clean h2o entity
h2o.clean(clust = localH2O, verbose = F, vte = c('dat1.hex'))

} # end loop

My question is what, if any, is the correct way to manage data and memory in a stand alone entity (this is NOT running on hadoop or a cluster - just a large EC2 instance (~ 64gb RAM + 12 CPUs)) for this type of process? Should I be killing and recreating the H2O entity after each loop (this was original process but reading data from file every time adds ~ 10 minutes per iteration)? Is there a proper way to garbage collect or release memory after each loop?

Any suggestions would be appreciated.

You can delete all what your want by key: ` h2o.rm(localH2O, "keyDataWhichIWantDelete")` — Fedorenko Kristina

TomKraljevic TomKraljevic · Accepted Answer · 2015-03-30T02:50:02

This answer is for the original H2O project (releases 2.x.y.z).

In the original H2O project, the H2O R package creates lots of temporary H2O objects in the H2O cluster DKV (Distributed Key/Value store) with a "Last.value" prefix.

These are visible both in the Store View from the Web UI and by calling h2o.ls() from R.

What I recommend doing is:

at the bottom of each loop iteration, use h2o.assign() to do a deep copy of anything you want to save to a known key name
use h2o.rm() to remove anything you don't want to keep, in particular the "Last.value" temps
call gc() explicitly in R somewhere in the loop

Here is a function which removes the Last.value temp objects for you. Pass in the H2O connection object as the argument:

removeLastValues <- function(conn) {
    df <- h2o.ls(conn)
    keys_to_remove <- grep("^Last\\.value\\.", perl=TRUE, x=df$Key, value=TRUE)
    unique_keys_to_remove = unique(keys_to_remove)
    if (length(unique_keys_to_remove) > 0) {
        h2o.rm(conn, unique_keys_to_remove)
    }
}

Here is a link to an R test in the H2O github repository that uses this technique and can run indefinitely without running out of memory:

https://github.com/h2oai/h2o/blob/master/R/tests/testdir_misc/runit_looping_slice_quantile.R

R H2O - Memory management

4 Answers