I'm trying to use H2O via R to build multiple models using subsets of one large-ish data set (~10 GB). The data is one year's worth of records, and I'm trying to build 51 models (i.e., train on week 1, predict on week 2, and so on), with each week being about 1.5-2.5 million rows with 8 variables.
I've done this inside a loop, which I know is not always the best way in R. One other issue I found was that the H2O entity would accumulate prior objects, so I wrote a function to remove all of them except the main data set.
h2o.clean <- function(clust = localH2O, verbose = TRUE, vte = c()){
  # Find all objects on the server
  keysToKill <- h2o.ls(clust)$Key
  # Remove items to be excluded, if any
  keysToKill <- setdiff(keysToKill, vte)
  # Loop through and remove the remaining items
  for(i in keysToKill){
    h2o.rm(object = clust, keys = i)
    if(verbose){
      print(i); flush.console()
    }
  }
  # Print remaining objects in the cluster
  h2o.ls(clust)
}
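Between iterations I also print what is sitting on the server, to see what accumulates. A quick sketch of that check (assuming h2o.ls() returns a Key and a Bytesize column, which it does on my install):

srv <- h2o.ls(localH2O)             # data frame of keys currently held on the server
print(srv)
print(sum(srv$Bytesize) / 1024^3)   # rough total GB held server-side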
The script runs fine for a while and then crashes, often with a complaint about running out of memory and swapping to disk.
Here's some pseudo-code describing the process:
# load h2o library
library(h2o)
# create h2o entity
localH2O = h2o.init(nthreads = 4, max_mem_size = "6g")
# load data
dat1.hex = h2o.importFile(localH2O, inFile, key = "dat1.hex")
# Start loop
for(i in 1:51){
  # create test/train hex objects
  train1.hex <- dat1.hex[dat1.hex$week_num == i, ]
  test1.hex  <- dat1.hex[dat1.hex$week_num == i + 1, ]
  # train gbm
  dat1.gbm <- h2o.gbm(y = 'click_target2', x = xVars, data = train1.hex
                      , nfolds = 3
                      , importance = T
                      , distribution = 'bernoulli'
                      , n.trees = 100
                      , interaction.depth = 10
                      , shrinkage = 0.01
                      )
  # calculate out-of-sample performance
  test2.hex <- cbind.H2OParsedData(test1.hex, h2o.predict(dat1.gbm, test1.hex))
  colnames(test2.hex) <- names(head(test2.hex))
  gbmAuc <- h2o.performance(test2.hex$X1, test2.hex$click_target2)@model$auc
  # clean h2o entity
  h2o.clean(clust = localH2O, verbose = F, vte = c('dat1.hex'))
} # end loop
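One variation I've tried (I'm not sure whether it actually helps) is to drop the R-side references to that iteration's objects and force a garbage collection just before the h2o.clean() call inside the loop, so nothing in the R session still points at the keys being removed. A sketch, using only base R plus the function above:

# drop the R handles for this iteration's objects, then collect
rm(train1.hex, test1.hex, test2.hex, dat1.gbm)
gc()
# remove everything on the server except the raw data
h2o.clean(clust = localH2O, verbose = FALSE, vte = c('dat1.hex'))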
My question is: what, if any, is the correct way to manage data and memory in a standalone instance for this type of process? (This is NOT running on Hadoop or a multi-node cluster, just a single large EC2 instance with ~64 GB of RAM and 12 CPUs.) Should I be killing and recreating the H2O entity after each loop iteration (this was the original process, but reading the data from file every time adds ~10 minutes per iteration)? Is there a proper way to garbage collect or release memory after each iteration?
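For reference, the restart approach I mentioned looked roughly like this (the re-import is what adds the ~10 minutes each time; I believe h2o.shutdown() accepts the client object and a prompt argument, but treat the exact call as approximate):

# tear down the instance, start a fresh one, and re-import the data
h2o.shutdown(localH2O, prompt = FALSE)
localH2O <- h2o.init(nthreads = 4, max_mem_size = "6g")
dat1.hex <- h2o.importFile(localH2O, inFile, key = "dat1.hex")   # ~10 minutes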
Any suggestions would be appreciated.