I'm running out of memory on a normal 8GB server working with a fairly small dataset in a machine learning context:
> dim(basetrainf) # this is a data frame
[1] 58168   118
The only pre-modeling step that significantly increases memory consumption is converting the data frame to a model matrix, since caret, cor, etc. only work with (model) matrices. Even after removing factors with many levels, the resulting matrix (mergem below) is fairly large. (sparse.model.matrix/Matrix is poorly supported in general, so I can't use that.)
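To make the blow-up concrete, here is a toy sketch (not my actual data) of what dummy-coding a factor does to memory:

library(Matrix)
# Dummy-coding turns a compact data frame (numeric column + integer-coded factor)
# into a dense double matrix, so size grows with the total number of factor levels.
df = data.frame(x = rnorm(1e4),
                f = factor(sample(letters, 1e4, replace = TRUE)))
print(object.size(df), units = 'Mb')                           # ~0.1 Mb
print(object.size(model.matrix(~ ., df)), units = 'Mb')        # ~2 Mb: 1e4 x 27 dense matrix
print(object.size(sparse.model.matrix(~ ., df)), units = 'Mb') # much smaller, but most
                                                               # model functions won't take it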
> lsos()
                 Type      Size PrettySize   Rows Columns
mergem         matrix 879205616   838.5 Mb 115562     943
trainf     data.frame  80613120    76.9 Mb 106944     119
inttrainf      matrix  76642176    73.1 Mb    907   10387
mergef     data.frame  58264784    55.6 Mb 115562      75
dfbase     data.frame  48031936    45.8 Mb  54555     115
basetrainf data.frame  40369328    38.5 Mb  58168     118
df2        data.frame  34276128    32.7 Mb  54555     103
tf         data.frame  33182272    31.6 Mb  54555      98
m.gbm           train  20417696    19.5 Mb     16      NA
res.glmnet       list  14263256    13.6 Mb      4      NA
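(lsos() isn't base R, by the way; it's the "improved list of objects" helper from a well-known SO answer on memory management. A crude stand-in that produces the same kind of listing:)

# List the largest objects in the workspace, biggest first (sizes in bytes).
big.objects = function(n = 10) {
  objs = ls(envir = .GlobalEnv)
  sizes = vapply(objs,
                 function(o) as.numeric(object.size(get(o, envir = .GlobalEnv))),
                 numeric(1))
  sort(sizes, decreasing = TRUE)[seq_len(min(n, length(sizes)))]
}
big.objects()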
Also, since many R models don't support example weights, I had to oversample the minority class first, which doubled the size of my dataset (this is why trainf, mergef, and mergem have twice as many rows as basetrainf).
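The oversampling itself is nothing fancy; roughly this (a sketch of the idea, not my exact code; response is 'class' as in the script below):

# Duplicate minority-class rows (sampling with replacement) until the classes
# balance out, which roughly doubles the training data.
tab = table(basetrainf[[response]])
minority = names(which.min(tab))
extra = basetrainf[sample(which(basetrainf[[response]] == minority),
                          max(tab) - min(tab), replace = TRUE), ]
trainf = rbind(basetrainf, extra)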
R is at this point using 1.7GB of memory, bringing my total memory usage up to 4.3GB out of 7.7GB.
The next thing I do is:
> m = train(mergem[mergef$istrain,], mergef[mergef$istrain,response], method='rf')
Bam - in a few seconds, the Linux out-of-memory killer kills rsession.
I can sample my data, undersample instead of oversample, etc., but these are non-ideal. What (else) should I do (differently), short of rewriting caret and the various model packages I intend to use?
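(To be concrete, by "sample" and "undersample" I mean things like the following sketch; 30000 is an arbitrary number:)

# (a) train on a random subsample of the training rows
keep = sample(which(mergef$istrain), 30000)
m = train(mergem[keep, ], mergef[keep, response], method = 'rf')
# (b) undersample the majority class instead of oversampling the minority,
#     keeping the merged data at ~58k rows instead of ~116k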
FWIW, I've never run into this problem with other ML software (Weka, Orange, etc.), even without pruning any of my factors, perhaps because they offer both example weighting and native "data frame" support across all models.
Complete script follows:
library(caret)
library(Matrix)
library(doMC)
registerDoMC(2)

response = 'class'

repr = 'dummy'
do.impute = F

xmode = function(xs) names(which.max(table(xs)))

read.orng = function(path) {
  # read header
  hdr = strsplit(readLines(path, n=1), '\t')
  pairs = sapply(hdr, function(field) strsplit(field, '#'))
  names = sapply(pairs, function(pair) pair[2])
  classes = sapply(pairs, function(pair)
    if (grepl('C', pair[1])) 'numeric' else 'factor')

  # read data
  dfbase = read.table(path, header=T, sep='\t', quote='', col.names=names,
                      na.strings='?', colClasses=classes, comment.char='')

  # switch response, remove meta columns
  df = dfbase[sapply(pairs, function(pair)
    !grepl('m', pair[1]) && pair[2] != 'class' || pair[2] == response)]

  df
}

train.and.test = function(x, y, trains, method) {
  m = train(x[trains,], y[trains,], method=method)
  ps = extractPrediction(list(m), testX=x[!trains,], testY=y[!trains,])
  perf = postResample(ps$pred, ps$obs)
  list(m=m, ps=ps, perf=perf)
}

# From <...>
sparse.cor = function(x){
  memory.limit(size=10000)
  n <- nrow(x)
  # ... (rest of this function omitted) ...
}

# ... (data loading, merging, oversampling, and imputation steps omitted) ...

print('remove factors with > 200 levels')
badfactors = sapply(mergef, function(x)
  is.factor(x) && (nlevels(x) > 200))
mergef = mergef[, -which(badfactors)]

print('remove near-zero variance predictors')
mergef = mergef[, -nearZeroVar(mergef)]

print('create model matrix, making everything numeric')
if (repr == 'dummy') {
  dummies = dummyVars(as.formula(paste(response, '~ .')), mergef)
  mergem = predict(dummies, newdata=mergef)
} else {
  mat = if (repr == 'sparse') sparse.model.matrix else model.matrix
  mergem = mat(as.formula(paste(response, '~ .')), data=mergef)
  # remove intercept column
  mergem = mergem[, -1]
}

print('remove high-correlation predictors')
merge.cor = (if (repr == 'sparse') sparse.cor else cor)(mergem)
mergem = mergem[, -findCorrelation(merge.cor, cutoff=.75)]

print('try a couple of different methods')
do.method = function(method) {
  train.and.test(mergem, mergef[response], mergef$istrain, method)
}

res.gbm = do.method('gbm')
res.glmnet = do.method('glmnet')
res.rf = do.method('parRF')
I was able to run caret on a 350,000 x 30 dataframe fairly quickly. This was killing my 8GB quad-core MacBook Pro when running in parallel (each core was using too much memory), but yesterday I found out that it runs very fast on Amazon's High-Memory Double Extra Large Instance (aws.amazon.com/ec2/instance-types), at about $0.42/hr as a spot instance. – lockedoff
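(Following up on that comment: doMC forks its workers, so memory starts out shared copy-on-write, but each worker duplicates whatever pages it touches; a crude way to trade speed for memory is simply to use fewer workers. A sketch:)

# Fewer forked workers means fewer per-worker copies of the training data.
library(doMC)
registerDoMC(1)            # single worker
# or disable parallel backends entirely:
foreach::registerDoSEQ()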