I'm running out of memory on a normal 8GB server working with a fairly small dataset in a machine learning context:
> dim(basetrainf) # this is a data frame
[1] 58168   118
The only pre-modeling step that significantly increases memory consumption is converting the data frame to a model matrix, since caret, cor, etc. only work with (model) matrices. Even after removing factors with many levels, the resulting matrix (mergem below) is fairly large. (sparse.model.matrix/Matrix is poorly supported in general, so I can't use that.)
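To make the blow-up concrete, here is a toy sketch (not my actual data) of what dummy-coding a factor does to memory:

library(Matrix)
# Dummy-coding turns a compact data frame (numeric column + integer-coded factor)
# into a dense double matrix, so size grows with the total number of factor levels.
df = data.frame(x = rnorm(1e4),
                f = factor(sample(letters, 1e4, replace = TRUE)))
print(object.size(df), units = 'Mb')                           # ~0.1 Mb
print(object.size(model.matrix(~ ., df)), units = 'Mb')        # ~2 Mb: 1e4 x 27 dense matrix
print(object.size(sparse.model.matrix(~ ., df)), units = 'Mb') # much smaller, but most
                                                               # model functions won't take it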
> lsos()
                 Type      Size PrettySize   Rows Columns
mergem         matrix 879205616   838.5 Mb 115562     943
trainf     data.frame  80613120    76.9 Mb 106944     119
inttrainf      matrix  76642176    73.1 Mb    907   10387
mergef     data.frame  58264784    55.6 Mb 115562      75
dfbase     data.frame  48031936    45.8 Mb  54555     115
basetrainf data.frame  40369328    38.5 Mb  58168     118
df2        data.frame  34276128    32.7 Mb  54555     103
tf         data.frame  33182272    31.6 Mb  54555      98
m.gbm           train  20417696    19.5 Mb     16      NA
res.glmnet       list  14263256    13.6 Mb      4      NA
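(lsos() isn't base R, by the way; it's the "improved list of objects" helper from a well-known SO answer on memory management. A crude stand-in that produces the same kind of listing:)

# List the largest objects in the workspace, biggest first (sizes in bytes).
big.objects = function(n = 10) {
  objs = ls(envir = .GlobalEnv)
  sizes = vapply(objs,
                 function(o) as.numeric(object.size(get(o, envir = .GlobalEnv))),
                 numeric(1))
  sort(sizes, decreasing = TRUE)[seq_len(min(n, length(sizes)))]
}
big.objects()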
Also, since many R models don't support example weights, I had to oversample the minority class first, which doubled the size of my dataset (this is why trainf, mergef, and mergem have twice as many rows as basetrainf).
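The oversampling itself is nothing fancy; roughly this (a sketch of the idea, not my exact code; response is 'class' as in the script below):

# Duplicate minority-class rows (sampling with replacement) until the classes
# balance out, which roughly doubles the training data.
tab = table(basetrainf[[response]])
minority = names(which.min(tab))
extra = basetrainf[sample(which(basetrainf[[response]] == minority),
                          max(tab) - min(tab), replace = TRUE), ]
trainf = rbind(basetrainf, extra)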
R is at this point using 1.7GB of memory, bringing my total memory usage up to 4.3GB out of 7.7GB.
The next thing I do is:
> m = train(mergem[mergef$istrain,], mergef[mergef$istrain,response], method='rf')
Bam - in a few seconds, the Linux out-of-memory killer kills rsession.
I can sample my data, undersample instead of oversample, etc., but these are non-ideal. What (else) should I do (differently), short of rewriting caret and the various model packages I intend to use?
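(To be concrete, by "sample" and "undersample" I mean things like the following sketch; 30000 is an arbitrary number:)

# (a) train on a random subsample of the training rows
keep = sample(which(mergef$istrain), 30000)
m = train(mergem[keep, ], mergef[keep, response], method = 'rf')
# (b) undersample the majority class instead of oversampling the minority,
#     keeping the merged data at ~58k rows instead of ~116k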
FWIW, I've never run into this problem with other ML software (Weka, Orange, etc.), even without pruning any of my factors, perhaps because they offer both example weighting and native "data frame" support across all models.
Complete script follows:
library(caret)
library(Matrix)
library(doMC)
registerDoMC(2)

response = 'class'

repr = 'dummy'
do.impute = F

xmode = function(xs) names(which.max(table(xs)))

read.orng = function(path) {
  # read header
  hdr = strsplit(readLines(path, n=1), '\t')
  pairs = sapply(hdr, function(field) strsplit(field, '#'))
  names = sapply(pairs, function(pair) pair[2])
  classes = sapply(pairs, function(pair)
    if (grepl('C', pair[1])) 'numeric' else 'factor')

  # read data
  dfbase = read.table(path, header=T, sep='\t', quote='', col.names=names,
                      na.strings='?', colClasses=classes, comment.char='')

  # switch response, remove meta columns
  df = dfbase[sapply(pairs, function(pair)
    !grepl('m', pair[1]) && pair[2] != 'class' || pair[2] == response)]

  df
}

train.and.test = function(x, y, trains, method) {
  m = train(x[trains,], y[trains,], method=method)
  ps = extractPrediction(list(m), testX=x[!trains,], testY=y[!trains,])
  perf = postResample(ps$pred, ps$obs)
  list(m=m, ps=ps, perf=perf)
}

# From <...>
sparse.cor = function(x){
  memory.limit(size=10000)
  n <- nrow(x)
  # ... (rest of this function omitted) ...
}

# ... (data loading, merging, oversampling, and imputation steps omitted) ...

print('remove factors with > 200 levels')
badfactors = sapply(mergef, function(x)
  is.factor(x) && (nlevels(x) > 200))
mergef = mergef[, -which(badfactors)]

print('remove near-zero variance predictors')
mergef = mergef[, -nearZeroVar(mergef)]

print('create model matrix, making everything numeric')
if (repr == 'dummy') {
  dummies = dummyVars(as.formula(paste(response, '~ .')), mergef)
  mergem = predict(dummies, newdata=mergef)
} else {
  mat = if (repr == 'sparse') sparse.model.matrix else model.matrix
  mergem = mat(as.formula(paste(response, '~ .')), data=mergef)
  # remove intercept column
  mergem = mergem[, -1]
}

print('remove high-correlation predictors')
merge.cor = (if (repr == 'sparse') sparse.cor else cor)(mergem)
mergem = mergem[, -findCorrelation(merge.cor, cutoff=.75)]

print('try a couple of different methods')
do.method = function(method) {
  train.and.test(mergem, mergef[response], mergef$istrain, method)
}

res.gbm = do.method('gbm')
res.glmnet = do.method('glmnet')
res.rf = do.method('parRF')
I was able to run caret on a 350,000 x 30 dataframe fairly quickly. This was killing my 8GB quad-core MacBook Pro when running in parallel (each core was using too much memory), but yesterday I found out that it runs very fast on Amazon's High-Memory Double Extra Large Instance (aws.amazon.com/ec2/instance-types), at about $0.42/hr as a spot instance. – lockedoff
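(Following up on that comment: doMC forks its workers, so memory starts out shared copy-on-write, but each worker duplicates whatever pages it touches; a crude way to trade speed for memory is simply to use fewer workers. A sketch:)

# Fewer forked workers means fewer per-worker copies of the training data.
library(doMC)
registerDoMC(1)            # single worker
# or disable parallel backends entirely:
foreach::registerDoSEQ()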