
I am trying out logistic regression on a data.frame (11359 rows, 137 columns). The data.frame contains Y (one dependent variable) and the predictors (136 independent variables). All the variables are binary.

The formula I created based on the "my_data" data.frame is f = as.formula(paste('y ~', paste(colnames(my_data)[c(3:52, 54:133, 138:143)], collapse = '+'))). I applied glm, logistf, and pmlr as follows:

  • glm(f, family = binomial(link = "logit"), data = my_data)
  • logistf(f, my_data)
  • pmlr(f, data = my_data, method = "likelihood", joint = TRUE)

The glm function estimates some parameters but gives a warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred. I figured out that this message was caused by a separation issue, so I tried the logistf and pmlr functions.

With logistf, I didn't get any results after 50 hours and no error, so I decided to terminate the process (CPU usage 23-27%, RAM usage approx. 1100 MB during the first 10 hours, then 2-3 MB).

For pmlr, I got this error: cannot allocate vector of size 28.9 Gb.

I tried logistf and pmlr with only 10 of the 137 variables to check whether the number of predictors was the problem, and I got the same behavior: logistf ran "forever", and pmlr gave the same type of error with a different, even larger vector size (approx. 45 Gb, if I recall correctly).

Should I upgrade my laptop's RAM to perform this calculation, find some other functions (if there are other packages for penalized logistic regression), or is it a different kind of problem, e.g. too many variables?

Windows 10 x64, Processor: i3 2.4 GHz, RAM: 8.00 GB, R version: x64 3.4.0, RStudio: 1.0.143.

There is a limit to the maximum size of a vector in R, which is 2^31 - 1. Maybe your data is exceeding this limit. Whatever your machine is, this is the limit of a vector in R. If your problem is associated with this, the only way is to work around it with some other algorithm, breaking down your data and tolerating some loss of model accuracy. – Kalees Waran
What are you going to do with all the predictors? As an alternative, maybe run it through a lasso regression and see which predictors fall out; see ?glmnet. – user20650
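Following up on the glmnet comment above, here is a minimal sketch of a lasso-penalized logistic fit on the same columns the formula uses. The column indices and object names are taken from the question; whether they match your actual data.frame is an assumption.

```r
library(glmnet)  # install.packages("glmnet") if needed

# Illustrative: build the predictor matrix from the same columns as the formula.
x <- as.matrix(my_data[, c(3:52, 54:133, 138:143)])
y <- my_data$y

# Cross-validated lasso (alpha = 1) logistic regression.
# The L1 penalty shrinks coefficients, which also mitigates separation.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the 1-standard-error lambda; zeros are dropped predictors.
coef(cv_fit, s = "lambda.1se")
```

This is a sketch, not a drop-in replacement for the penalized-likelihood approach of logistf, but it scales far better with many predictors.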

1 Answer


https://cran.r-project.org/web/packages/biglm/biglm.pdf and https://www.rdocumentation.org/packages/biglm/versions/0.9-1/topics/biglm

biglm creates a linear model object that uses only p^2 memory for p variables. It can be updated with more data using update. This allows linear regression on data sets larger than memory.
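As a sketch of that chunked-update workflow (the chunk boundaries here are arbitrary and the objects f and my_data come from the question):

```r
library(biglm)

# Fit on a first chunk, then fold in later chunks with update();
# memory stays on the order of p^2 regardless of total row count.
fit <- biglm(f, data = my_data[1:4000, ])
fit <- update(fit, my_data[4001:8000, ])
fit <- update(fit, my_data[8001:nrow(my_data), ])
summary(fit)
```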

bigglm creates a generalized linear model object that uses only p^2 memory for p variables.

bigglm Usage

bigglm(formula, data, family = gaussian(), ...)

## S3 method for class 'data.frame'
bigglm(formula, data, ..., chunksize = 5000)

## S3 method for class 'function'
bigglm(formula, data, family = gaussian(),
       weights = NULL, sandwich = FALSE, maxit = 8, tolerance = 1e-7,
       start = NULL, quiet = FALSE, ...)

## S3 method for class 'RODBC'
bigglm(formula, data, family = gaussian(),
       tablename, ..., chunksize = 5000)

## S4 method for signature 'ANY,DBIConnection'
bigglm(formula, data, family = gaussian(),
       tablename, ..., chunksize = 5000)
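Applied to the question's setup, a logistic fit could look like the sketch below. The chunksize value is a guess that may need tuning for 8 GB of RAM, and note that bigglm will not fix the separation problem itself, only the memory one.

```r
library(biglm)

# Logistic regression on the full data.frame, processed in chunks
# so that memory use depends on the number of predictors, not rows.
fit <- bigglm(f, data = my_data, family = binomial(link = "logit"),
              chunksize = 2000, maxit = 20)
summary(fit)
```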