3 votes

I'm trying to run a separate logistic regression for each of ~400k predictor variables, and I would like to capture the output of each run as a row of an output table.

My data is organised in two parts. I have a 400000 x 189 double matrix (mydatamatrix) that contains the observations for each of my 400000 predictor variables (P1), measured in 189 individuals. I also have a second 189 x 20 data frame (mydataframe) containing the outcome variable (O1), another predictor variable (P2), and 18 other variables not used in this particular analysis.

My regression model is O1 ~ P1 + P2, where O1 is binary.
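
For reference, here is a rough sketch of simulated data matching the layout described above (the values are made up, and I use far fewer predictors just to keep the example light):

set.seed(1)
n_pred <- 1000   # 400000 in the real data; smaller here to keep the sketch quick
n_obs  <- 189

# predictors in rows, individuals in columns, as described above
mydatamatrix <- matrix(rnorm(n_pred * n_obs), nrow = n_pred, ncol = n_obs)
row.names(mydatamatrix) <- paste0("P1_", seq_len(n_pred))

# outcome O1 (binary) and covariate P2; the other 18 columns are not needed here
mydataframe <- data.frame(O1 = rbinom(n_obs, 1, 0.5),
                          P2 = sample(c("a", "b", "c"), n_obs, replace = TRUE))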

I got the following loop to work:

Create an output table for the results:

output <- data.frame(matrix(nrow = 400000, ncol = 4))
names(output) <- c("Estimate", "Std. Error", "z value", "Pr(>|z|)")

Run the logistic regression loop over the predictors and store each result in the output table:

# one row of the matrix per predictor, so set the row names once, outside the loop
row.names(output) <- row.names(mydatamatrix)

for (i in 1:400000) {
  result <- glm(mydataframe$O1 ~ mydatamatrix[i, ] + as.factor(mydataframe$P2),
                family = binomial)
  # estimate, std. error, z value and p-value for the P1 term
  output[i, ] <- coef(summary(result))[2, 1:4]
}

However, the run time is huge (it took over an hour to output the first 20k tests). Is there a more efficient way to run this analysis?

My actual advice is likely to be a bit of a letdown: don't do this at all. Running 400k separate regressions, one per variable, is not likely to yield much useful information. At the very least, I would investigate some penalized regression methods for variable selection, like the lasso or LARS. – joran
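
For completeness, a minimal sketch of that lasso route with the glmnet package could look like the following (this assumes the predictors are stored one per row of mydatamatrix, as in the question, and simply dummy-codes P2 as an extra column; it is a starting point, not a full analysis):

library(glmnet)

# glmnet expects an observations-by-predictors matrix, so transpose
x <- cbind(t(mydatamatrix),
           model.matrix(~ as.factor(P2), data = mydataframe)[, -1])
y <- mydataframe$O1

# cross-validated lasso (alpha = 1) for a binomial outcome
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# coefficients that survive at the selected penalty
coef(cvfit, s = "lambda.min")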

1 Answer

1 vote

It will be faster if you use apply over the rows of mydatamatrix (one regression per predictor) instead of a for loop:

t(apply(mydatamatrix, 1,
        function(x)
          coef(summary(glm(mydataframe$O1 ~ x + as.factor(mydataframe$P2),
                           family = binomial)))[2, 1:4]))
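
Each fit is independent, so the same per-predictor call can also be spread over several cores. A minimal sketch with the base parallel package (the core count is just an example; mclapply only forks on Unix-like systems, so on Windows use parLapply instead):

library(parallel)

results <- mclapply(seq_len(nrow(mydatamatrix)), function(i) {
  fit <- glm(mydataframe$O1 ~ mydatamatrix[i, ] + as.factor(mydataframe$P2),
             family = binomial)
  coef(summary(fit))[2, 1:4]
}, mc.cores = 4)

output <- do.call(rbind, results)
rownames(output) <- row.names(mydatamatrix)
colnames(output) <- c("Estimate", "Std. Error", "z value", "Pr(>|z|)")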