I have a set of linear regressions (let's assume 100) that I need to run in R with a variable number of regressors. Some of the regressors are common to all 100 regression models, but others vary and depend on the specific dependent variable. As an example, here are three such models:

Y1 ~ x1 + x2 + x3 + z1 + z2

Y2 ~ x1 + x2 + x3 + z4 + z5 + z6

Y3 ~ x1 + x2 + x3 + z3

As you can see, the dependent variables (Y1, Y2, and Y3) are all different. Of the independent variables (i.e. regressors), three are common to all regression models (x1, x2, and x3), while the rest are specific to a dependent variable (z1, z2, z3, z4, z5, z6).

If I were to store all the variables (dependent and independent) in one data frame, with rows corresponding to samples and columns corresponding to variables, is there an easy way to run all the regressions in some sort of loop, without having to write each regression separately? Subsequently, I want to extract the residuals and store them in a new data frame.

Generally, the idiom is to assemble the formulas as strings, then lapply over them, calling as.formula on each and passing it to glm in the anonymous function. The benefit of lapply is that you end up with a nice list of models (make it a list column of a data frame, if you like) which is easy to further iterate over, e.g. with coef or broom::tidy. – alistaire
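
The idiom described in this comment could look roughly like the sketch below. It uses mtcars as stand-in data; the column choices and the names common, specific, formulas, models, and coefs are illustrative assumptions, and lm is used in place of glm because the question is about linear regression.

```r
# Stand-in data: mtcars ships with base R (datasets package).
data("mtcars")

# Regressors shared by every model (chosen only for illustration).
common <- c("cyl", "disp")

# Regressors specific to each dependent variable, keyed by its name
# (also chosen only for illustration).
specific <- list(
  mpg  = c("hp", "wt"),
  qsec = c("drat"),
  hp   = c("wt", "gear", "carb")
)

# Build one formula string per dependent variable.
formulas <- vapply(names(specific), function(y) {
  paste(y, "~", paste(c(common, specific[[y]]), collapse = " + "))
}, character(1))

# Fit each model; the result is a list of lm objects named by response.
models <- lapply(formulas, function(f) lm(as.formula(f), data = mtcars))

# The list is easy to iterate over further, e.g. to collect coefficients.
coefs <- lapply(models, coef)
```

Keeping the specification as a named list means that adding a 101st model is a matter of adding one entry, not another lm call.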

1 Answer

Here is one approach using the dataset mtcars as an example.

In the formula, ~ . means that all variables in the data subset are used as regressors (the subset is specified in each lapply iteration).

data("mtcars")

# variables: list of vectors with the column indices of the dependent
# variable and its regressors.
# Note that the dependent variable has to be the first element of each vector here.

variables <- list(c(1,2,3,4), c(1,3,5), c(1,4,5,6))
model_list <- lapply(seq_along(variables), function(x) {
              # drop = FALSE keeps a data frame even with a single regressor
              lm(mtcars[, variables[[x]][1]] ~ .,
              data = mtcars[, variables[[x]][-1], drop = FALSE])})

You can now obtain the residuals with

lapply(model_list, residuals)
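
Since the question also asks for the residuals in a new data frame, one possible way is sketched here. It refits the models with the same column indices so that the block runs on its own; the name residual_df and the resid_model column prefix are illustrative assumptions. The sketch relies on every model being fitted on the same rows with no missing values, so that all residual vectors have the same length.

```r
data("mtcars")

# Same specification as above: dependent variable first, then regressors.
variables <- list(c(1, 2, 3, 4), c(1, 3, 5), c(1, 4, 5, 6))

model_list <- lapply(variables, function(v) {
  lm(mtcars[, v[1]] ~ ., data = mtcars[, v[-1], drop = FALSE])
})

# sapply returns a matrix with one column of residuals per model;
# convert it to a data frame with one row per sample.
residual_df <- as.data.frame(sapply(model_list, residuals))
names(residual_df) <- paste0("resid_model", seq_along(model_list))
```

If some models drop rows because of missing values, the residual vectors differ in length and sapply no longer returns a matrix; in that case the residuals would have to be merged by row name instead.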