
Is it at all possible to use the lm() function with a matrix? Or maybe, the correct question is: "Is it possible to dynamically create formulas in R?"

I am writing a function whose output is a matrix. The number of columns in the matrix is not fixed: it depends on the user's input. I want to fit an OLS model to the data in the matrix, where the first column is the dependent variable and the other columns are the independent variables.

Using the lm function requires a formula, which presupposes knowing the number of explanatory variables in advance, and I do not!

Is there any solution other than estimating the equation manually with the OLS formula?

Reproducible example:

> # When user 1 uses the function, he obtains m1
> m1 <- replicate(5, rnorm(50))
> colnames(m1) <- c("dep", paste0("ind", 1:(ncol(m1)-1)))
> head(m1)
            dep       ind1        ind2       ind3       ind4
[1,]  0.5848705  0.3602760 -0.95493403 -1.7278030 -0.1914170
[2,]  1.7167604 -0.1035825  0.31026183 -1.5071415 -1.2748600
[3,] -0.1326187 -0.5669026  0.01819749  0.8346880 -0.6304498
[4,] -0.7381232  0.4612792 -0.36132404 -0.1183131 -0.7446985
[5,]  0.9919123 -1.3228248 -0.44728270  0.6571244 -0.4895385
[6,] -0.8010111  0.8307584 -0.16106804  0.3069870 -0.3834583
> 
> # When user 2 uses the function, he obtains m2
> m2 <- replicate(6, rnorm(50))
> colnames(m2) <- c("dep", paste0("ind", 1:(ncol(m2)-1)))
> head(m2)
            dep       ind1       ind2         ind3       ind4       ind5
[1,]  1.2936031 -0.8060085  0.5020699 -1.699123234  1.0205626  1.0787888
[2,]  1.2357370  0.5973699 -1.2134283 -0.928040354 -0.3037920 -0.1251678
[3,]  0.5292583  0.1063213 -1.3036526  0.395886937 -0.1280863  1.1423532
[4,]  0.9234484 -0.4505604  1.2796922  0.424705893 -0.5547274 -0.3794037
[5,] -0.8016376  1.1362677 -1.1935238 -0.004460092 -1.4449704 -0.3739311
[6,]  0.4385867  0.5671138  0.4493617 -2.277925642 -0.8626944 -0.6880523

User 1 will estimate the linear model with:

lm(dep ~ ind1 + ind2 + ind3 + ind4, data = as.data.frame(m1))  # lm() needs a data frame, not a matrix

Meanwhile user 2 has an extra independent variable and will estimate the linear model in the following way:

lm(dep ~ ind1 + ind2 + ind3 + ind4 + ind5, data = as.data.frame(m2))

Once again, is there any way I can create the formula dynamically?

lm(dep ~ ., data = as.data.frame(m1)) - Khashaa
dep ~ . is bad style because it picks up any extra or derived columns you create, possibly causing data leakage. - smci
Thank you for the link. The solution is in the reformulate function. - SavedByJESUS
You don't need that if you just want to use a matrix slice: lm(m1[,'dep'] ~ m1[,2:5]) - smci
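
For anyone landing here later, the reformulate approach mentioned in the comments could look roughly like the sketch below. It assumes the first column is always the response and every other column is a regressor, as in the question. The helper name fit_from_matrix is made up for illustration and is not part of base R. The matrix is converted with as.data.frame() because lm()'s data argument does not accept a bare matrix.

```r
# Hypothetical helper (not a base R function): builds the formula from the
# column names and fits the model.
fit_from_matrix <- function(m) {
  df <- as.data.frame(m)          # lm()'s data argument needs a data frame
  response <- colnames(m)[1]      # assumption: first column is the dependent variable
  predictors <- colnames(m)[-1]   # assumption: all remaining columns are regressors
  f <- reformulate(predictors, response = response)  # e.g. dep ~ ind1 + ind2 + ...
  lm(f, data = df)
}

set.seed(1)
m1 <- replicate(5, rnorm(50))
colnames(m1) <- c("dep", paste0("ind", 1:(ncol(m1) - 1)))

fit1 <- fit_from_matrix(m1)
formula(fit1)   # the formula that was generated from the column names
```

The same call works unchanged for m2 or any other column count, since the formula is rebuilt from whatever column names the matrix has.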

1 Answer


Yes. In fact, the formula interface gets slower as the number of columns grows, so the matrix interface is preferable for wide data.

Is there any way I can create the formula dynamically?

Sure, you look up the matrix columns either directly by a vector of column indices, or indirectly by converting column names into column indices, e.g. with grep(pattern, colnames(mat)). Note that names(mat) returns NULL for a matrix, so use colnames().

But in your case, you don't need to bother with grep since you already have a straightforward column-naming scheme: in m1, column 1 is dep and ind1...ind4 correspond to column indices 2..5.

lm(m1[,'dep'] ~ m1[,2:5])

# or in general
lm(m1[,'dep'] ~ m1[,colIndicesVector])  # e.g. colIndicesVector <- c(2,4,5); column 1 is dep, so leave it out
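
To make the matrix approach above concrete, here is a minimal sketch using the question's naming scheme. The variable names y, X, fit2 and raw are my own, and the "^ind" pattern assumes every regressor column starts with "ind". The last lines use lm.fit(), the lower-level fitting routine in stats, which takes the design matrix directly and needs the intercept column added by hand.

```r
set.seed(1)
m2 <- replicate(6, rnorm(50))
colnames(m2) <- c("dep", paste0("ind", 1:(ncol(m2) - 1)))

y <- m2[, "dep"]
# Select regressors by name; drop = FALSE keeps a matrix even if only one column matches
X <- m2[, grep("^ind", colnames(m2)), drop = FALSE]

fit2 <- lm(y ~ X)
names(coef(fit2))   # coefficient names are "X" pasted onto the column names

# Lower-level alternative: no formula at all, intercept column added manually
raw <- lm.fit(x = cbind(Intercept = 1, X), y = y)
raw$coefficients
```

One trade-off to be aware of: with a matrix on the right-hand side, predict() on new data is more awkward than with a formula built from column names, so this is mostly convenient when you only need the fit itself.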