2
votes

I have a dataset of 5000 variables. One target and 4999 covariates. I want to estimate one glm per each target-variable combination (4999 models).

How can I do that without manually typing 4999 formulas for GLM ?

In R I would simply define a list of 4999 strings ("target ~ x1) , convert each string to a formula and use map to estimate multiple glm. Is there something similar that can be done in Julia ? Or is there an elegant alternative ?

Thanks in advance.

2

2 Answers

7
votes

You can programatically create the formula via Term objects. The docs for that can be found here, but consider the following simple example which should meet your needs:

Start with dummy data

julia> using DataFrames, GLM

julia> df = hcat(DataFrame(y = rand(10)), DataFrame(rand(10, 5)))
10×6 DataFrame
│ Row │ y         │ x1        │ x2       │ x3        │ x4         │ x5       │
│     │ Float64   │ Float64   │ Float64  │ Float64   │ Float64    │ Float64  │
├─────┼───────────┼───────────┼──────────┼───────────┼────────────┼──────────┤
│ 1   │ 0.0200963 │ 0.924856  │ 0.947904 │ 0.429068  │ 0.00833488 │ 0.547378 │
│ 2   │ 0.169498  │ 0.0915296 │ 0.375369 │ 0.0341015 │ 0.390461   │ 0.835634 │
│ 3   │ 0.900145  │ 0.502495  │ 0.38106  │ 0.47253   │ 0.637731   │ 0.814095 │
│ 4   │ 0.255163  │ 0.865253  │ 0.791909 │ 0.0833828 │ 0.741899   │ 0.961041 │
│ 5   │ 0.651996  │ 0.29538   │ 0.161443 │ 0.23427   │ 0.23132    │ 0.947486 │
│ 6   │ 0.305908  │ 0.170662  │ 0.569827 │ 0.178898  │ 0.314841   │ 0.237354 │
│ 7   │ 0.308431  │ 0.835606  │ 0.114943 │ 0.19743   │ 0.344216   │ 0.97108  │
│ 8   │ 0.344968  │ 0.452961  │ 0.595219 │ 0.313425  │ 0.102282   │ 0.456764 │
│ 9   │ 0.126244  │ 0.593456  │ 0.818383 │ 0.485622  │ 0.151394   │ 0.043125 │
│ 10  │ 0.60174   │ 0.8977    │ 0.643095 │ 0.0865611 │ 0.482014   │ 0.858999 │

Now when you run a linear model with GLM, you'd do something like lm(@formula(y ~ x1), df), which indeed can't easily be used in a loop to construct different formulas. We'll therefore follow the docs and create the output of the @formula macro directly - remember macros in Julia just transform syntax to other syntax, so they don't do anything we can't write ourselves!

julia> lm(Term(:y) ~ Term(:x1), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x1

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.428436    0.193671   2.21    0.0579  -0.0181696   0.875041
x1           -0.106603    0.304597  -0.35    0.7354  -0.809005    0.595799
──────────────────────────────────────────────────────────────────────────

You can verify for yourself that the above is equivalent to lm(@formula(y ~ x1), df).

Now it's hopefully an easy step to building the loop that you're looking for (restricted to two covariates below to limit the output):


julia> for x ∈ names(df[:, Not(:y)])[1:2]
           @show lm(term(:y) ~ term(x), df)
       end
lm(term(:y) ~ term(x), df) = StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x1

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.428436    0.193671   2.21    0.0579  -0.0181696   0.875041
x1           -0.106603    0.304597  -0.35    0.7354  -0.809005    0.595799
──────────────────────────────────────────────────────────────────────────
lm(Term(:y) ~ Term(x), df) = StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x2

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)   0.639633    0.176542   3.62    0.0068   0.232527    1.04674
x2           -0.502327    0.293693  -1.71    0.1256  -1.17958     0.17493
─────────────────────────────────────────────────────────────────────────

As Dave points out below, it's helpful to use the term() function here to create our terms rather than the Term() constructor directly - this is because names(df) returns a vector of Strings, while the Term() constructor expects Symbols. term() has a method for Strings that handles the conversion automatically.

3
votes

You can also use the low-level API and pass the dependent variable as a vector and the independent variable as a matrix directly without even building formulas. You will lose coefficient names, but since you have only one independent variable in each model it's probably OK.

This is documented in ?fit. The call for each model will look like glm([ones(length(x1)) x1], target, dist). The column full of ones is for the intercept.