I am trying to write a function called grouped_lm, which basically runs linear regression models for each level of the combination of grouping variables (grouping.vars) for multiple criterion/dependent (crit.vars) and predictor/independent variables (pred.vars).
This is done in such a way that first entry into crit.vars is regressed onto pred.vars.
For example, if I enter grouping.vars = am, crit.vars = c(mpg, drat), and crit.vars = c(wt, disp) (in the context of mtcars dataset), the function will run two regression models (mpg ~ wt and drat ~ disp) for each level of grouping variable am (am = 0 and am = 1).
I have managed to create a dataframe from entered variables, write a custom function that runs linear regression models, but can't seem to figure out how to use rlang to get entered variables into the list elements that will be entered into purrr::pmap.
Apologies for a lengthy question and thanks in advance for any help offered.
# libraries needed
library(tidyverse)
library(plyr)
# function definition
grouped_lm <- function(data,
grouping.vars,
crit.vars,
pred.vars) {
#================== preparing dataframe ==================
#
# check how many variables were entered for criterion variables vector
crit.vars <-
as.list(rlang::quo_squash(rlang::enquo(crit.vars)))
crit.vars <-
if (length(crit.vars) == 1) {
crit.vars
} else {
crit.vars[-1]
}
# check how many variables were entered for predictor variables vector
pred.vars <-
as.list(rlang::quo_squash(rlang::enquo(pred.vars)))
pred.vars <-
if (length(pred.vars) == 1) {
pred.vars
} else {
pred.vars[-1]
}
# check how many variables were entered for grouping variable vector
grouping.vars <-
as.list(rlang::quo_squash(rlang::enquo(grouping.vars)))
grouping.vars <-
if (length(grouping.vars) == 1) {
grouping.vars
} else {
grouping.vars[-1]
}
# getting the dataframe ready
df <- dplyr::select(.data = data,
!!!grouping.vars,
!!!crit.vars,
!!!pred.vars) %>%
dplyr::group_by(.data = ., !!!grouping.vars) %>%
tidyr::nest(data = .)
# checking if the nested dataframe looks okay
cat(paste("the entire nested dataframe: \n"))
print(df) # the entire nested dataframe
cat(paste("first element of the list column from nested dataframe: \n"))
print(df$data[[1]]) # first element of the list column
#============== custom function ================
# custom function to run linear regression for every element of a list for two variables
lm_listed <- function(list.col, x_name, y_name) {
fx <- glue::glue("scale({y_name}) ~ scale({x_name})")
# this tags any names that are not predictor variables (used to remove intercept terms)
filter_name <- glue::glue("scale({x_name})")
# dataframe with results from lm
results_df <-
list.col %>% # running linear regression on each individual group with purrr
purrr::map(.x = .,
.f = ~ stats::lm(formula = as.formula(fx),
data = (.))) %>% # tidying up the output with broom
purrr::map_dfr(.x = .,
.f = ~ broom::tidy(x = .),
.id = "group") %>% # remove intercept terms
dplyr::filter(.data = ., term == !!filter_name) %>% # add formula as a character
dplyr::mutate(.data = ., formula = as.character(fx)) %>% # rearrange the dataframe
dplyr::select(
.data = .,
group,
formula,
term,
estimate,
std.error,
t = statistic,
p.value
) %>% # convert to a tibble dataframe
tibble::as_data_frame(x = .)
# return the dataframe
return(results_df)
}
# check if the function works
group_mtcars <- split(mtcars, mtcars$am)
fn_results <- purrr::pmap(.l = list(
l = list(group_mtcars),
x_name = list('wt', 'disp'),
y_name = list('mpg', 'drat')
),
.f = lm_listed) %>%
dplyr::bind_rows()
# seems to be working!
cat(paste("the custom function seems to be working!: \n"))
print(fn_results)
#========= using custom function on entered dataframe =================
cat(paste("running the custom function on the entered dataframe: \n"))
# running custom function for each element of the created list column
df_lm <- purrr::pmap(.l = list(
l = list(df$data),
x_name = list(!!!pred.vars),
y_name = list(!!!crit.vars)
),
.f = lm_listed) %>%
dplyr::bind_rows()
#============================== output ========================
print(df_lm)
# return the final dataframe with results
return(df_lm)
}
# example usage of the function
grouped_lm(
data = iris,
crit.vars = c(Sepal.Length, Petal.Length),
pred.vars = c(Sepal.Width, Petal.Width),
grouping.vars = Species
)
#> the entire nested dataframe:
#> # A tibble: 3 x 2
#> Species data
#> <fct> <list>
#> 1 setosa <tibble [50 x 4]>
#> 2 versicolor <tibble [50 x 4]>
#> 3 virginica <tibble [50 x 4]>
#> first element of the list column from nested dataframe:
#> # A tibble: 50 x 4
#> Sepal.Length Petal.Length Sepal.Width Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5.10 1.40 3.50 0.200
#> 2 4.90 1.40 3.00 0.200
#> 3 4.70 1.30 3.20 0.200
#> 4 4.60 1.50 3.10 0.200
#> 5 5.00 1.40 3.60 0.200
#> 6 5.40 1.70 3.90 0.400
#> 7 4.60 1.40 3.40 0.300
#> 8 5.00 1.50 3.40 0.200
#> 9 4.40 1.40 2.90 0.200
#> 10 4.90 1.50 3.10 0.100
#> # ... with 40 more rows
#> the custom function seems to be working!:
#> # A tibble: 4 x 7
#> group formula term estimate std.error t p.value
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 0 scale(mpg) ~ scale(wt) scale(~ -0.768 0.155 -4.94 1.25e-4
#> 2 1 scale(mpg) ~ scale(wt) scale(~ -0.909 0.126 -7.23 1.69e-5
#> 3 0 scale(drat) ~ scale(disp) scale(~ -0.614 0.192 -3.20 5.20e-3
#> 4 1 scale(drat) ~ scale(disp) scale(~ -0.305 0.287 -1.06 3.12e-1
#> running the custom function on the entered dataframe:
#> Error in !pred.vars: invalid argument type
Created on 2018-03-23 by the reprex package (v0.2.0).
Edits after the answer was provided
I was also wondering how I can get separate columns in the ouput for every grouping variable. So, with the answer provided below, if I run-
grouped_lm(
data = mtcars,
crit.vars = c(wt, mpg),
pred.vars = c(drat, disp),
grouping.vars = c(am, cyl)
)
It works, but the output looks something like this:
As can be appreciated from the picture, it is not at all clear what do the values 1 to 6 represent. So it will be nice to also get a separate column for each grouping variable provided so that, in this example, there will be two columns am and cyl with their respective levels for each lm model.
(I have manually created this dataframe. This is not how the grouping is happening, but this is just to show what the desired output looks like.)

