I would like to rerun my multiple linear regression analyses after removing influential observations/outliers, to determine their effect on the results. My data has approximately 1000 observations of 30 variables (5 predictors, 25 outcomes).
df <- data.frame(replicate(30, sample(0:1000, 1000, replace = TRUE)))
I perform multiple linear regression for each of the 25 outcome variables:
library(tidyverse)
reg <- df %>%
  gather(outcome_name, outcome_value, -(X1:X5)) %>%
  group_by(outcome_name) %>%
  nest() %>%
  mutate(model = map(data, ~ lm(outcome_value ~ X1 + X2 + X3 + X4 + X5, data = .)))
I can then extract the statistics of interest:
stats <- reg %>%
  mutate(glance = map(model, broom::glance),
         tidy = map(model, broom::tidy, conf.int = TRUE))
I would like to rerun the above after removing the outliers, identified either by a simple rule (for example, values more than 2 standard deviations above the mean) or by an influence measure such as Cook's distance. However, I can't figure out how to exclude the outliers in my code so that each regression model drops only its own outliers.
I have tried filtering out observations more than 2 SD above the mean of each outcome variable before fitting the regressions, but those rows were then dropped from all 25 outcome models rather than only from the one model in which the observation is an outlier. Any suggestions appreciated.
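To make the goal concrete, here is a minimal sketch of per-outcome exclusion, where the filtering happens inside each nested data frame so that a row is dropped only for the outcome in which it is extreme. The helpers `fit_lm`, `drop_sd` and `refit_cooks` are hypothetical names introduced for illustration, and the Cook's distance cutoff of 4/n is just a common rule of thumb, not something fixed by the analysis. I have not validated this beyond the toy data.

```r
library(tidyverse)

set.seed(1)  # assumption: fixed seed only so the toy data is reproducible
df <- data.frame(replicate(30, sample(0:1000, 1000, replace = TRUE)))

# Hypothetical helper: the same model as above, fitted to one outcome's data
fit_lm <- function(d) {
  lm(outcome_value ~ X1 + X2 + X3 + X4 + X5, data = d)
}

# Hypothetical helper: keep rows at most k SD above this outcome's own mean
drop_sd <- function(d, k = 2) {
  d[d$outcome_value <= mean(d$outcome_value) + k * sd(d$outcome_value), ]
}

# Hypothetical helper: refit after dropping rows with Cook's distance > 4/n
# (assumes no rows were dropped for missing values in the first fit)
refit_cooks <- function(model, d) {
  keep <- cooks.distance(model) <= 4 / nrow(d)
  fit_lm(d[keep, ])
}

reg_trimmed <- df %>%
  gather(outcome_name, outcome_value, -(X1:X5)) %>%
  group_by(outcome_name) %>%
  nest() %>%
  mutate(
    model       = map(data, fit_lm),                # original fit
    data_sd     = map(data, drop_sd),               # per-outcome SD filter
    model_sd    = map(data_sd, fit_lm),             # refit on filtered rows
    model_cooks = map2(model, data, refit_cooks)    # refit without influential rows
  )
```

Because the filter runs after `nest()`, each of the 25 outcomes keeps its own set of rows. With the uniform toy data above the 2 SD rule will likely remove nothing, so real data is needed to see an effect. The `broom::glance` and `broom::tidy` steps can then be mapped over `model_sd` or `model_cooks` in the same way as over `model`.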