Removing outliers from linear regression when using multiple models

Question

I would like to rerun my multiple linear regression analyses having removed influential observations/outliers to determine their effect. My data has approximately 1000 observations of 30 variables (5 predictors, 25 outcomes).

df <- data.frame(replicate(30, sample(0:1000, 1000, rep = TRUE)))

I perform multiple linear regression for each of the 25 outcome variables:

library(tidyverse)

reg <- df %>%
  gather(outcome_name, outcome_value, -(X1 : X5)) %>%
  group_by(outcome_name) %>%
  nest() %>%
  mutate(model = map(data, ~lm(outcome_value ~ X1 + X2 + X3 + X4 + X5, data = .)))

And then I can subsequently extract the statistics of interest:

stats <- reg %>%
  mutate(glance = map(model, broom::glance), 
         tidy = map(model, broom::tidy, conf.int = TRUE)
  )

I would like to rerun the above after removing outliers, identified either by a simple rule (for example, values more than 2 standard deviations above the mean) or by a measure such as Cook's distance. However, I can't figure out how to exclude the outliers in my code so that the exclusion is applied separately within each regression model.

I have tried filtering out observations more than 2 SD above the mean of each outcome variable before performing the regression, but I then lost those observations from all 25 outcome models, rather than only from the single model for which the observation is an outlier. Any suggestions appreciated.

ngm ngm · Accepted Answer · 2018-04-24T14:58:27

Use broom::augment to add the relevant measures to each dataset, and keep map-ping away.

For example:

library(tidyverse)
library(broom)
set.seed(1)
df <- data.frame(replicate(30, sample(0:1000, 1000, rep = TRUE)))

reg <- df %>%
  gather(outcome_name, outcome_value, -(X1 : X5)) %>%
  group_by(outcome_name) %>%
  nest() %>%
  mutate(model = map(data, ~lm(outcome_value ~ X1 + X2 + X3 + X4 + X5, data = .)),
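         # augment() adds per-observation diagnostics such as .std.resid and .cooksd
         data2 = map(model, augment),
         # drop flagged observations only within the model where they are flagged
         data3 = map(data2, ~filter(., abs(.std.resid) < 2)),
         # refit each model on its own filtered data
         model2 = map(data3, ~lm(outcome_value ~ X1 + X2 + X3 + X4 + X5, data = .)))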
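
The question also mentions Cook's distance. augment() returns it as the .cooksd column, so the same pattern should carry over. The sketch below is one way to write it. The helper names fit_lm and drop_influential are made up for illustration, and the 4 / n cutoff is an assumed rule of thumb, not something the question or this answer prescribes.

```r
library(tidyverse)
library(broom)

set.seed(1)
df <- data.frame(replicate(30, sample(0:1000, 1000, replace = TRUE)))

# Hypothetical helper: fit the same five-predictor model to one outcome's data
fit_lm <- function(d) lm(outcome_value ~ X1 + X2 + X3 + X4 + X5, data = d)

# Hypothetical helper: keep rows whose Cook's distance is below 4 / n
# (assumed cutoff; choose whatever threshold suits your analysis)
drop_influential <- function(model) {
  aug <- augment(model)
  cutoff <- 4 / nrow(aug)
  filter(aug, .cooksd < cutoff)
}

reg_cook <- df %>%
  gather(outcome_name, outcome_value, -(X1:X5)) %>%
  group_by(outcome_name) %>%
  nest() %>%
  ungroup() %>%
  mutate(model     = map(data, fit_lm),
         kept      = map(model, drop_influential),
         # how many rows each outcome's model lost
         n_dropped = map2_int(data, kept, ~ nrow(.x) - nrow(.y)),
         model2    = map(kept, fit_lm))
```

Because the filtering happens inside each nested data frame, a row dropped for one outcome stays in the data for the other 24. The n_dropped column makes it easy to check how aggressive the cutoff was for each model.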

Statistician's Disclaimer: I've solved the programming problem you asked about. This should not be taken as an endorsement of the idea of automatically checking for, or doing anything with, so-called "outliers".
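
Since the stated goal was to see what effect the removal has, it may help to put the coefficients from both fits side by side. This is a minimal sketch rather than a recommended workflow. The names fit_lm, comparison, data_used, full and trimmed are made up for illustration. On uniform toy data like this, few or no rows may exceed the threshold, so the two fits can come out nearly identical.

```r
library(tidyverse)
library(broom)

set.seed(1)
df <- data.frame(replicate(30, sample(0:1000, 1000, replace = TRUE)))

# Hypothetical helper: fit the same five-predictor model to one outcome's data
fit_lm <- function(d) lm(outcome_value ~ X1 + X2 + X3 + X4 + X5, data = d)

reg <- df %>%
  gather(outcome_name, outcome_value, -(X1:X5)) %>%
  group_by(outcome_name) %>%
  nest() %>%
  ungroup() %>%
  mutate(model  = map(data, fit_lm),
         data3  = map(model, ~ filter(augment(.x), abs(.std.resid) < 2)),
         model2 = map(data3, fit_lm))

# One row per outcome, fit and term, with estimates and confidence intervals
comparison <- reg %>%
  select(outcome_name, full = model, trimmed = model2) %>%
  pivot_longer(c(full, trimmed), names_to = "data_used", values_to = "model") %>%
  mutate(coefs = map(model, broom::tidy, conf.int = TRUE)) %>%
  select(-model) %>%
  unnest(coefs)
```

Filtering comparison on a single term, or plotting estimate with conf.low and conf.high by data_used, shows whether dropping the flagged rows changes any conclusion.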
