0
votes

My aim is to obtain a linear regression model of a dataset, and its associated residuals, after removing outliers.

Using the 'iris' data set to illustrate:

This is the original model, with no observations removed:

(MODEL1)

library(dplyr)
library(magrittr)
library(broom)

iris %>%
  do(tidy(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, .)))

               term   estimate  std.error statistic      p.value
1       (Intercept)  2.3903891 0.26226815  9.114294 5.942826e-16
2       Sepal.Width  0.4322172 0.08138982  5.310458 4.025982e-07
3      Petal.Length  0.7756295 0.06424566 12.072869 1.151112e-23
4 Speciesversicolor -0.9558123 0.21519853 -4.441537 1.759999e-05
5  Speciesvirginica -1.3940979 0.28566053 -4.880261 2.759618e-06

But I want to refit the model with some outliers (based on .cooksd) removed, i.e.:

(MODEL2)

iris %>%
  do(augment(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, .))) %>%
  filter(.cooksd < 0.03) %>%
  do(tidy(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, .)))


               term   estimate  std.error statistic      p.value
1       (Intercept)  2.3927287 0.23718040 10.088223 2.875549e-18
2       Sepal.Width  0.4150542 0.07374143  5.628508 9.775805e-08
3      Petal.Length  0.8035635 0.05975821 13.446914 7.229176e-27
4 Speciesversicolor -0.9858935 0.19651867 -5.016793 1.589618e-06
5  Speciesvirginica -1.4841365 0.26399083 -5.621924 1.008374e-07

Saving these models:

lm_model2 <- iris %>% 
  do(augment(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, .))) %>% 
  filter(.cooksd < 0.03) %>% 
  lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, .)


lm_model1 <- iris %>%
  lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, .)

Having done that, is it possible to obtain the regression residuals of the dataset based on the second model?

The only solution I can think of is to use the coefficients of model 2 to calculate these indirectly, i.e.:

Residual = Sepal.Length - (2.3927287 + 0.4150542 * Sepal.Width + 0.8035635 * Petal.Length + [-0.9858935 * Speciesversicolor] or + [-1.4841365 * Speciesvirginica])

Is there a better way? Something similar to:

residuals <- obtain_residuals(iris, lm_model2)

Many thanks.

2
Did you save your lm model object? - Hong Ooi
Why not Sepal.Length - predict(model)? - IRTFM
I suppose that is implied by my question. - IRTFM
Added code to save the objects. - Tony2016
Residuals are computed for each observation (row) of the dataset, not for the variables (columns) of the model. lm_model1 and lm_model2 are model summaries as data frames and are not aligned to the original dataset (N=150). - Parfait

2 Answers

1
votes

I think your tidy() removed a lot of the normal output from lm.

mylm <- iris %>%
  do(augment(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, .))) %>%
  filter(.cooksd < 0.03) %>%
  lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, .)

head(mylm$residuals)

          1           2           3           4           5           6 
 0.12959260  0.13711970 -0.06553479 -0.28474207 -0.01191282  0.02250186 
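
Note that the residuals stored on the fitted object cover only the rows that passed the Cook's distance filter, not all 150 rows of iris. The following is a minimal base-R sketch of the difference, assuming the same formula and the 0.03 threshold from the question. The names `fml`, `full_fit`, `keep`, `trimmed_fit` and `all_resid` are illustrative and not part of the original code.

```r
# Formula and threshold taken from the question; object names are illustrative.
fml <- Sepal.Length ~ Sepal.Width + Petal.Length + Species

# Fit on all rows, then flag the rows whose Cook's distance is below 0.03.
full_fit <- lm(fml, data = iris)
keep <- cooks.distance(full_fit) < 0.03

# Refit on the retained rows only.
trimmed_fit <- lm(fml, data = iris[keep, ])

# residuals() returns one value per retained row, not one per row of iris.
length(residuals(trimmed_fit)) == sum(keep)

# Residuals for every row of iris under the trimmed model:
# observed response minus the trimmed model's prediction.
all_resid <- iris$Sepal.Length - predict(trimmed_fit, newdata = iris)
length(all_resid) == nrow(iris)
```

For the retained rows, `all_resid[keep]` should agree with `residuals(trimmed_fit)`. The remaining entries are the residuals of the removed observations under the refitted model.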

0
votes

With help from 42's 'predict' suggestion, I believe the code below would work. It can also be turned into a function if so desired.

iris %>% 
  do(augment(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, .))) %>% 
  filter(.cooksd < 0.03) %>% 
  lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, na.action=na.exclude, data=.) %>% 
  predict(iris) %>% 
  cbind(predicted = ., iris) %>% 
  mutate(residual = Sepal.Length - predicted)
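
As a rough sketch of the function version, the helper below takes its name, `obtain_residuals`, from the hypothetical call in the question. It is not an existing function in base R, dplyr or broom. It assumes the model's response is a plain column of `data` (no transformation such as `log(y)` on the left-hand side of the formula).

```r
# Hypothetical helper: residuals of `data` under an already fitted `model`.
# Assumes the response in the model formula is an untransformed column of `data`.
obtain_residuals <- function(data, model) {
  # The first variable in the model formula is the response.
  response <- all.vars(formula(model))[1]
  data[[response]] - predict(model, newdata = data)
}

# Rebuild model 2 from the question with base R (threshold 0.03 as in the question).
fml <- Sepal.Length ~ Sepal.Width + Petal.Length + Species
keep <- cooks.distance(lm(fml, data = iris)) < 0.03
lm_model2 <- lm(fml, data = iris[keep, ])

# One residual per row of iris, including the rows dropped before refitting.
residuals_all <- obtain_residuals(iris, lm_model2)
head(residuals_all)
```

The result can be attached to the data with something like `iris %>% mutate(residual = obtain_residuals(iris, lm_model2))`.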

Thank you all for your help and suggestions.