6
votes

I want to use dplyr to group a data.frame, fit linear regressions and save the residuals as a column in the original, ungrouped data.frame.

Here's an example

> iris %>%
   select(Sepal.Length, Sepal.Width) %>%
   group_by(Species) %>%
   do(mod = lm(Sepal.Length ~ Sepal.Width, data=.)) %>%

Returns:

     Species     mod
1     setosa <S3:lm>
2 versicolor <S3:lm>
3  virginica <S3:lm>

Instead, I would like the original data.frame with a new column containing residuals.

For example,

    Sepal.Length Sepal.Width  resid
1   5.1         3.5  0.04428474
2   4.9         3.0  0.18952960
3   4.7         3.2 -0.14856834
4   4.6         3.1 -0.17951937
5   5.0         3.6 -0.12476423
6   5.4         3.9  0.06808885
3

3 Answers

8
votes

I adapted an example from http://jimhester.github.io/plyrToDplyr/.

r <- iris %>%
  group_by(Species) %>%
  do(model = lm(Sepal.Length ~ Sepal.Width, data=.)) %>%
  do((function(mod) {
     data.frame(resid = residuals(mod$model))
  })(.))

corrected <- cbind(iris, r)

update Another method is to use the augment function in the broom package:

r <- iris %>%
  group_by(Species) %>%
  do(augment(lm(Sepal.Length ~ Sepal.Width, data=.))

Which returns:

Source: local data frame [150 x 10]
Groups: Species

   Species Sepal.Length Sepal.Width  .fitted    .se.fit      .resid       .hat
1   setosa          5.1         3.5 5.055715 0.03435031  0.04428474 0.02073628
2   setosa          4.9         3.0 4.710470 0.05117134  0.18952960 0.04601750
3   setosa          4.7         3.2 4.848568 0.03947370 -0.14856834 0.02738325
4   setosa          4.6         3.1 4.779519 0.04480537 -0.17951937 0.03528008
5   setosa          5.0         3.6 5.124764 0.03710984 -0.12476423 0.02420180
...
3
votes

A solution that seems to be easier than the ones proposed so far and closer to the code of the original question is :

iris %>%
   group_by(Species) %>%
   do(data.frame(., resid = residuals(lm(Sepal.Length ~ Sepal.Width, data=.))))

Result :

# A tibble: 150 x 6
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   resid
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>
 1          5.1         3.5          1.4         0.2 setosa   0.0443
 2          4.9         3            1.4         0.2 setosa   0.190 
 3          4.7         3.2          1.3         0.2 setosa  -0.149 
 4          4.6         3.1          1.5         0.2 setosa  -0.180 
 5          5           3.6          1.4         0.2 setosa  -0.125 
 6          5.4         3.9          1.7         0.4 setosa   0.0681
 7          4.6         3.4          1.4         0.3 setosa  -0.387 
 8          5           3.4          1.5         0.2 setosa   0.0133
 9          4.4         2.9          1.4         0.2 setosa  -0.241 
10          4.9         3.1          1.5         0.1 setosa   0.120 
1
votes

Since you are be running the exact same regression for each group, you might find it simpler to just define your regression model as a function() beforehand, and then execute it for each group using mutate.

model<- function(y,x){ 
  a<- y + x 
  if( length(which(!is.na(a))) <= 2  ){
    return( rep(NA, length(a)))
  } else {
    m<- lm( y ~ x, na.action = na.exclude)
    return( residuals(m))
    } 
}

Note, that the first part of this function is to insure against any error messages popping up in case your regression is run on a group with less than zero degrees of freedom (This might be the case if you have a dataframe with several grouping variables with many levels , or numerous independent variables for your regression (like for example lm(y~ x1 + x2)), and can't afford to inspect each of them for sufficient non-NA observations).

So your example can be rewritten as follows:

iris %>% group_by(Species) %>% 
  mutate(resid = model(Sepal.Length,Sepal.Width) ) %>% 
  select(Sepal.Length,Sepal.Width,resid)

Which should yield:

   Species Sepal.Length Sepal.Width       resid
    <fctr>        <dbl>       <dbl>       <dbl>
1   setosa          5.1         3.5  0.04428474
2   setosa          4.9         3.0  0.18952960
3   setosa          4.7         3.2 -0.14856834
4   setosa          4.6         3.1 -0.17951937
5   setosa          5.0         3.6 -0.12476423
6   setosa          5.4         3.9  0.06808885

This method should not be computationally much different from the one using augment().(I've had to use both methods on data sets containing several hundred million observations, and believe there was no significant difference in terms of speed compared to using the do() function).

Also, please note that omitting na.action = na.exclude, or using m$residuals instead of residuals(m), will result in the exclusion of rows that have NAs (dropped prior to estimation) from the output vector of residuals. The corresponding vector will thus not have sufficient length() in order to be merged with the data set, and some error message might appear.