Since you are running the exact same regression for each group, you might find it simpler to define your regression model as a function() beforehand, and then execute it for each group using mutate.
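model <- function(y, x) {
  # y + x is NA wherever either variable is missing
  a <- y + x
  if (length(which(!is.na(a))) <= 2) {
    # too few complete observations: return NAs of matching length
    return(rep(NA, length(a)))
  } else {
    # na.exclude pads the residuals with NA for the dropped rows
    m <- lm(y ~ x, na.action = na.exclude)
    return(residuals(m))
  }
}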
Note that the first part of this function guards against error messages in case your regression is run on a group with too few non-NA observations to leave any residual degrees of freedom. (This might be the case if you have a data frame with several grouping variables with many levels, or numerous independent variables in your regression, for example lm(y ~ x1 + x2), and can't afford to inspect each group for sufficient non-NA observations.)
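As a quick check of the guard described above, here is a minimal base-R sketch. The function is repeated so the snippet runs on its own, and the toy vectors (y_small, x_small, y_ok, x_ok) are made up purely for illustration:

```r
# Same function as in the answer, repeated so this snippet is self-contained.
model <- function(y, x) {
  a <- y + x
  if (length(which(!is.na(a))) <= 2) {
    return(rep(NA, length(a)))
  } else {
    m <- lm(y ~ x, na.action = na.exclude)
    return(residuals(m))
  }
}

# Hypothetical group with only two complete (y, x) pairs: the guard kicks in.
y_small <- c(1.0, 2.0, NA, 4.0)
x_small <- c(0.5, NA, 1.5, 2.0)
res_small <- model(y_small, x_small)

# Hypothetical group with four complete pairs and one missing y.
y_ok <- c(1.0, 2.1, NA, 3.9, 5.2)
x_ok <- c(1, 2, 3, 4, 5)
res_ok <- model(y_ok, x_ok)

res_small  # all NA, same length as the input
res_ok     # five values, with NA in the position of the missing y
```

In both cases the result has the same length as the input, which is what mutate needs within each group.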
So your example can be rewritten as follows:
library(dplyr)

iris %>% group_by(Species) %>%
  mutate(resid = model(Sepal.Length, Sepal.Width)) %>%
  select(Sepal.Length, Sepal.Width, resid)
Which should yield:
  Species Sepal.Length Sepal.Width       resid
   <fctr>        <dbl>       <dbl>       <dbl>
1  setosa          5.1         3.5  0.04428474
2  setosa          4.9         3.0  0.18952960
3  setosa          4.7         3.2 -0.14856834
4  setosa          4.6         3.1 -0.17951937
5  setosa          5.0         3.6 -0.12476423
6  setosa          5.4         3.9  0.06808885
This method should not be computationally much different from the one using augment(). (I've had to use both methods on data sets containing several hundred million observations, and I believe there was no significant difference in speed compared to using the do() function.)
Also, please note that omitting na.action = na.exclude, or using m$residuals instead of residuals(m), will exclude the rows that have NAs (which are dropped prior to estimation) from the output vector of residuals. The resulting vector will then be shorter than the group it came from, so it cannot be merged back into the data set, and an error message will typically appear.
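The length mismatch can be seen directly in a small base-R sketch. The data frame d below is a made-up example with one missing response value, and na.omit is passed explicitly so the comparison does not depend on the session's default na.action option:

```r
# Hypothetical toy data: the third y value is missing.
d <- data.frame(y = c(1.0, 2.1, NA, 3.9, 5.2), x = 1:5)

m_excl <- lm(y ~ x, data = d, na.action = na.exclude)
m_omit <- lm(y ~ x, data = d, na.action = na.omit)

n_padded   <- length(residuals(m_excl))  # 5: NA is put back in row 3
n_stored   <- length(m_excl$residuals)   # 4: the stored component is not padded
n_omitted  <- length(residuals(m_omit))  # 4: the row is dropped entirely
```

Only the combination of na.exclude and residuals() returns a vector with one entry per row of the original data.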