12
votes

When using dplyr's "group_by" and "mutate", if I understand correctly, the data frame is split into sub-data frames according to the group_by argument. For example, with the following code:

 library(dplyr)
 set.seed(7)
 df <- data.frame(x = runif(10), let = rep(letters[1:5], each = 2))
 df %>% group_by(let) %>% mutate(mean.by.letter = mean(x))

mean() is applied in turn to column x of each of the 5 sub-dfs, one per letter from a to e.

So you can manipulate the columns of the sub-dfs, but can you access the sub-dfs themselves? To my surprise, if I try:

 set.seed(7)
 data <- data.frame(x=runif(10),let=rep(letters[1:5],each=2))
 data %>% group_by(let) %>% mutate(mean.by.letter = mean(.$x))

the result is different. From this result, one can infer that "." does not stand for each sub-df in turn but for the whole "data" data frame (as far as "." is concerned, group_by doesn't change anything).
I am asking because I want to apply a stat function that takes a data frame as its argument to each of these sub-dfs. Thanks!
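
To make the difference concrete, here is a minimal check of the two calls above. It is only a sketch: the column name m and the objects with_dot and per_group are made up for illustration, and it assumes dplyr with the magrittr pipe.

```r
library(dplyr)

set.seed(7)
data <- data.frame(x = runif(10), let = rep(letters[1:5], each = 2))

# Inside mutate(), "." is still the whole data frame that was piped in,
# so mean(.$x) is the overall mean, recycled onto every row.
with_dot <- data %>% group_by(let) %>% mutate(m = mean(.$x))

# A bare column name is evaluated separately within each group.
per_group <- data %>% group_by(let) %>% mutate(m = mean(x))

# Compare against base R: overall mean vs. per-letter means from ave().
all.equal(with_dot$m, rep(mean(data$x), nrow(data)))
all.equal(per_group$m, ave(data$x, data$let))
```

Both comparisons should hold, which matches the inference that "." is the full data frame rather than the current group.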

2
You could try with ?do - akrun
do.call(rbind, lapply(split(df, df$let), myfun)) - Frank
I don't understand the question, since the accepted answer produces the same result as data %>% group_by(let) %>% mutate(mean.by.letter = mean(x)) (unless I'm missing something) but will likely be slower because of the extra do call - talat
@docendo-discimus: sorry if it wasn't clear, but I didn't want to make the question too long, so I used an oversimplified example. And you're right, in this simple case I could use the simpler solution (i.e. the one you repeat). But as I tried to explain at the end of my question, that solution no longer works once you need to pass the whole sub-data frame as an argument to your stat function (and not just one of its columns, like the x in mean()...) - godot

2 Answers

9
votes

We can use mutate() within do(), where "." refers to the current group's sub-data frame:

data %>%
    group_by(let) %>%
    do(mutate(., mean.by.letter = mean(.$x)))
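
For the case in the question, where the stat function needs the whole sub-data frame, the same pattern might look like the sketch below. range_width() is a hypothetical stand-in for such a function and stat is a made-up column name. The base R split/lapply version from the comments is included for comparison.

```r
library(dplyr)

set.seed(7)
data <- data.frame(x = runif(10), let = rep(letters[1:5], each = 2))

# Hypothetical statistic that takes a whole data frame, not a single column.
range_width <- function(d) max(d$x) - min(d$x)

# Inside do(), "." is the current group's sub-data frame.
res <- data %>%
  group_by(let) %>%
  do(mutate(., stat = range_width(.))) %>%
  ungroup()

# Base R equivalent: split by letter, apply the function, recombine.
base_res <- do.call(rbind, lapply(split(data, data$let), function(d) {
  d$stat <- range_width(d)
  d
}))

# Both approaches should give the same per-group statistic.
all.equal(res$stat, base_res$stat)
```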

3
votes

Since dplyr 0.8 you can use group_map(); the "." in the group_map() call represents the sub-data.frame. Its behavior has changed a bit over time. With dplyr 1.0 we can do:

df <- data.frame(x=runif(10),let=rep(letters[1:5],each=2))
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df %>%
  group_by(let) %>%
  group_map(~mutate(., mean.by.letter = mean(x)), .keep = TRUE) %>%
  bind_rows()
#> # A tibble: 10 x 3
#>         x let   mean.by.letter
#>     <dbl> <chr>          <dbl>
#>  1 0.442  a              0.271
#>  2 0.0999 a              0.271
#>  3 0.669  b              0.343
#>  4 0.0167 b              0.343
#>  5 0.908  c              0.575
#>  6 0.242  c              0.575
#>  7 0.685  d              0.378
#>  8 0.0716 d              0.378
#>  9 0.883  e              0.843
#> 10 0.804  e              0.843

group_map() was introduced in dplyr 0.8.0. These posts describe its original, now outdated, behavior:

https://www.tidyverse.org/articles/2019/02/dplyr-0-8-0/
https://www.tidyverse.org/articles/2018/12/dplyr-0-8-0-release-candidate/
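
If the function returns a data frame for each group, group_modify() may be a more direct fit than group_map() followed by bind_rows(), since it recombines the pieces itself. A minimal sketch, assuming a recent dplyr version; n_above_half() is a hypothetical function that takes a whole data frame, and stat is a made-up column name.

```r
library(dplyr)

set.seed(7)
df <- data.frame(x = runif(10), let = rep(letters[1:5], each = 2))

# Hypothetical statistic computed from a whole sub-data frame.
n_above_half <- function(d) sum(d$x > 0.5)

# In the lambda, .x is the current group's rows (without the grouping
# column by default); group_modify() binds the results back together
# and puts the grouping column first.
res <- df %>%
  group_by(let) %>%
  group_modify(~ mutate(.x, stat = n_above_half(.x))) %>%
  ungroup()

res
```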