0
votes

I'm trying to sum subgroups with dplyr using mutate, but this means I need to know which group each row belongs to in order to write a conditional. The idea is not to change the current grouping, as the code below shows:

mutate(mtcars, cond_disp = ifelse(vs==1,sum(disp[vs==1]),sum(disp[vs==0])))

Is there a dynamic method to achieve this?

2
You can change the grouping back and forth. - Sotos
I think you want summarise rather than mutate if you want to sum values. If you want to do it according to subsets just group_by first. Try mtcars %>% group_by(vs) %>% summarise(cond_disp = sum(disp)) - Allan Cameron

2 Answers

2
votes

You can do it with group_by(vs) and summarise(). But since you asked for it, here is the version with mutate(), which keeps all 32 rows and repeats the group sum on every row of the group.

library(tidyverse)

mtcars %>% 
  group_by(vs) %>% 
  #summarise(cond_disp = sum(disp))
  mutate(cond_disp = sum(disp))
#> # A tibble: 32 x 12
#> # Groups:   vs [2]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb cond_disp
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     5529.
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     5529.
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     1854.
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     1854.
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     5529.
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     1854.
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     5529.
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     1854.
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     1854.
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     1854.
#> # … with 22 more rows
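
Since the question asks not to change the current grouping, here is a minimal sketch of grouping only for the mutate() step and dropping the grouping again afterwards with ungroup(). The names result and base_sum are only illustrative, and the base R ave() line is an alternative I am adding for comparison, not something from the question.

```r
library(dplyr)

# Group only for the mutate() step, then drop the grouping again
# so later steps in the pipeline see an ungrouped tibble.
result <- mtcars %>%
  group_by(vs) %>%
  mutate(cond_disp = sum(disp)) %>%
  ungroup()

# Base R alternative: ave() applies FUN within each level of vs
# and returns a vector as long as the input, with no grouping involved.
base_sum <- ave(mtcars$disp, mtcars$vs, FUN = sum)
```

Both approaches should give the same per-row group sums; the ungroup() version simply stays inside the pipe.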
1
votes

dplyr::mutate() evaluates each right-hand-side expression over all rows of the current group, or over the whole data frame when it is ungrouped. As noted in the comments, one can use group_by() so that functions such as sum() on the right-hand side see only the rows of one subgroup at a time. This eliminates the need for the conditional logic in mutate() shown in the original question.

We'll illustrate by calculating cond_disp from the original post, and include n to count the number of rows included in the summary data.

mtcars %>% group_by(vs) %>% 
     mutate(cond_disp = sum(disp),
            n = n()) -> result
result[,c("vs","n","cond_disp","disp")]


# A tibble: 32 x 4
# Groups:   vs [2]
      vs     n cond_disp  disp
   <dbl> <int>     <dbl> <dbl>
 1     0    18     5529.  160 
 2     0    18     5529.  160 
 3     1    14     1854.  108 
 4     1    14     1854.  258 
 5     0    18     5529.  360 
 6     1    14     1854.  225 
 7     0    18     5529.  360 
 8     1    14     1854.  147.
 9     1    14     1854.  141.
10     1    14     1854.  168.
# … with 22 more rows

The mutate() approach is useful when one needs to calculate percentage values row by row where the denominator of the percentage is the sum of a column within each combination of grouping variables. To illustrate, we'll calculate each car's percentage of total displacement within V versus straight engines, print the results for V engines (vs == 0), and print the sum of pct_disp to show that it equals 100 for that group.

mtcars %>% group_by(vs) %>% 
     mutate(pct_disp = 100* disp / sum(disp),
            n = n()) -> result
result[result$vs==0,c("vs","n","disp","pct_disp")]
sum(result$pct_disp[result$vs==0])


# A tibble: 18 x 4
# Groups:   vs [1]
      vs     n  disp pct_disp
   <dbl> <int> <dbl>    <dbl>
 1     0    18  160      2.89
 2     0    18  160      2.89
 3     0    18  360      6.51
 4     0    18  360      6.51
 5     0    18  276.     4.99
 6     0    18  276.     4.99
 7     0    18  276.     4.99
 8     0    18  472      8.54
 9     0    18  460      8.32
10     0    18  440      7.96
11     0    18  318      5.75
12     0    18  304      5.50
13     0    18  350      6.33
14     0    18  400      7.23
15     0    18  120.     2.18
16     0    18  351      6.35
17     0    18  145      2.62
18     0    18  301      5.44
> sum(result$pct_disp[result$vs==0])
[1] 100
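
As a quick check that the percentages add up within every group and not only for vs == 0, one could collapse the mutate() result with summarise(). This is just a sketch; pct_check and total_pct are names I made up for the illustration.

```r
library(dplyr)

# Compute the within-group percentage, then total it per group.
# Each group's percentages should add up to 100 (up to floating point error).
pct_check <- mtcars %>%
  group_by(vs) %>%
  mutate(pct_disp = 100 * disp / sum(disp)) %>%
  summarise(total_pct = sum(pct_disp),
            n = n())
pct_check
```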

When to use summarise()

dplyr::summarise() is useful if one wants to summarise the data without adding column(s) to the input data frame in the pipeline. The result of summarise() has one row for each combination of the variables in the group_by() specification, plus the column(s) holding the summarised values.

mtcars %>% group_by(vs) %>% 
     summarise(cond_disp = sum(disp),
            n = n()) 
# A tibble: 2 x 3
     vs cond_disp     n
  <dbl>     <dbl> <int>
1     0     5529.    18
2     1     1854.    14

Row by row calculations

If one needs to use R functions to calculate values across columns within a row, one can use the rowwise() function to prevent mutate() from using multiple rows in the functions on the right hand side of equations within mutate().

To illustrate, we'll sum the values of vs and am within each row. Notice that the result of n = n() in the output is 1 for each row printed, because rowwise() treats every row as its own group.

mtcars %>% rowwise(.) %>% 
     mutate(cond_binary = sum(vs,am),
            n = n()) -> result
result[,c("vs","am","n","cond_binary")]


# A tibble: 32 x 4
# Rowwise: 
      vs    am     n cond_binary
   <dbl> <dbl> <int>       <dbl>
 1     0     1     1           1
 2     0     1     1           1
 3     1     1     1           2
 4     1     0     1           1
 5     0     0     1           0
 6     1     0     1           1
 7     0     0     1           0
 8     1     0     1           1
 9     1     0     1           1
10     1     0     1           1
# … with 22 more rows
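
For a simple sum like this one, rowwise() is not strictly required, because + is already vectorised and works element by element down the columns. The sketch below compares the two; vectorised and rowwise_res are illustrative names, and I am assuming that the only goal is the per-row sum.

```r
library(dplyr)

# Vectorised version: vs + am is computed element-wise, no grouping needed.
vectorised <- mtcars %>%
  mutate(cond_binary = vs + am)

# rowwise() version from above, ungrouped afterwards for comparison.
rowwise_res <- mtcars %>%
  rowwise() %>%
  mutate(cond_binary = sum(vs, am)) %>%
  ungroup()

# Both columns should hold the same values.
all.equal(vectorised$cond_binary, rowwise_res$cond_binary)
```

rowwise() is still the tool to reach for when the function on the right-hand side is not vectorised across columns.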