dplyr::mutate()
will take multiple rows as inputs to functions on the right hand side of the equation(s) that are arguments to mutate()
. As noted in the comments, one can use group_by()
to break the inputs on the right hand side functions into subgroups. This eliminates the need for conditional logic in mutate()
as specified in the original question.
We'll illustrate by calculating cond_disp
from the original post, and include n
to count the number of rows included in the summary data.
mtcars %>% group_by(vs) %>%
mutate(cond_disp = sum(disp),
n = n()) -> result
result[,c("vs","n","cond_disp","disp")]
# A tibble: 32 x 4
# Groups: vs [2]
vs n cond_disp disp
<dbl> <int> <dbl> <dbl>
1 0 18 5529. 160
2 0 18 5529. 160
3 1 14 1854. 108
4 1 14 1854. 258
5 0 18 5529. 360
6 1 14 1854. 225
7 0 18 5529. 360
8 1 14 1854. 147.
9 1 14 1854. 141.
10 1 14 1854. 168.
# … with 22 more rows
The mutate()
approach is useful when one needs to calculate percentage values row by row where the denominator of the percentage is a sum of a column within a combination of by groups. To illustrate, we'll calculate percentage of total displacement for V versus straight engines, print the results, and print the sum of pct_disp
to illustrate that it equals 100 for V engines.
mtcars %>% group_by(vs) %>%
mutate(pct_disp = 100* disp / sum(disp),
n = n()) -> result
result[result$vs==0,c("vs","n","disp","pct_disp")]
sum(result$pct_disp[result$vs==0])
# A tibble: 18 x 4
# Groups: vs [1]
vs n disp pct_disp
<dbl> <int> <dbl> <dbl>
1 0 18 160 2.89
2 0 18 160 2.89
3 0 18 360 6.51
4 0 18 360 6.51
5 0 18 276. 4.99
6 0 18 276. 4.99
7 0 18 276. 4.99
8 0 18 472 8.54
9 0 18 460 8.32
10 0 18 440 7.96
11 0 18 318 5.75
12 0 18 304 5.50
13 0 18 350 6.33
14 0 18 400 7.23
15 0 18 120. 2.18
16 0 18 351 6.35
17 0 18 145 2.62
18 0 18 301 5.44
> sum(result$pct_disp[result$vs==0])
[1] 100
When to use summarise()
dplyr::summarise()
is useful if one wants to summarise the data without adding additional column(s) to the input data frame in the pipeline. The result of summarise()
is one row for each combination of variables in the group_by()
specification in the pipeline, and the column(s) for the summarized data.
mtcars %>% group_by(vs) %>%
summarise(cond_disp = sum(disp),
n = n())
# A tibble: 2 x 3
vs cond_disp n
<dbl> <dbl> <int>
1 0 5529. 18
2 1 1854. 14
row by row calculations
If one needs to use R functions to calculate values across columns within a row, one can use the rowwise()
function to prevent mutate()
from using multiple rows in the functions on the right hand side of equations within mutate()
.
To illustrate, we'll sum the values of vs
, am
. Notice that the result of n = n()
in the output is 1 for each row printed.
mtcars %>% rowwise(.) %>%
mutate(cond_binary = sum(vs,am),
n = n()) -> result
result[,c("vs","am","n","cond_binary")]
# A tibble: 32 x 4
# Rowwise:
vs am n cond_binary
<dbl> <dbl> <int> <dbl>
1 0 1 1 1
2 0 1 1 1
3 1 1 1 2
4 1 0 1 1
5 0 0 1 0
6 1 0 1 1
7 0 0 1 0
8 1 0 1 1
9 1 0 1 1
10 1 0 1 1
# … with 22 more rows
summarise
rather thanmutate
if you want to sum values. If you want to do it according to subsets justgroup_by
first. Trymtcars %>% group_by(vs) %>% summarise(cond_disp = sum(disp))
– Allan Cameron