0
votes

In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:

data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))

Output:

  a         b sum_c
1 a         1     4
2 b         2     5
3 c         3     6

Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where if x=TRUE I want to add variable x_v to the summary.

I see several related stackoverflow questions, but I didn't see this.

EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).

summarize_num <- TRUE
summarize_num_distinct <- FALSE

data <- data.frame(val=c(1,2,2))

if (summarize_num && summarize_num_distinct) {
  summ <- data %>% summarize(n=n(), n_unique=n_distinct())
} else if (summarize_num) {
  summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
  summ <- data %>% summarize(n_unique=n_distinct())
}

Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.

As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.

I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.

2
Would using dplyr::mutate instead of summarise work, i.e. you get the per group effect and can then filter based on any condition?biomiha
@akrun Reasonable request. I added some example code. It works for the simple case, but gets combinatorially difficult for many conditions. I'd like it to remain simple if I have many conditions.dfrankow
@biomiha Good thought, but I think no. In the real example, I am using group_by clauses, so I think I need to summarise rather than mutate.dfrankow
Mutate still mutates by group (definitely with dplyr, not 100% sure about dbplyr).biomiha

2 Answers

0
votes

Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.

data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups

summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))} 
summ = slice(summ, 1) %>% ungroup() %>% select(-val)

## A tibble: 2 x 3
#    grp     n n_unique
#  <dbl> <int>    <int>
#1     1     2        2
#2     2     1        1
0
votes

The summarise_at() function takes a list of functions as parameter. So, we can get

data <- data.frame(val=c(1,2,2))

fcts <- list(n_unique = n_distinct, n = length)
data %>% 
  summarise_at(.vars = "val", fcts)
  n_unique n
1        2 3

All functions in the list must take one argument. Therefore, n() was replaced by length().

The list of functions can be modified dynamically as requested by the OP, e.g.,

summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>% 
  summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
  n
1 3

So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:

fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>% 
  summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
  max n      avg sum
1   2 3 1.666667   5