Using group_by and summarise from dplyr for all rows not containing the variable to group_by

Question

I have a data.frame with such as

df1 <- data.frame(id = c("A", "A", "B", "B", "B"), 
                  cost = c(100, 10, 120, 102, 102)

I know that I can use

df1.a <- group_by(df1, id) %>%
    summarise(no.c = n(), 
              m.costs = mean(cost))

to calculate the number of observations and mean by id. How could I do so if I want to calculate the number of observations and mean for all rows that are NOT equal to the ID, so it would for example give me 3 as value for observations not A and 2 for observations not B.

I would like to use the dplyr package and the group_by functions since I have to this for a lot of huge dataframes.

alistaire alistaire · Accepted Answer · 2016-11-20T00:18:35

You can use the . to refer to the whole data.frame, which lets you calculate the differences between the group and the whole:

df1 %>% group_by(id) %>% 
    summarise(n = n(), 
              n_other = nrow(.) - n, 
              mean_cost = mean(cost), 
              mean_other = (sum(.$cost) - sum(cost)) / n_other)

## # A tibble: 2 × 5
##       id     n n_other mean_cost mean_other
##   <fctr> <int>   <int>     <dbl>      <dbl>
## 1      A     2       3        55        108
## 2      B     3       2       108         55

As you can see from the results, with two groups you could just use rev, but this approach will scale to more groups or calculations easily.

Using group_by and summarise from dplyr for all rows not containing the variable to group_by

2 Answers