Select a number of top groups from data frame

Question

Is there an efficient way to grab some number of top groups from a data frame in R? For example:

exampleDf <- data.frame(
  subchar = c("facebook", "twitter", "snapchat", "male", "female", "18", "20"),
  superchar = c("social media", "social media", "social media", "gender", "gender", "age", "age"),
  cweight = c(.2, .4, .4, .7, .3, .8, .6),
  groupWeight = c(10, 10, 10, 20, 20, 70, 70)
)

So with dplyr I can group them and sort by group weight with:

sortedDf <- exampleDf %>%
  group_by(superchar) %>%
  arrange(desc(groupWeight))

But is there any way to select the 'top' groups, like age and gender in this case? Something like dplyr's slice() function, but applied to whole groups rather than to rows within a group.

This seems like a weird example because groupWeight is unique within groups, but it's not a group id, so the groups don't really matter. In a case like this, I'd do exampleDf %>% filter(groupWeight %in% sort(unique(groupWeight), decreasing = TRUE)[1:2]). The more interesting case is if groupWeight varies within the group, but then you need to specify what summary function of groupWeight to use (mean, median, max, etc.). And probably the best way is to exampleDf %>% group_by %>% summarize %>% top_n %>% left_join(exampleDf) back to the original data. — Gregor Thomas
@heds1 Sorry should have clarified that, in this case it would be the groups with highest groupWeight. So the rank here would be Age first, then gender, and social media the lowest. — elynagh
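
The summarise-then-join idea from the first comment could look roughly like the sketch below. It is only an illustration under a few assumptions: max() is taken as the per-group summary, slice_max() (dplyr 1.0.0 or later) stands in for top_n(), semi_join() stands in for the left join so that no extra column is carried back, and the names n_top, topGroups, topDf and w are made up for the example.

```r
library(dplyr)

exampleDf <- data.frame(
  subchar = c("facebook", "twitter", "snapchat", "male", "female", "18", "20"),
  superchar = c("social media", "social media", "social media", "gender", "gender", "age", "age"),
  cweight = c(.2, .4, .4, .7, .3, .8, .6),
  groupWeight = c(10, 10, 10, 20, 20, 70, 70)
)

n_top <- 2  # hypothetical: how many groups to keep

# One row per group, holding the chosen summary of groupWeight
# (max() is an assumption; mean() or median() would work the same way).
topGroups <- exampleDf %>%
  group_by(superchar) %>%
  summarise(w = max(groupWeight)) %>%
  slice_max(w, n = n_top, with_ties = FALSE)

# Keep only the original rows whose group made the cut.
topDf <- exampleDf %>%
  semi_join(topGroups, by = "superchar")
```

On the example data this keeps the four rows belonging to age and gender, in their original row order. Setting with_ties = FALSE caps the result at exactly n_top groups; drop it if tied groups should all be kept.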

Rui Barradas · Accepted Answer · 2019-12-19T23:03:25

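dplyr has a group_indices() function that assigns a consecutive number to each group; inside mutate(), dplyr 1.0.0 and later use cur_group_id() for this instead. You can then filter on that new number. The example below keeps the first 2 groups. Note that the numbering follows the sorted order of the group keys, not the row order produced by arrange(), so age and gender are kept here because they sort first alphabetically, which happens to match the ranking by groupWeight.

library(dplyr)

Top <- 2

sortedDf <- exampleDf %>%
  group_by(superchar) %>%
  arrange(desc(groupWeight)) %>%
  # cur_group_id() replaces the deprecated no-argument group_indices()
  mutate(new_id = cur_group_id()) %>%
  filter(new_id <= Top) %>%
  select(-new_id)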

sortedDf
## A tibble: 4 x 4
## Groups:   superchar [2]
#  subchar superchar cweight groupWeight
#  <fct>   <fct>       <dbl>       <dbl>
#1 18      age           0.8          70
#2 20      age           0.6          70
#3 male    gender        0.7          20
#4 female  gender        0.3          20
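
If the groups should be numbered by weight rather than by the sort order of their keys, one possible variation is to rank a per-group summary with dense_rank(). This is a sketch, not part of the answer above: the helper column groupMax and the result name rankedDf are hypothetical, and max() is again an assumed choice of summary.

```r
library(dplyr)

exampleDf <- data.frame(
  subchar = c("facebook", "twitter", "snapchat", "male", "female", "18", "20"),
  superchar = c("social media", "social media", "social media", "gender", "gender", "age", "age"),
  cweight = c(.2, .4, .4, .7, .3, .8, .6),
  groupWeight = c(10, 10, 10, 20, 20, 70, 70)
)

Top <- 2

rankedDf <- exampleDf %>%
  group_by(superchar) %>%
  # Hypothetical helper column: one summary value per group (max is an assumption).
  mutate(groupMax = max(groupWeight)) %>%
  ungroup() %>%
  # dense_rank() gives every row of the heaviest group rank 1, the next group rank 2, ...
  filter(dense_rank(desc(groupMax)) <= Top) %>%
  select(-groupMax)
```

On the example data this also keeps the age and gender rows. Groups with equal weight share a rank, so more than Top groups can survive the filter when there are ties.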
