1
votes

Is there an efficient way to grab some number of top groups from a data frame in R? For example:

exampleDf <- data.frame(
  subchar = c("facebook", "twitter", "snapchat", "male", "female", "18", "20"),
  superchar = c("social media", "social media", "social media", "gender", "gender", "age", "age"),
  cweight = c(.2, .4, .4, .7, .3, .8, .6),
  groupWeight = c(10, 10, 10, 20, 20, 70, 70)
)

So with dplyr I can group them and sort by group weight with:

sortedDf <- exampleDf %>%
  group_by(superchar) %>%
  arrange(desc(groupWeight))

But is there any way to select the 'top' groups, like age and gender in this case? Something like the dplyr slice() function, but applied to whole groups rather than to rows within a group.

What exactly do you mean by 'top groups'? - heds1
This seems like a weird example because groupWeight is unique within groups, but it's not a group id, so the groups don't really matter. In a case like this, I'd do exampleDf %>% filter(groupWeight %in% sort(unique(groupWeight), decreasing = TRUE)[1:2]). The more interesting case is if groupWeight varies within the group, but then you need to specify what summary function of groupWeight to use (mean, median, max, etc.). And probably the best way is a pipeline of the form exampleDf %>% group_by %>% summarize %>% top_n %>% left_join(exampleDf), which joins the top groups back to the original data. - Gregor Thomas
@heds1 Sorry, I should have clarified that. In this case it would be the groups with the highest groupWeight, so the rank here would be age first, then gender, with social media the lowest. - elynagh

2 Answers

2
votes

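One way is to assign each group a consecutive number and then filter on that number. Note that dplyr's group_indices() (cur_group_id() in dplyr 1.0.0 and later) numbers groups by the sorted order of the grouping key, not by the order produced by arrange(), so it cannot be used directly here. Instead, the code below sorts the rows first and numbers the groups by order of first appearance. In the example, I keep the first 2 groups.

library(dplyr)

Top <- 2

sortedDf <- exampleDf %>%
  # sort all rows so the heaviest groups come first
  arrange(desc(groupWeight)) %>%
  # number the groups by order of first appearance after sorting
  mutate(new_id = match(superchar, unique(superchar))) %>%
  filter(new_id <= Top) %>%
  select(-new_id) %>%
  group_by(superchar)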

sortedDf
## A tibble: 4 x 4
## Groups:   superchar [2]
#  subchar superchar cweight groupWeight
#  <fct>   <fct>       <dbl>       <dbl>
#1 18      age           0.8          70
#2 20      age           0.6          70
#3 male    gender        0.7          20
#4 female  gender        0.3          20
1
votes

Here are two other approaches using dplyr:

We calculate the sum of groupWeight for each superchar, select the top 2 groups, and do a left_join with the original data frame to get back all of their rows.

n <- 2
library(dplyr)

exampleDf %>%
  group_by(superchar) %>%
  summarise(sum_gr = sum(groupWeight)) %>%
  top_n(n, sum_gr) %>%
  left_join(exampleDf, by = "superchar")

# A tibble: 4 x 5
#  superchar sum_gr subchar cweight groupWeight
#  <fct>      <dbl> <fct>     <dbl>       <dbl>
#1 age          140 18          0.8          70
#2 age          140 20          0.6          70
#3 gender        40 male        0.7          20
#4 gender        40 female      0.3          20

Another approach is to sum groupWeight by superchar and use dense_rank to select top groups.

exampleDf %>%
  group_by(superchar) %>%
  mutate(sum_gr = sum(groupWeight)) %>%
  ungroup() %>%
  filter(dense_rank(-sum_gr) <= n)
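
If groupWeight can differ between rows of the same group (the case raised in the comments), a summary function has to be chosen before the groups can be ranked. The following is only a minimal sketch, assuming dplyr 1.0.0 or later and using max() as the summary. The data frame variedDf, its values, and the names n_top, topGroups and result are made up for illustration.

```r
library(dplyr)

# Made-up data: groupWeight now varies within each superchar group.
variedDf <- data.frame(
  subchar = c("facebook", "twitter", "male", "female", "18", "20"),
  superchar = c("social media", "social media", "gender", "gender", "age", "age"),
  groupWeight = c(10, 30, 20, 25, 70, 5)
)

n_top <- 2

# One row per group, scored with an assumed summary (max),
# then keep the n_top highest-scoring groups.
topGroups <- variedDf %>%
  group_by(superchar) %>%
  summarise(score = max(groupWeight), .groups = "drop") %>%
  slice_max(score, n = n_top, with_ties = FALSE)

# semi_join() keeps the original rows and columns of the selected groups.
result <- semi_join(variedDf, topGroups, by = "superchar")
result
```

With max() the selected groups are age (70) and social media (30). Swapping in mean() would select age (37.5) and gender (22.5) instead, which is why the summary has to be chosen deliberately.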

The first approach can be written in base R as:

temp <- aggregate(groupWeight~superchar, exampleDf, sum)
temp <- temp[order(temp$groupWeight, decreasing = TRUE), ][1:n, ]
merge(temp, exampleDf, all.x = TRUE, by = 'superchar')
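
The same idea can also be wrapped in a small function. This is a sketch rather than a definitive implementation: top_groups is a hypothetical helper name, and it assumes the group and weight columns are passed as strings and that groups are ranked by their summed weight, as in the code above.

```r
# Example data from the question.
exampleDf <- data.frame(
  subchar = c("facebook", "twitter", "snapchat", "male", "female", "18", "20"),
  superchar = c("social media", "social media", "social media", "gender", "gender", "age", "age"),
  cweight = c(.2, .4, .4, .7, .3, .8, .6),
  groupWeight = c(10, 10, 10, 20, 20, 70, 70)
)

# Hypothetical helper: keep all rows of the n groups with the largest summed weight.
top_groups <- function(df, group, weight, n) {
  # total weight per group, named by group
  totals <- tapply(df[[weight]], df[[group]], sum)
  # group names ordered from largest to smallest total
  ranked <- names(totals)[order(totals, decreasing = TRUE)]
  # keep rows whose group is among the first n names
  df[df[[group]] %in% head(ranked, n), , drop = FALSE]
}

res <- top_groups(exampleDf, "superchar", "groupWeight", 2)
res
```

Unlike the merge() version, this keeps the original columns and row order, because it filters exampleDf directly instead of joining a summary table back to it.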