Function similar to group_by when groups are not mutually exlcusive

Question

I would like to create a function in R, similar to dplyr's group_by function, that when combined with summarise can give summary statistics for a dataset where group membership is not mutually exclusive. I.e., observations can belong to multiple groups. One way to think about it might be to consider tags; observations may belong to one or more tags which might overlap.

For example, take R's esoph dataset (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/esoph.html) documenting a case-control study of esophageal cancer. Suppose I'm interested in the number and proportion of cancer cases overall and per 'tag', where the tags are: 65+ years old; 80+ gm/day alcohol; 20+ gm/day tobacco; and a 'high risk' group where the previous 3 criteria are met. Let's transform the dataset to long format (one participant per row) and then add these tags (logical columns) to the dataset:

library('dplyr')
data(esoph)
esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
                      esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
            ) %>% 
            mutate(highage=(agegp %in% c('65-74','75+')),
                   highalc=(alcgp %in% c('80-119','120+')),
                   hightob=(tobgp %in% c('20-29','30+')),
                   highrisk=(highage & highalc & hightob)
            )

My usual approach is to create a dataset where each observation is duplicated for every tag it belongs to, and then summarise this dataset:

esophdup = bind_rows(esophlong %>% filter(highage) %>% mutate(tag='age>=65'),
                     esophlong %>% filter(highalc) %>% mutate(tag='alc>=80'),
                     esophlong %>% filter(hightob) %>% mutate(tag='tob>=20'),
                     esophlong %>% filter(highrisk) %>% mutate(tag='high risk'),
                     esophlong %>% filter() %>% mutate(tag='all')
           ) %>%
           mutate(tag=factor(tag, levels = unique(.$tag)))

summary = esophdup %>%
          group_by(tag) %>%
          summarise(n=n(), ncases=sum(case), case.rate=mean(case))

This approach is inefficient for large datasets or for a large number of tags and I will often run out of memory to store it.

An alternative is to summarise each tag separately and then bind these summary datasets afterwards, as follows:

summary.age = esophlong %>%
              filter(highage) %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='age>=65')

summary.alc = esophlong %>%
              filter(highalc) %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='alc>=80')

summary.tob = esophlong %>%
              filter(hightob) %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='tob>=20')

summary.highrisk = esophlong %>%
              filter(highrisk) %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='high risk')

summary.all = esophlong %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='all')

summary=bind_rows(summary.age,summary.alc,summary.tob,summary.highrisk,summary.all)

This approach is time-consuming and tedious when I have a large number of tags or I want to reuse the tags often for different summary measures throughout a project.

The function I have in mind, say group_by_tags(data, key, ...), which includes an argument to specify the name of the grouping column, should work something like this:

summary = esophlong %>% 
          group_by_tags(key='tags',
                        'age>=65'=highage,
                        'alc>=80'=highalc,
                        'tob>=20'=hightob,
                        'high risk'=highrisk,
                        'all ages'=1
          ) %>%
          summarise(n=n(), ncases=sum(case), case.rate=mean(case))

with the summary dataset looking like this:

> summary
       tags     n ncases case.rate
1   age>=65   273     68 0.2490842
2   alc>=80   301     96 0.3189369
3   tob>=20   278     64 0.2302158
4 high risk    11      5 0.4545455
5       all  1175    200 0.1702128

Even better, it could take variables of type "factor" as well as "logical" so that it could summarise, say, each age group individually, the 65+ year olds, and everybody:

summaryage = esophlong %>% 
          group_by_tags(key='Age.group',
                        agegp,
                        '65+'=(agegp %in% c('65-74','75+')),
                        'all'=1                 
          ) %>%
          summarise(n=n(), ncases=sum(case), case.rate=mean(case))

>summaryage
  Age.group     n ncases case.rate
1     25-34   117      1 0.0085470
2     35-44   208      9 0.0432692
3     45-54   259     46 0.1776062
4     55-64   318     76 0.2389937
5     65-74   216     55 0.2546296
6       75+    57     13 0.2280702
7       65+   273     68 0.2490842
8       all  1175    200 0.1702128

Perhaps it's not possible with ... and instead you might need to pass a vector/list of column names for the tags.

Any ideas?

EDIT: to be clear, the solution should take tag/group definitions and the required summary statistics as arguments, rather than being built into the function itself. Either as a two-step data %>% group_by_tags(tags) %>% summarise_tags(stats) or a one-step data %>% summary_tags(tags,stats) process.

Doesn't group_by_tags imply that you would also have data not within each? IOW, your summary would have 10 rows: your 5 plus "! age>=65", etc. — r2evans
@r2evans only if it treated logical variables as factor variables. Though maybe the extension to include factors is a step too far — wjchulme
summarize assumes that the lengths of the grouped rows adds up to the number of rows in the data (which is not the case here), so you'd have to either create your own summarize function or add a class to dplyr, such as "grouped_df_partial". — r2evans

Frank Frank · Accepted Answer · 2016-08-23T21:26:46

This is a variation on @eddi's answer. I am taking the definitions of highage et al as part of the function's job:

library(data.table)
custom_summary = function(DT, tags, stats){
    setDT(DT)
    rows = stack(lapply(tags[-1], function(x) DT[eval(x), which=TRUE]))
    DT[rows$values, eval(stats), by=.(tag = rows$ind)]
}

And some example usage:

data(esoph)
library(dplyr)
esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
                      esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
            )

custom_summary(
    DT = esophlong, 
    tags = quote(list(
        'age>=65'   = agegp %in% c('65-74','75+'),
        'alc>=80'   = alcgp %in% c('80-119','120+'),
        'tob>=20'   = tobgp %in% c('20-29','30+'),
        'high risk' = eval(substitute(`age>=65` & `alc>=80` & `tob>=20`, as.list(tags))),
        'all ages'  = TRUE
    )),
    stats = quote(list(
        n           = .N, 
        n_cases     = sum(case), 
        case.rate   = mean(case)
    ))
)

         tag    n n_cases case.rate
1:   age>=65  273      68 0.2490842
2:   alc>=80  301      96 0.3189369
3:   tob>=20  278      64 0.2302158
4: high risk   11       5 0.4545455
5:  all ages 1175     200 0.1702128

The technique of using eval inside DT[...] is explained in the data.table FAQ.

Function similar to group_by when groups are not mutually exlcusive

5 Answers

Solution that doesn't work with `grouped_df` objects

Solution that does work with `grouped_df` objects

Function similar to group_by when groups are not mutually exlcusive

5 Answers

Solution that doesn't work with grouped_df objects

Solution that does work with grouped_df objects

Solution that doesn't work with `grouped_df` objects

Solution that does work with `grouped_df` objects