Summarizing with overlapping groups using dplyr

Question

What is the best way to create overlapping groups using dplyr?

For example, imagine you have the following data set

test <- data.frame(year = rep(as.character(2014:2016), 2), value = 1:6)

And you want to summarize using a group for each year, and then full period. Two ways to do this could be as follows:

Using bind_rows and mutate (and probably filter in a more complex example)

year.totals <- bind_rows(test %>% mutate(year = "2014:2016"),
                         test) %>% 
  group_by(year) %>% 
  summarize(value = sum(value))

Using gather

year.totals.2 <- test %>% 
  mutate(year.2 = "2014:2016") %>% 
  gather(drop, year, year, year.2) %>% 
  group_by(year) %>% 
  summarize(value = sum(value))

Is there a better way to do this?

I've also seen this question and answer, which I don't think is too bad but I would prefer to avoid the lapply.

You can do this in base R i.e. addmargins(rowsum(test$value, test$year), 1) — akrun
I want something that will work with a more complex example too such as where a group could be "2014:2015" and the summary statistic could be an average, or a weighted average etc. — ColinTea
I asked a very similar question (stackoverflow.com/q/39106850/4269699) though it's a bit more general. tl;dr - dplyr doesn't have this functionality explicitly. I still use the bind_rows() approach in the absence of anything better — wjchulme
This is a post I made about the gather method https://colintb.github.io/overlapping_groups-example/ — ColinTea

eipi10 eipi10 · Accepted Answer · 2017-03-21T18:20:35

You could create "rolling groups" and summarize by those. I've used the map function from the purrr package (which, like dplyr, is one of the tidyverse packages).

First, let's create two helper functions. Depending on your needs, you can generalize these to allow flexible choice of grouping column(s), choice of summary statistics, etc.

library(tidyverse)

# Create rolling year groups
year.groups = function(start.years, n) {
  ll = length(start.years)
  start.years = sort(start.years[-c((ll-(n-2)):ll)])
  map(as.numeric(as.character(start.years)), function(x) x:(x+(n-1)))
}

# Summarise by rolling group
roll.group = function(vec) {
   vec %>% map_df(~ test %>% 
         filter(year %in% .x) %>%
         group_by(year = paste(year[which.min(year)], year[which.max(year)], sep="-")) %>%
         summarise(value=sum(value)))
}

Now summarize by rolling group:

# Summarise by rolling year group and individual years
bind_rows(year.groups(unique(test$year), 2) %>% 
            roll.group(),
          year.groups(unique(test$year), 3) %>%
            roll.group(),
          test %>% group_by(year) %>%
            summarise(value=sum(value)))

       year value
1 2014-2015    12
2 2015-2016    16
3 2014-2016    21
4      2014     5
5      2015     7
6      2016     9

You asked for the "best" way to create overlapping groups in dplyr. I'd be surprised if this is the best way, but at least it seems to be a way.

Summarizing with overlapping groups using dplyr

1 Answers