dplyr - Aggregation incorrect?

Question

I have some issues with dplyr: the group_by function is not working as expected. Using summarise, I expect to get the mean of var1 for each unique combination of id and year passed to group_by.

This code creates a df with id-year observations, where I want to aggregate the mean of var1 for each combination of id and year. However, it is not working as expected: the output drops id and appears to group only on year.

df <- data.frame(id=c(1,1,2,2,2,3,3,4,4,5),
                 year=c(2013,2013,2012,2013,2013,2013,2012,2012,2013,2013), 
                 var1=rnorm(10))

dplyr code:

dfagg <- df %.%
  group_by(id, year) %.%
  select(id, year, var1) %.%
  summarise(
    var1=mean(var1)
    )

Result:

> dfagg
Source: local data frame [8 x 2]
Groups: year

  year        var1
1 2013  0.22924025
2 2012 -0.93073687
3 2013 -0.82351583
4 2012  0.05656113
5 2013 -0.21622021
6 2012  1.91158209
7 2013 -2.67003628
8 2013 -0.72662276

Any idea what is going on?

To make sure no other package was masking the dplyr functions, I tried the version below, with the same result.

dfagg <- df %.%
  dplyr::group_by(id, year) %.%
  dplyr::select(id, year, var1) %.%
  dplyr::summarise(
    var1=mean(var1)
    )

I cannot reproduce this (dplyr 0.1.1). Have you tried restarting R? — lukeA
you need to put select() either before the group_by() or after the summarize() call — Troy
@Troy, thanks. That solved the issue. However, I cannot seem to remember having to put select() before group_by() earlier. Maybe this is not necessary for it to work when there is only one group_by variable? — spesseh
@spesseh - yes, it seems to roll back to the last grouping variable - not sure if the behaviour is expected: probably it's a bug to be reported — Troy
Looks like a bug. I didn't think through all of the issues with select() and group_by() so there is still some bad behaviour. Can you please file a bug at github.com/hadley/dplyr/issues? — hadley

Dan Dan · Accepted Answer · 2017-09-15T10:54:29

I don't think you need the select() line. Just using the group_by() and summarise() did the trick for me.

library(dplyr)

df <- data.frame(id=c(1,1,2,2,2,3,3,4,4,5),
                 year=c(2013,2013,2012,2013,2013,2013,2012,2012,2013,2013), 
                 var1=rnorm(10))
dfagg <- df %>%
  group_by(id, year) %>%
  summarise(mean_var1=mean(var1))

Result:

     id  year   mean_var1
  (dbl) (dbl)       (dbl)
1     1  2013 -1.20744511
2     2  2012 -0.59159641
3     2  2013 -0.03660552
4     3  2012 -0.38853566
5     3  2013 -1.76459495
6     4  2012 -0.66926387
7     4  2013  0.70451751
8     5  2013 -0.82762769
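
If you do want to keep a select() step, the comments on the question suggest placing it before group_by() (or after summarise()). Below is a minimal sketch of that ordering, with a quick check that the result has one row per id-year pair. The set.seed() call and the final comparison are my own additions for illustration, not part of the original question, and I have only reasoned about this against dplyr versions that provide the %>% pipe.

```r
library(dplyr)

set.seed(1)  # arbitrary seed (an assumption), only to make the sketch repeatable
df <- data.frame(id   = c(1,1,2,2,2,3,3,4,4,5),
                 year = c(2013,2013,2012,2013,2013,2013,2012,2012,2013,2013),
                 var1 = rnorm(10))

# Pick the columns first, then group, then summarise
dfagg <- df %>%
  select(id, year, var1) %>%
  group_by(id, year) %>%
  summarise(mean_var1 = mean(var1))

# Sanity check: one output row per distinct id-year combination
n_pairs <- nrow(unique(df[, c("id", "year")]))
nrow(dfagg) == n_pairs
```

If the final comparison is not TRUE, the grouping was not applied to both variables, which is the symptom described in the question.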
