I have some issues with dplyr: the group_by function is not working as expected. Using summarise, I expect to get the mean of var1 for each unique combination of id and year, as passed to group_by.
The code below creates a df with id-year observations, where I want to compute the mean of var1 for each combination of id and year. However, this does not work as expected: the output ignores id and only aggregates on year.
df <- data.frame(id = c(1,1,2,2,2,3,3,4,4,5),
                 year = c(2013,2013,2012,2013,2013,2013,2012,2012,2013,2013),
                 var1 = rnorm(10))
dplyr code:
dfagg <- df %.%
  group_by(id, year) %.%
  select(id, year, var1) %.%
  summarise(
    var1 = mean(var1)
  )
Result:
> dfagg
Source: local data frame [8 x 2]
Groups: year
year var1
1 2013 0.22924025
2 2012 -0.93073687
3 2013 -0.82351583
4 2012 0.05656113
5 2013 -0.21622021
6 2012 1.91158209
7 2013 -2.67003628
8 2013 -0.72662276
Any idea what is going on?
To make sure no other package was interfering with the dplyr functions, I tried the code below, with the same result.
dfagg <- df %.%
  dplyr::group_by(id, year) %.%
  dplyr::select(id, year, var1) %.%
  dplyr::summarise(
    var1 = mean(var1)
  )
... select() either before the group_by() or after the summarize() call – Troy
... select() and group_by() so there is still some bad behaviour. Can you please file a bug at github.com/hadley/dplyr/issues? – hadley
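For anyone landing here, the workaround mentioned in the comments (calling select() before group_by()) might look roughly like the sketch below. It assumes a recent dplyr (1.0.0 or later), where the %.% operator has been replaced by %>% and summarise() accepts a .groups argument; the set.seed() call is my addition, only there to make the random data reproducible.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, not part of the original question
df <- data.frame(
  id   = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5),
  year = c(2013, 2013, 2012, 2013, 2013, 2013, 2012, 2012, 2013, 2013),
  var1 = rnorm(10)
)

# Pick the columns first, then group, so select() never runs on grouped data
dfagg <- df %>%
  select(id, year, var1) %>%
  group_by(id, year) %>%
  summarise(var1 = mean(var1), .groups = "drop")  # return an ungrouped result

dfagg
```

With this ordering the result should have one row per id-year combination (8 rows for the data above) and keep id, year and var1 as columns.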