
I have a simple example using dplyr (version 0.2).

I want a cumulative sum of var1 by ID. It works fine with ddply from plyr but not with dplyr. Is this user error? If so, can anyone point me in the right direction?

ID<-c(1,1,1,1,2,2,3,4,4,4,4)
var1<-c(32,55,22,12,34,21,23,42,11,9,20)
df<-data.frame(ID=ID,var1=var1)
df

# does not create cumsum by ID
IDs<-group_by(df,'ID')
transform(IDs,cumsum=cumsum(var1))

   ID var1 cumsum
1   1   32     32
2   1   55     87
3   1   22    109
4   1   12    121
5   2   34    155
6   2   21    176
7   3   23    199
8   4   42    241
9   4   11    252
10  4    9    261
11  4   20    281

# works correctly
ddply(.data=df, .variables=('ID'),.fun=transform,cumsum=cumsum(var1))


   ID var1 cumsum
1   1   32     32
2   1   55     87
3   1   22    109
4   1   12    121
5   2   34     34
6   2   21     55
7   3   23     23
8   4   42     42
9   4   11     53
10  4    9     62
11  4   20     82
use mutate, not transform - AndrewMacDonald
D'oh! I was also having an issue with mutate because plyr was loaded as well. dplyr::mutate(IDs,cumsum=cumsum(var1)) works perfectly! - B_Miner
YES. In fact I have had that same problem so much that I have started loading only dplyr and summoning plyr::whatever only at great need - AndrewMacDonald
or wrap transform in a do: IDs %>% do(transform(., cumsum = cumsum(var1))) - G. Grothendieck

1 Answer


group_by returns the table with a new class (grouped_df) and extra attributes that record the grouping (and occasionally adds columns).

If the function you pass the grouped table to doesn't recognise this (typically because it's not a dplyr verb), it will treat it like a regular ungrouped table.

So transform(IDs,cumsum=cumsum(var1)) will not work as intended, while mutate(IDs,cumsum=cumsum(var1)) will.

do is a dplyr verb, so do(IDs,transform(., cumsum = cumsum(var1))) will work as well.
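
To make the mutate route concrete, here is a minimal sketch using the data from the question. It assumes dplyr is installed; the names by_id, res and cum_var1 are mine, chosen for illustration (cum_var1 also avoids naming a column after the cumsum function). The explicit dplyr:: prefix is there because, as noted in the comments, a loaded plyr can mask mutate.

```r
library(dplyr)

df <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 4),
  var1 = c(32, 55, 22, 12, 34, 21, 23, 42, 11, 9, 20)
)

# Group by the bare column name (no quotes), then call mutate with an
# explicit namespace so that plyr::mutate cannot be picked up instead.
by_id <- dplyr::group_by(df, ID)
res   <- dplyr::mutate(by_id, cum_var1 = cumsum(var1))

# Drop the grouping and convert back to a plain data.frame before
# handing the result to functions that are not dplyr verbs.
res <- as.data.frame(dplyr::ungroup(res))
print(res)
```

The cum_var1 column should restart at each new ID, matching the ddply output in the question.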

The class and attributes of your grouped table:

ID<-c(1,1,1,1,2,2,3,4,4,4,4)
var1<-c(32,55,22,12,34,21,23,42,11,9,20)
df<-data.frame(ID=ID,var1=var1)
IDs<-group_by(df,ID) # without quotes!

class(IDs)
# [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
attributes(IDs)
# $names
# [1] "ID"   "var1"
# 
# $row.names
# [1]  1  2  3  4  5  6  7  8  9 10 11
# 
# $class
# [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
# 
# $vars
# [1] "ID"
# 
# $drop
# [1] TRUE
# 
# $indices
# $indices[[1]]
# [1] 0 1 2 3
# 
# $indices[[2]]
# [1] 4 5
# 
# $indices[[3]]
# [1] 6
# 
# $indices[[4]]
# [1]  7  8  9 10
# 
# 
# $group_sizes
# [1] 4 2 1 4
# 
# $biggest_group_size
# [1] 4
# 
# $labels
# ID
# 1  1
# 2  2
# 3  3
# 4  4

And here's a bonus base R solution:

do.call(rbind,by(df,df$ID,function(IDs){transform(IDs,cumsum=cumsum(var1))}))
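
Another base R option, offered as a sketch rather than a replacement for the above, is ave, which applies a function within the groups defined by one or more grouping vectors and returns a vector in the original row order. The column name cum_var1 is my own choice.

```r
df <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 4),
  var1 = c(32, 55, 22, 12, 34, 21, 23, 42, 11, 9, 20)
)

# ave() splits var1 by ID, applies cumsum to each piece, and puts the
# results back in the original row positions. FUN must be passed by name.
df$cum_var1 <- ave(df$var1, df$ID, FUN = cumsum)
print(df)
```

Unlike the do.call(rbind, by(...)) version, this keeps the original row names and row order without rebuilding the data frame.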