I'm trying to create a window function with dplyr, that will return a new vector with the difference between each value and the first of its group. For example, given this dataset:

dummy <- data.frame(userId=rep(1,6),
     libId=rep(999,6),
     curatorId=c(1:2,1:2,1:2),
     iterationNum=c(0,0,1,1,2,2),
     rf=c(5,10,0,15,30,40)
)

That creates this dataset:

  userId libId curatorId iterationNum rf
1      1   999         1            0  5
2      1   999         2            0 10
3      1   999         1            1  0
4      1   999         2            1 15
5      1   999         1            2 30
6      1   999         2            2 40

And given this grouping:

dummy <- group_by(dummy, libId, userId, curatorId)

I would like to get this result:

  userId libId curatorId iterationNum rf rf.diff
1      1   999         1            0  5       0
2      1   999         2            0 10       0
3      1   999         1            1  0      -5
4      1   999         2            1 15       5
5      1   999         1            2 30      25
6      1   999         2            2 40      30

So for each group of users, libs and curators, I would get the rf value, minus the rf value with iterationNum=0. I tried playing with the first function, the rank function and others, but couldn't find a way to nail it.

---EDIT---

This is what I tried:

dummy %>% 
  group_by(userId,libId,curatorId) %>% 
  mutate(rf.diff = rf - subset(dummy,iterationNum==0)[['rf']])

And:

dummy %>% 
  group_by(userId,libId,curatorId) %>% 
  mutate(rf.diff = rf - first(x = rf,order_by=iterationNum))

Which crashes R and returns this error message:

pure virtual method called
terminate called after throwing an instance of 'Rcpp::exception'
  what(): incompatible size (%d), expecting %d (the group size) or 1

It seems that you already know all the functions you need to do this. Can you show what you tried and what did not work as expected? Perhaps you just need to arrange (order) your data before computing the differences. - talat

You were close. Use rf - rf[iterationNum == 0] inside the mutate instead. The other option is to arrange the data with arrange(iterationNum) as a separate step in the pipe and then use rf - first(rf) in the mutate, if you are sure that each group has a 0 in iterationNum and no lower values. - talat

rf - first(rf, iterationNum) - hadley

Thanks @docendodiscimus! That worked! How do I make sure the order is correct with this syntax? - Omri374

@hadley, I got an error. First it said "Error: all arguments of 'first' after the first one should be named". Then, when I wrote mutate(rf.diff=rf-first(rf,order_by=iterationNum)), my R session crashed with this message: pure virtual method called - Omri374

1 Answer

The two approaches I suggested in the comments above are as follows.

dummy %>%
  group_by(libId, userId, curatorId) %>%
  mutate(rf.diff = rf - rf[iterationNum == 0])
#Source: local data frame [6 x 6]
#Groups: libId, userId, curatorId
#
#  userId libId curatorId iterationNum rf rf.diff
#1      1   999         1            0  5       0
#2      1   999         2            0 10       0
#3      1   999         1            1  0      -5
#4      1   999         2            1 15       5
#5      1   999         1            2 30      25
#6      1   999         2            2 40      30

Or using arrange to order the data by iterationNum:

dummy %>%
  arrange(iterationNum) %>%
  group_by(libId, userId, curatorId) %>%
  mutate(rf.diff = rf - first(rf))
#Source: local data frame [6 x 6]
#Groups: libId, userId, curatorId
#
#  userId libId curatorId iterationNum rf rf.diff
#1      1   999         1            0  5       0
#2      1   999         2            0 10       0
#3      1   999         1            1  0      -5
#4      1   999         2            1 15       5
#5      1   999         1            2 30      25
#6      1   999         2            2 40      30

As you can see, both produce the same output for the sample data.
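
On the follow-up question about row order: rf[iterationNum == 0] selects the baseline by value, so it should not depend on how the rows are sorted. The sketch below checks this on a deliberately reversed copy of the sample data and compares it with first(rf, order_by = iterationNum). It assumes a recent dplyr release in which first() accepts order_by without crashing, and the names shuffled, diff_by_index and diff_by_order are only illustrative.

```r
library(dplyr)

dummy <- data.frame(
  userId       = rep(1, 6),
  libId        = rep(999, 6),
  curatorId    = c(1:2, 1:2, 1:2),
  iterationNum = c(0, 0, 1, 1, 2, 2),
  rf           = c(5, 10, 0, 15, 30, 40)
)

# Reverse the rows so that iterationNum == 0 is no longer first in each group.
shuffled <- dummy[nrow(dummy):1, ]

result <- shuffled %>%
  group_by(libId, userId, curatorId) %>%
  mutate(
    # Baseline picked by value: independent of row order, but it needs
    # exactly one row with iterationNum == 0 in every group.
    diff_by_index = rf - rf[iterationNum == 0],
    # Baseline picked as the rf with the smallest iterationNum in the group.
    diff_by_order = rf - first(rf, order_by = iterationNum)
  ) %>%
  ungroup()

# Both columns should agree even though the input was not sorted.
stopifnot(isTRUE(all.equal(result$diff_by_index, result$diff_by_order)))
```

Note that the two variants differ when a group has no iterationNum == 0 row: the indexing version then errors because the subtraction has length zero, while the order_by version silently uses the lowest iterationNum present as the baseline.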