dplyr: lead() and lag() wrong when used with group_by()

Question

I want to get the lead() and lag() values within each group, but I am getting wrong results.

For example, with this data:

library(dplyr)
df <- data.frame(name = rep(c('Al', 'Jen'), 3),
                 score = rep(c(100, 80, 60), 2))
df

Data:

  name score
1   Al   100
2  Jen    80
3   Al    60
4  Jen   100
5   Al    80
6  Jen    60

Now I want the lead() and lag() scores for each person. If I first sort the data with arrange(), I get the correct answer:

df %>%
  arrange(name) %>%
  group_by(name) %>%
  mutate(next.score = lead(score),
         before.score = lag(score) )

OUTPUT1:

Source: local data frame [6 x 4]
Groups: name

  name score next.score before.score
1   Al   100         60           NA
2   Al    60         80          100
3   Al    80         NA           60
4  Jen    80        100           NA
5  Jen   100         60           80
6  Jen    60         NA          100

Without arrange(), the result is wrong:

df %>%
  group_by(name) %>%
  mutate(next.score = lead(score),
         before.score = lag(score) )

OUTPUT2:

Source: local data frame [6 x 4]
Groups: name

  name score next.score before.score
1   Al   100         80           NA
2  Jen    80         60           NA
3   Al    60        100           80
4  Jen   100         80           60
5   Al    80         NA          100
6  Jen    60         NA           80

For example, in row 1, Al's next.score should be 60 (Al's next row is row 3), not 80.

Does anybody know why this happens? Why does arrange() affect the values in the result, and not just the row order? Thanks.
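
For reference, the expected per-group values can be computed without dplyr, which gives something to compare OUTPUT2 against. The following is a minimal base R sketch; the helper names shift_next and shift_prev are made up for this illustration and are not part of any package.

```r
df <- data.frame(name = rep(c("Al", "Jen"), 3),
                 score = rep(c(100, 80, 60), 2))

# Hypothetical helpers: shift a vector by one position, padding with NA.
shift_next <- function(x) c(x[-1], NA)          # value from the following row
shift_prev <- function(x) c(NA, x[-length(x)])  # value from the preceding row

# ave() applies the function within each name group and
# returns the results in the original row order.
df$next.score   <- ave(df$score, df$name, FUN = shift_next)
df$before.score <- ave(df$score, df$name, FUN = shift_prev)
df
# next.score:    60 100  80  60  NA  NA
# before.score:  NA  NA 100  80  60 100
```

These values agree with OUTPUT1 once its rows are mapped back to the original order, and they differ from OUTPUT2.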

@DavidArenburg it's not the sorting, the OP asks why the result is 80 when in the original data frame the next result is 60. It's like Jen's result was picked instead of Al's — Panagiotis Kanavos
And I can't repro. Which version of R are you using? I get 1 Al 100 60 NA with R 3.1.2 on Windows 7 — Panagiotis Kanavos
@PanagiotisKanavos, yeah you are right. I didn't notice that. — David Arenburg
I can reproduce the weird results (0.4.1.9000). I think (after a quick, groggy-eyed glance at the source of the series of function calls) it's because the underlying code is going by actual overall row-index instead of the relative row-index. That might explain lead (I think pmin is the place of the weirdness), but not sure what's going on with lag (didn't look there). — hrbrmstr
This seems to be a bug in the latest version 0.4.1 of dplyr and was already reported here — alex23lemm
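
The numbers in OUTPUT2 fit the row-index explanation suggested in the comments. The base R sketch below shifts the whole score column while ignoring the groups, then blanks the last row of each group for the lead and the first row of each group for the lag. This is only an observation about the reported output, not a statement about how dplyr is implemented, and the variable names are chosen for this illustration.

```r
name  <- rep(c("Al", "Jen"), 3)
score <- rep(c(100, 80, 60), 2)
n <- length(score)

# Shift across the whole column, ignoring the groups.
global_lead <- c(score[-1], NA)
global_lag  <- c(NA, score[-n])

# Mark the first and last row of each group.
is_first <- !duplicated(name)
is_last  <- !duplicated(name, fromLast = TRUE)

# Blank the group boundaries.
global_lead[is_last] <- NA
global_lag[is_first] <- NA

global_lead
# 80 60 100 80 NA NA   (same as next.score in OUTPUT2)
global_lag
# NA NA 80 60 100 80   (same as before.score in OUTPUT2)
```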

Tomasz Sosiński · Accepted Answer · 2015-04-01T15:14:10

It seems you have to pass an additional argument to the lag and lead functions. When I run your code without arrange() but with order_by added, the results look correct.

df %>%
  group_by(name) %>%
  mutate(next.score = lead(score, order_by = name),
         before.score = lag(score, order_by = name))

Output:

  name score next.score before.score
1   Al   100         60           NA
2  Jen    80        100           NA
3   Al    60         80          100
4  Jen   100         60           80
5   Al    80         NA           60
6  Jen    60         NA          100
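
To check whether a given dplyr installation is affected, one option is to compare the grouped result on the unsorted data with the result after arrange(name), aligned back to the original row order. This is a sketch under assumptions: row_id is a helper column added only for this illustration, and both comparisons should be TRUE only on a version where the bug is fixed.

```r
library(dplyr)

df <- data.frame(name = rep(c("Al", "Jen"), 3),
                 score = rep(c(100, 80, 60), 2))

# Helper column (not in the original data) to restore the original row order.
df$row_id <- seq_len(nrow(df))

# Grouped lead/lag on the data as given.
unsorted <- df %>%
  group_by(name) %>%
  mutate(next.score = lead(score),
         before.score = lag(score)) %>%
  ungroup()

# Same computation after sorting by name, then put back in original order.
sorted <- df %>%
  arrange(name) %>%
  group_by(name) %>%
  mutate(next.score = lead(score),
         before.score = lag(score)) %>%
  ungroup() %>%
  arrange(row_id)

same_next <- identical(unsorted$next.score, sorted$next.score)
same_prev <- identical(unsorted$before.score, sorted$before.score)
c(same_next, same_prev)
```

If either comparison is FALSE, the installed version gives results that depend on row order, and the order_by or arrange() workaround is needed.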

My sessionInfo():

R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250        LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.1

loaded via a namespace (and not attached):
[1] assertthat_0.1  DBI_0.3.1       lazyeval_0.1.10 magrittr_1.5    parallel_3.1.1  Rcpp_0.11.5
[7] tools_3.1.1
