
As described in numerous questions on here, I should be able to take a data.frame, group it, sort by date, and then apply cumsum, to get the cumulative sum over time per grouping.

Instead, with dplyr 0.8.0, I'm getting cumulative sums that ignore the grouping.

Example code:

data.frame(
  cat = sample(c("a", "b", "c"), size = 1000, replace = TRUE),
  date = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"), 1000, replace = TRUE)
) %>%
  mutate(x = 1) %>%
  arrange(date) %>%
  group_by(cat) %>%
  mutate(x = cumsum(x)) %>%
  tail()

Now, I'd expect the last few rows to have x equal to around 300-something, for each group.

Instead I get:

# A tibble: 6 x 3
# Groups:   cat [2]
  cat   date           x
  <chr> <date>     <dbl>
1 a     1999-12-31   995
2 a     1999-12-31   996
3 c     2000-01-01   997
4 a     2000-01-01   998
5 c     2000-01-01   999
6 a     2000-01-01  1000

What am I doing wrong?

I cannot reproduce your numbers. For me, all x values are roughly around 300. - coffeinjunky
For reference, I tried this using dplyr 0.7.2. - coffeinjunky
Can you tell me if you get the same results in dplyr 0.8.0? Part of me will feel better if it's a regression... - Bob
When you submit a bug report, I suggest making the problem significantly smaller (perhaps 4 rows), with either static data or set.seed. You could demonstrate a grouping problem without generating 1000 random values, e.g. data_frame(a=rep(1:2,2),b=1:4) %>% group_by(a) %>% mutate(x=cumsum(b)), expecting 1,2,4,6. - r2evans
Yup, it's a regression in 0.8.0... - Bob

1 Answer


I'm guessing this is the classic problem of loading plyr after dplyr, so that plyr's mutate masks dplyr's. It has nothing to do with your version of dplyr. For example:

tmp1 <- data.frame(
  cat = sample(c("a", "b", "c"), size = 1000, replace = TRUE),
  date = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"), 1000, replace = TRUE)
) %>%
  mutate(x = 1)

Compare the output of

tmp1 %>%
  arrange(date) %>%
  group_by(cat) %>%
  plyr::mutate(x = cumsum(x)) %>%
  tail()

and

tmp1 %>% 
  arrange(date) %>%
  group_by(cat) %>%
  dplyr::mutate(x = cumsum(x)) %>%
  tail()

plyr's mutate doesn't understand grouping: it evaluates cumsum over the whole column, so you get a single running total across all groups.
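The same difference can be seen on a much smaller, deterministic data frame. This is only a sketch: it assumes both plyr and dplyr are installed, and the names df, grouped, grouped_x and ungrouped_x are made up for illustration. plyr is called with the plyr:: prefix rather than attached, so nothing gets masked here.

```r
library(dplyr)

# Four rows, two groups: a = 1, 2, 1, 2 and b = 1, 2, 3, 4
df <- data.frame(a = rep(1:2, 2), b = 1:4)
grouped <- df %>% group_by(a)

# dplyr's mutate restarts cumsum within each value of `a`
grouped_x <- dplyr::mutate(grouped, x = cumsum(b))$x
print(grouped_x)    # 1 2 4 6

# plyr's mutate ignores the grouping and sums down the whole column
ungrouped_x <- plyr::mutate(grouped, x = cumsum(b))$x
print(ungrouped_x)  # 1 3 6 10
```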

You can check whether this is the problem by calling search() and seeing whether plyr is listed ahead of dplyr.
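A rough way to run that check, assuming a session where dplyr is attached (the variable mutate_source is just an illustrative name):

```r
library(dplyr)

# Attached packages in lookup order; an entry nearer the front wins
# when two packages export the same function name.
print(search())

# Which package does a bare `mutate` resolve to right now?
mutate_source <- environmentName(environment(mutate))
print(mutate_source)
```

If this prints "plyr" rather than "dplyr", the masking described above is the likely cause. Calling dplyr::mutate explicitly, or loading plyr before dplyr, should avoid it.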