1
votes

So I've got an ID an event, and I want to use group_by (or some similar function) to do a conditional cumulative sum. Here's the data:

ID  Event
42  NA
42  1
42  2
42  NA
42  1
43  NA
43  1
43  2
43  2

and what I want to do is have two new columns that count the 1s and the 2s, cumulatively, without collapsing any of the data:

ID  Event   count_1s    count_2s
42  NA      0           0
42  1       1           0
42  2       1           1
42  NA      1           1
42  1       2           1
43  NA      0           0
43  1       1           0
43  2       1           1
43  2       1           2

So I understand how to use group_by to sum them all up by ID, something like this:

t <- data %>% group_by(ID, Event) %>% summarize(count_1s = sum(!is.na(Event == 1)))

But what I can't understand is how to get a running conditional sum -- seems like group_by will collapse my data and I need to maintain every row.

EDIT: so the accepted answer works perfectly, but just one more question. What if the values are different, by event? For example:

ID  Event   count_a count_b
42  NA      0           0
42  1       1           0
42  2       1           1
42  NA      1           1
42  1       2           1
43  NA      0           0
43  3       1           0
43  4       1           1
43  4       1           2 

There will always only be two Event values per ID, (doesn't matter which is a and which is b) and I want them to reset each time.

1

1 Answers

6
votes

After grouping by the 'ID' column, we can create the 'count_1s' by comparing the elements in 'Event' that is equal to 1 and is not an NA, then cumsum the logical index. In a similar way, we can create the 'count_2s'.

 library(dplyr)
 data %>% 
    group_by(ID) %>%
    mutate(count_1s= cumsum(Event==1 & !is.na(Event)),
           count_2s= cumsum(Event==2 & !is.na(Event)))
#     ID Event count_1s count_2s
#  (int) (int)    (int)    (int)
#1    42    NA        0        0
#2    42     1        1        0
#3    42     2        1        1
#4    42    NA        1        1
#5    42     1        2        1
#6    43    NA        0        0
#7    43     1        1        0
#8    43     2        1        1
#9    43     2        1        2

Update

Using the OP's updated dataset, after we group by 'ID', we can create a new column 'Event1', by converting the 'Event' to factor class and then coerce it back to numeric class (or another option is matching the 'Event' with unique elements of 'Event'), and then create the 'count_a', and 'count_b' as before.

data2 %>%
   group_by(ID) %>% 
   mutate(Event1= as.numeric(factor(Event, levels=unique(Event))), 
          count_a= cumsum(Event1==1 & !is.na(Event1)), 
          count_b= cumsum(Event1==2 & !is.na(Event1))) %>%
   select(-Event1)
#    ID Event count_a count_b
#  (int) (int)   (int)   (int)
#1    42    NA       0       0
#2    42     1       1       0
#3    42     2       1       1
#4    42    NA       1       1
#5    42     1       2       1
#6    43    NA       0       0
#7    43     3       1       0
#8    43     4       1       1
#9    43     4       1       2

data

data <- structure(list(ID = c(42L, 42L, 42L, 42L, 42L, 43L, 43L, 43L, 
43L), Event = c(NA, 1L, 2L, NA, 1L, NA, 1L, 2L, 2L)), .Names = c("ID", 
"Event"), class = "data.frame", row.names = c(NA, -9L))

data2 <- structure(list(ID = c(42L, 42L, 42L, 42L, 42L, 43L, 43L, 43L, 
43L), Event = c(NA, 1L, 2L, NA, 1L, NA, 3L, 4L, 4L)), .Names = c("ID", 
"Event"), row.names = c(NA, -9L), class = "data.frame")