5
votes

I have a data.table in R that contains multiple status values for each user collected at different time points. I want to compare the the status values at consecutive time points and update the rows with a flag whenever the status changes. Please see below for an example

DT_A <- data.table(sid=c(1,1,2,2,2,3,3), date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22","2014-06-23")), Status1 = c("A","B","A","A","B","A","A"), Status2 = c("C","C","C","C","D","D","E"))
DT_A_Final <- data.table(sid=c(1,1,2,2,2,3,3), date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22","2014-06-23")), Status1 = c("0","1","0","0","1","0","0"), Status2 = c("0","0","0","0","1","0","1"))

The original data table DT_A is

    sid date    Status1 Status2
1   1   2014-06-22  A   C
2   1   2014-06-23  B   C
3   2   2014-06-22  A   C
4   2   2014-06-23  A   C
5   2   2014-06-24  B   D
6   3   2014-06-22  A   D
7   3   2014-06-23  A   E

The final required data table is DT_A_final

    sid date    Status1 Status2
1   1   2014-06-22  0   0
2   1   2014-06-23  1   0
3   2   2014-06-22  0   0
4   2   2014-06-23  0   0
5   2   2014-06-24  1   1
6   3   2014-06-22  0   0
7   3   2014-06-23  0   1

Please help how I can I achieve this?

3

3 Answers

7
votes

Here is an option:

DT_A[, 
  c("S1Change", "S2Change") := 
    lapply(.SD, function(x) c(0, head(x, -1L) != tail(x, -1L))),
  .SDcols=c("Status1", "Status2"),   # .SD contains just these columns
  by=sid
]

Here, we create two new columns, which we populate by lapply over .SD (defined to contain just Status1 and Status2). The function compares all but the first value of a column to all but the last of the same column. This will return TRUE any time there is change in a column. We add 0 at the beginning since the first value is never a change; this also coerces the result to a numeric vector (thanks eddi).

Then, we just by by sid, and voila:

   sid       date Status1 Status2 S1Change S2Change
1:   1 2014-06-22       A       C        0        0
2:   1 2014-06-23       B       C        1        0
3:   2 2014-06-22       A       C        0        0
4:   2 2014-06-23       A       C        0        0
5:   2 2014-06-24       B       D        1        1
6:   3 2014-06-22       A       D        0        0
7:   3 2014-06-23       A       E        0        1

You can easily subset this to drop the original status columns if you want. It isn't possible to re-use them because the data type of the result is different than the original (numeric vs. character).

3
votes

A dplyr approach would also work here. Start by creating a function to compare all elements in a vector to the first element, and then apply this to all the "Status" variables:

library(dplyr)
library(magrittr)

equal_first <- function(x) {
  x %>% equals(x[1]) %>% not %>% as.numeric
}

DT_A %>%
  group_by(sid) %>%
  mutate_each(funs(equal_first),starts_with("Status"))
  sid       date Status1 Status2
1   1 2014-06-22       0       0
2   1 2014-06-23       1       0
3   2 2014-06-22       0       0
4   2 2014-06-23       0       0
5   2 2014-06-24       1       1
6   3 2014-06-22       0       0
7   3 2014-06-23       0       1

If you have more than one status change per user, you want to compare to the previous value, not the first:

equal_prev <- function(x) {
  x %>% equals(lag(x, default = x[1])) %>% not %>% as.numeric
}

DT_A %>%
  group_by(sid) %>%
  mutate_each(funs(equal_prev),starts_with("Status"))
1
votes

An approach using set

for(col in c('Status1','Status2')){
  ones <-  DT_A[, .I[1L]  ,by=c('sid',col)][,V1[-1L],by=sid][['V1']]

  set(DT_A, j=col,value='0')
  set(DT_A, j=col,i=ones,value='1')

}

Note I have retained Status1/Status2 as character variables, to create integer variables, use

for(col in c('Status1','Status2')){
  ones <-  DT_A[, .I[1L]  ,by=c('sid',col)][,V1[-1L],by=sid][['V1']]
  set(DT_A, j=col, value=NULL)
  set(DT_A, j=col,value=0L)
  set(DT_A, j=col,i=ones,value=1L)

}