Compare consecutive rows in data.table and replace row values

Question

I have a data.table in R that contains multiple status values for each user collected at different time points. I want to compare the the status values at consecutive time points and update the rows with a flag whenever the status changes. Please see below for an example

DT_A <- data.table(sid=c(1,1,2,2,2,3,3), date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22","2014-06-23")), Status1 = c("A","B","A","A","B","A","A"), Status2 = c("C","C","C","C","D","D","E"))
DT_A_Final <- data.table(sid=c(1,1,2,2,2,3,3), date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22","2014-06-23")), Status1 = c("0","1","0","0","1","0","0"), Status2 = c("0","0","0","0","1","0","1"))

The original data table DT_A is

    sid date    Status1 Status2
1   1   2014-06-22  A   C
2   1   2014-06-23  B   C
3   2   2014-06-22  A   C
4   2   2014-06-23  A   C
5   2   2014-06-24  B   D
6   3   2014-06-22  A   D
7   3   2014-06-23  A   E

The final required data table is DT_A_final

    sid date    Status1 Status2
1   1   2014-06-22  0   0
2   1   2014-06-23  1   0
3   2   2014-06-22  0   0
4   2   2014-06-23  0   0
5   2   2014-06-24  1   1
6   3   2014-06-22  0   0
7   3   2014-06-23  0   1

Please help how I can I achieve this?

BrodieG BrodieG · Accepted Answer · 2014-06-24T19:47:11

Here is an option:

DT_A[, 
  c("S1Change", "S2Change") := 
    lapply(.SD, function(x) c(0, head(x, -1L) != tail(x, -1L))),
  .SDcols=c("Status1", "Status2"),   # .SD contains just these columns
  by=sid
]

Here, we create two new columns, which we populate by lapply over .SD (defined to contain just Status1 and Status2). The function compares all but the first value of a column to all but the last of the same column. This will return TRUE any time there is change in a column. We add 0 at the beginning since the first value is never a change; this also coerces the result to a numeric vector (thanks eddi).

Then, we just by by sid, and voila:

   sid       date Status1 Status2 S1Change S2Change
1:   1 2014-06-22       A       C        0        0
2:   1 2014-06-23       B       C        1        0
3:   2 2014-06-22       A       C        0        0
4:   2 2014-06-23       A       C        0        0
5:   2 2014-06-24       B       D        1        1
6:   3 2014-06-22       A       D        0        0
7:   3 2014-06-23       A       E        0        1

You can easily subset this to drop the original status columns if you want. It isn't possible to re-use them because the data type of the result is different than the original (numeric vs. character).

Compare consecutive rows in data.table and replace row values

3 Answers