R data.table: label "identity" of consecutive values based on values of other columns?

Question

I have the following data.table (which is much larger, but this shows this structure)

library(data.table)

dt = data.table(first_column = c("A", "B", "B", "B", "C", "A", "A", "A", "D", "B", "A", "A"), second_column =c(0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0))

> dt
    first_column second_column
 1:            A             0
 2:            B             1
 3:            B             1
 4:            B             1
 5:            C             0
 6:            A             0
 7:            A             0
 8:            A             1
 9:            D             1
10:            B             1
11:            A             1
12:            A             0
....

The resulting data.table should look like this:

  first_column  second_column    identity    percent
0             A              0           0          0
1             B              1          AB        1.0
2             B              1          AB        1.0
3             B              1          AB        1.0
4             C              0           0          0
5             A              0           0          0
6             A              0           0          0
7             A              1       mixed       0.75  # 3/4, 3-AB, 4-total
8             D              1       mixed       0.75
9             B              1       mixed       0.75
10            A              1       mixed       0.75
11            A              0           0          0 
....

The column first_column has for each row A, B, C, and D. In the second column, there is a binary label denoting a group of values. All consecutive groupings of 1's are a unique "group", e.g. rows 1-3 is one group, rows 7-10 is another group.

I would like to "label" each one of these groups by being either "AB" (the group is only composed of A or B), "CD" (the group is only composed of C or D), or "mixed" (if there is a mixture, e.g. all B and one C). It would also be useful to know "how" mixed some of these groupings are with a percentage, i.e. the percentage of AB's out of total labels. If it is only A or B, the identity should be AB. If it is only C or D, the identity should be CD. It is a mixture of A,B,C, and/or D, then it is mixed. The percentage is (# of AB rows)/(# of total rows)

For the identity column, I suspect one could use a ifelse() statement, e.g.

dt$identity = ifelse( ((dt$second_column == 1) & (dt$first_column == "A") & (dt$first_column == "B") & (dt$first_column != "C") & (dt$first_column != "D")), "AB", 0)   

dt$identity = ifelse( ((dt$second_column == 1) & (dt$identity != "AB") & (dt$first_column == "C") & (dt$first_column == "D") & (dt$first_column != "A") & (dt$first_column != "B")), "CD", 0)

But I'm getting into logical ifelse() difficulties defining the "mixed case". I'm also entirely unsure how to measure the percentage of AB/all rows.

Mike H. Mike H. · Accepted Answer · 2017-04-24T21:25:49

You could use rleid and data.table to get what you want. I use rleid to get the grouping variable and then I create the identity and percent variables based on your conditions for each group. Thanks to @DavidArenburg for the suggestions!

dt[,c("identity", "percent") := list(if(all(first_column == "A") | all(first_column == "B")) "AB"
                                      else if (all(first_column == "C") | all(first_column == "D")) "CD"
                                      else "mixed", sum(first_column %in% c("A","B")) / .N), by =  rleid(second_column)]
dt[second_column==0, c("identity","percent") := list("0", 0)]

dt
 #       first_column second_column identity percent
 #1:            A             0        0    0.00
 #2:            B             1       AB    1.00
 #3:            B             1       AB    1.00
 #4:            B             1       AB    1.00
 #5:            C             0        0    0.00
 #6:            A             0        0    0.00
 #7:            A             0        0    0.00
 #8:            A             1    mixed    0.75
 #9:            D             1    mixed    0.75
#10:            B             1    mixed    0.75
#11:            A             1    mixed    0.75
#12:            A             0        0    0.00

You could chain the calls but I didn't for readability.

R data.table: label "identity" of consecutive values based on values of other columns?

1 Answers