I have the following data.table (which is much larger, but this shows this structure)
library(data.table)
dt = data.table(first_column = c("A", "B", "B", "B", "C", "A", "A", "A", "D", "B", "A", "A"), second_column =c(0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0))
> dt
first_column second_column
1: A 0
2: B 1
3: B 1
4: B 1
5: C 0
6: A 0
7: A 0
8: A 1
9: D 1
10: B 1
11: A 1
12: A 0
....
The resulting data.table should look like this:
first_column second_column identity percent
0 A 0 0 0
1 B 1 AB 1.0
2 B 1 AB 1.0
3 B 1 AB 1.0
4 C 0 0 0
5 A 0 0 0
6 A 0 0 0
7 A 1 mixed 0.75 # 3/4, 3-AB, 4-total
8 D 1 mixed 0.75
9 B 1 mixed 0.75
10 A 1 mixed 0.75
11 A 0 0 0
....
The column first_column
has for each row A, B, C, and D. In the second column, there is a binary label denoting a group of values. All consecutive groupings of 1's are a unique "group", e.g. rows 1-3 is one group, rows 7-10 is another group.
I would like to "label" each one of these groups by being either "AB" (the group is only composed of A or B), "CD" (the group is only composed of C or D), or "mixed" (if there is a mixture, e.g. all B and one C). It would also be useful to know "how" mixed some of these groupings are with a percentage, i.e. the percentage of AB's out of total labels. If it is only A or B, the identity should be AB. If it is only C or D, the identity should be CD. It is a mixture of A,B,C, and/or D, then it is mixed. The percentage is (# of AB rows)/(# of total rows)
For the identity
column, I suspect one could use a ifelse()
statement, e.g.
dt$identity = ifelse( ((dt$second_column == 1) & (dt$first_column == "A") & (dt$first_column == "B") & (dt$first_column != "C") & (dt$first_column != "D")), "AB", 0)
dt$identity = ifelse( ((dt$second_column == 1) & (dt$identity != "AB") & (dt$first_column == "C") & (dt$first_column == "D") & (dt$first_column != "A") & (dt$first_column != "B")), "CD", 0)
But I'm getting into logical ifelse()
difficulties defining the "mixed case". I'm also entirely unsure how to measure the percentage of AB/all rows.