1
votes

Part 1 I have the following data table. I want to make a new column which contains the number of occurrences of each id where there is any style value, except for NA. The main problem is that I don't know how to deal with the NA. Currently, when NA is present I get a frequency of 1.

id    style   
1     A    
1     A    
2     A    
2     B    
3     NA    
4     A    
4     C   
5     NA

I tried using the following, but it still counts NA values

dt[, allele_count := .N, by = list(pat_id, style)]

The desired data table would be as follows:

id    style   count
1     A       2
1     A       2
2     A       2
2     B       2
3     NA      0
4     A       4
4     B       4
4     B       4
4     C       4
5     NA      0

Part2 I would also like to be able to add another column which would have the number of occurrences of each id with a certain style value.

id    style   count2
1     A       2
1     A       2
2     A       1
2     B       1
3     NA      0
4     A       1
4     B       2
4     B       2
4     C       1
5     NA      0

Bonus Question: Instead of looking at how many times an id occurs with a given style value as in Part2 , How can you calculate the number of different style values for each id, as follows.

id    style   count3
1     A       1
1     A       1
2     A       2
2     B       2
3     NA      0
4     A       3
4     B       3
4     B       3
4     C       3
5     NA      0
2
No sure about the first count... what if, for instance, there's another id = 3 with style B, can you add the expected result with this example ? (also check your initial data.table since it doesn't have four id=4...)digEmAll
The expected output does not match the input given.Pierre L

2 Answers

3
votes

Here's one possibility. Basically we use a row subset to assign the new columns, then replace the NA values in all three new columns with zero at the end.

nna <- !is.na(dt$style) ## so we don't have to call it four times
dt[nna, count := .N, by = id][nna, count2 := .N, by = .(id, style)][
    nna, count3 := uniqueN(style), by = id][!nna, names(dt)[3:5] := 0L]

which results in

   id style count count2 count3
1:  1     A     2      2      1
2:  1     A     2      2      1
3:  2     A     2      1      2
4:  2     B     2      1      2
5:  3    NA     0      0      0
6:  4     A     2      1      2
7:  4     C     2      1      2
8:  5    NA     0      0      0

Or you can simplify this down to the following, then reorder the columns if necessary.

dt[nna, c("count", "count3") := .(.N, uniqueN(style)), by = id][
    nna, count2 := .N, by = .(id, style)][!nna, names(dt)[3:5] := 0L]

Note that this method is very similar to the other posted answer. I am not sure which of the two is the preferred method, row subset or if() statement.

2
votes

This is a possible solution :

library(data.table)

dt <- data.table(id=c(1,1,2,2,3,4,4,5),
                 style=c('A','A','A','B',NA,'A','C',NA))

# count = number of ids having ALL styles defined
dt[, count := if(any(is.na(style))) 0L else .N, by = id]
# count2 = number of id-style occurrences (0 if style = NA)
dt[, count2 := if(is.na(style)) 0L else .N, by = .(id, style)]


> dt
   id style count count2
1:  1     A     2      2
2:  1     A     2      2
3:  2     A     2      1
4:  2     B     2      1
5:  3    NA     0      0
6:  4     A     2      1
7:  4     C     2      1
8:  5    NA     0      0

BONUS:

dt[, count3 := uniqueN(na.omit(style)), by = id]

> dt
   id style count count2 count3
1:  1     A     2      2      1
2:  1     A     2      2      1
3:  2     A     2      1      2
4:  2     B     2      1      2
5:  3    NA     0      0      0
6:  4     A     2      1      2
7:  4     C     2      1      2
8:  5    NA     0      0      0