Frequency of two columns counting NAs in one column as zero frequency

Question

Part 1 I have the following data table. I want to make a new column which contains the number of occurrences of each id where there is any style value, except for NA. The main problem is that I don't know how to deal with the NA. Currently, when NA is present I get a frequency of 1.

id    style   
1     A    
1     A    
2     A    
2     B    
3     NA    
4     A    
4     C   
5     NA

I tried using the following, but it still counts NA values

dt[, allele_count := .N, by = list(pat_id, style)]

The desired data table would be as follows:

id    style   count
1     A       2
1     A       2
2     A       2
2     B       2
3     NA      0
4     A       4
4     B       4
4     B       4
4     C       4
5     NA      0

Part2 I would also like to be able to add another column which would have the number of occurrences of each id with a certain style value.

id    style   count2
1     A       2
1     A       2
2     A       1
2     B       1
3     NA      0
4     A       1
4     B       2
4     B       2
4     C       1
5     NA      0

Bonus Question: Instead of looking at how many times an id occurs with a given style value as in Part2 , How can you calculate the number of different style values for each id, as follows.

id    style   count3
1     A       1
1     A       1
2     A       2
2     B       2
3     NA      0
4     A       3
4     B       3
4     B       3
4     C       3
5     NA      0

No sure about the first count... what if, for instance, there's another id = 3 with style B, can you add the expected result with this example ? (also check your initial data.table since it doesn't have four id=4...) — digEmAll

Rich Scriven Rich Scriven · Accepted Answer · 2015-10-18T16:58:17

Here's one possibility. Basically we use a row subset to assign the new columns, then replace the NA values in all three new columns with zero at the end.

nna <- !is.na(dt$style) ## so we don't have to call it four times
dt[nna, count := .N, by = id][nna, count2 := .N, by = .(id, style)][
    nna, count3 := uniqueN(style), by = id][!nna, names(dt)[3:5] := 0L]

which results in

   id style count count2 count3
1:  1     A     2      2      1
2:  1     A     2      2      1
3:  2     A     2      1      2
4:  2     B     2      1      2
5:  3    NA     0      0      0
6:  4     A     2      1      2
7:  4     C     2      1      2
8:  5    NA     0      0      0

Or you can simplify this down to the following, then reorder the columns if necessary.

dt[nna, c("count", "count3") := .(.N, uniqueN(style)), by = id][
    nna, count2 := .N, by = .(id, style)][!nna, names(dt)[3:5] := 0L]

Note that this method is very similar to the other posted answer. I am not sure which of the two is the preferred method, row subset or if() statement.

Frequency of two columns counting NAs in one column as zero frequency

2 Answers