0
votes

I have a data.table in which I have records belonging to multiple groupings. I want to count the number of records that fall into the same group for two variables, where the grouping variables may include some NAs.

Example data below:

library(data.table)

mydt <- data.table(id = c(1,2,3,4,5,6), 
                   travel = c("no travel", "morocco", "algeria", 
                              "morocco", "morocco", NA), 
                   cluster = c(1,1,1,2,2,2))

> mydt
   id    travel cluster
1:  1 no travel       1
2:  2   morocco       1
3:  3   algeria       1
4:  4   morocco       2
5:  5   morocco       2
6:  6      <NA>       2


In the above example I want to calculate how many people travelled to each destination by cluster.

Initially I was doing this using the .N notation, as below:

mydt[, ndest1 := as.double(.N), by = c("cluster", "travel")]

> mydt
   id    travel cluster ndest1
1:  1 no travel       1     1
2:  2   morocco       1     1
3:  3   algeria       1     1
4:  4   morocco       2     2
5:  5   morocco       2     2
6:  6      <NA>       2     1

However, NAs are counted as a value - this doesn't work well for my purposes since I later want to identify which destination within each cluster the most people travelled to (morocco in cluster 2 above) using max(...) and if there are a lot of NAs in a given cluster, 'NA' will incorrectly be flagged as the most popular destination.

I then tried using sum() instead, as this is intuitive and also allows me to exclude NAs:

mydt[, ndest2 := sum(!is.na(travel)), by = c("cluster", "travel")]

> mydt
   id    travel cluster ndest1 ndest2
1:  1 no travel       1      1      1
2:  2   morocco       1      1      1
3:  3   algeria       1      1      1
4:  4   morocco       2      2      1
5:  5   morocco       2      2      1
6:  6      <NA>       2      1      0

This gives incorrect results - after a bit of further testing, it seems to be because I have used the same variable for the logic statement within sum(...) as one of the grouping variables in the by statement.

When I use a different variable I get the desired result except that I am not able to exclude NAs this way:

mydt[, ndest3 := sum(!is.na(id)), by = c("cluster", "travel")]

> mydt
   id    travel cluster ndest1 ndest2 ndest3
1:  1 no travel       1      1      1      1
2:  2   morocco       1      1      1      1
3:  3   algeria       1      1      1      1
4:  4   morocco       2      2      1      2
5:  5   morocco       2      2      1      2
6:  6      <NA>       2      1      0      1

This leads me to two questions:

  1. In a data.table conditional count, how do I exclude NAs?
  2. Why can't the same variable be used in the sum logic statemtent and as a grouping variable after by?

Any insights would be much appreciated.

1

1 Answers

2
votes

You can exclude NAs in i

mydt[!is.na(travel), ndest1 := .N, by = .(travel, cluster)][]
#   id    travel cluster ndest1
#1:  1 no travel       1      1
#2:  2   morocco       1      1
#3:  3   algeria       1      1
#4:  4   morocco       2      2
#5:  5   morocco       2      2
#6:  6      <NA>       2     NA