1
votes

To come up with a column of counts for each level (or combination of levels) for categorical variables is data.table syntax can be handled with something like:

#setting up the data so it's pasteable
df <- data.table(var1 = c('dog','cat','dog','cat','dog','dog','dog'),
                 var2 = c(1,5,90,95,91,110,8),
                 var3 = c('lamp','lamp','lamp','table','table','table','table'))

#adding a count column for var1
df[, var1count := .N, by = .(var1)]

#adding a count of each combo of var1 and var3
df[, var1and3comb := .N, by = .(var1,var3)]

I am curious as to how I could instead produce a count column that counts the number of records with a value that is within +- 5 from each value of var2.

In my non-functioning attempt at this,

df[, var2withinrange := .N, by = .(between((var2-5),(var2+5),var2))]

I get a column with the total number of records as opposed to the desired result. I'd be hoping for the first row to hold a value of 2, since the 1 and 5 fall into that range. Row 2 should have a value of 3, since the 1, 5, and 8 all fall into that range for the 5, and so on.

Any help on coming up with a solution is much appreciated. Ideally in data.table code!

2

2 Answers

2
votes

A solution with :

df[, var2withinrange := df[.(var2min = var2 - 5, var2plus = var2 + 5)
                           , on = .(var2 >= var2min, var2 <= var2plus)
                           , .N
                           , by = .EACHI][, N]][]

which gives:

> df
   var1 var2  var3 var2withinrange
1:  dog    1  lamp               2
2:  cat    5  lamp               3
3:  dog   90  lamp               3
4:  cat   95 table               3
5:  dog   91 table               3
6:  dog  110 table               1
7:  dog    8 table               2
0
votes

The problem with your version is that it aggregates the boolean values. Every line will produce a FALSE as every value - 5 lies outside of its value + 5 and its value. If you would change it to

between(var2, var2 - 5, var2 + 5)

you would get TRUE back but still a 7 in every column because of the vectorized version.

Your problem can be solved with sapply which takes values by value and compares it to every value in the columns vectorized. This works but it is not mainly data.table code.

df$var2withinrange = sapply(df$var2, function(x){ sum(between(x, df$var2 - 5, df$var2 + 5)) })