5
votes

Say I have this dataset

library(data.table)
test <- data.table(X = rep(1, 3), Y = rep("a", 3))

which gives us

test
#   X Y
#1: 1 a
#2: 1 a
#3: 1 a

I'm wondering why

test[, X[Y == "a"], keyby = .(X)]

gives

#   X V1
#1: 1  1
#2: 1 NA
#3: 1 NA

Thank you in advance for your answers!

3
Did you mean to do test[Y == 'a', .SD, keyby = .(X)] or test[, .SD[Y == "a"], keyby = .(X)]? - akrun
Not sure why you want to use the grouping column to subset, because the grouping column outputs a single element while the other, Y == 'a', returns 3, and thus the result is filled with NA (unless you replicate the X) - akrun
It's standard R behavior for out-of-bounds indexing. See R Intro 3.4.1 Indexing by vectors: "If i is positive and exceeds length(x) then the corresponding selection is NA", together with the fact that inside each group, the grouping variable is of length 1 (see FAQ 2.10) - Henrik
Indeed. The length of the logical index (i) is 3. The length of the grouping variable inside each group is 1 (x, the vector you try to index; again, see FAQ): "If i is positive (yes, here i is 3) and exceeds length(x) (yes it does, length of index vector i is 3, length of vector to be indexed x is 1) then the corresponding selection is NA" - Henrik
Side-note: thanks for posting such a small, illustrative toy data set! - Henrik

3 Answers

2
votes

If you run X and Y=="a" separately

> test[, X, keyby = .(X)]
   X X
1: 1 1

> test[, Y == "a", keyby = .(X)]
   X   V1
1: 1 TRUE
2: 1 TRUE
3: 1 TRUE

you will see that the first one gives the numeric value 1 of length 1, and the second one gives the logical values TRUE of length 3.

Since the lengths don't match for subsetting, the extra positions are filled with NA, e.g.,

> 1[rep(TRUE,3)]
[1]  1 NA NA
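
To check the two lengths directly rather than reading them off the printed output, here is a minimal sketch that measures them inside the group. The names lens, len_X and len_idx are made up for illustration and are not part of the question.

```r
library(data.table)

test <- data.table(X = rep(1, 3), Y = rep("a", 3))

# Inside each group, the grouping column X is a length-1 vector,
# while Y == "a" has one element per row of the group.
lens <- test[, .(len_X = length(X), len_idx = length(Y == "a")), keyby = .(X)]
print(lens)
```

With the toy data above, len_X should be 1 and len_idx should be 3, which is the mismatch that leads to the NA fill.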
1
votes

It returns 2 in uniqueN because there are two distinct values: the 'X' grouping value 1 and the NA that fills the remaining positions. We could use na.rm = TRUE in uniqueN

test[, uniqueN(X[Y == "a"],  na.rm = TRUE), keyby = .(X)]
#   X V1
#1: 1  1

As mentioned in @ThomasIsCoding's post, it is the mismatch in length between the logical vector and the grouping variable (which has length 1) that causes the additional TRUE positions to be filled with NA. An option would be to replicate

test[, rep(X, .N)[Y == "a"], keyby = .(X)]
#   X V1
#1: 1  1
#2: 1  1
#3: 1  1
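
The count of 2 mentioned at the start of this answer can be reproduced outside of any grouping. This is only a sketch: the vector v is written out by hand to mirror what X[Y == "a"] evaluates to in the single group, and the names v, n_all and n_no_na are illustrative.

```r
library(data.table)

# Hand-written stand-in for X[Y == "a"] inside the group: one value plus NA fill
v <- c(1, NA, NA)

n_all   <- uniqueN(v)                # NA is counted as a distinct value
n_no_na <- uniqueN(v, na.rm = TRUE)  # NA is dropped before counting

print(c(n_all, n_no_na))
```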
1
votes

Well, it's complicated, in a way.

It has to do with what X is inside a grouping.

Consider these variations:

description    expression
Yours          test[, X[Y == "a"], keyby=.(X) ]
X only         test[, X, keyby=.(X) ]
Y=="a" only    test[, Y == "a", keyby=.(X) ]

X only gives:


> test[, X, keyby=.(X) ]
   X X
1: 1 1

This is what 'X' is inside your grouping. Only that one value.

The third expression:


> test[, Y == "a", keyby=.(X) ]
   X   V1
1: 1 TRUE
2: 1 TRUE
3: 1 TRUE

There you see what Y == "a" looks like inside your grouping.

If you combine these, to do: X[ Y == "a" ] inside your grouping, you effectively do:


X <- 1
X[ c(TRUE,TRUE,TRUE) ]

X has only one value but is asked to return its first, second and third values, so it gives you its one value and two NAs, which is what you see.
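
As a small base R sketch of that last point, an all-TRUE logical index of length 3 selects the same elements as asking for positions 1 to 3. The names x, by_logical and by_position are illustrative only.

```r
x <- 1

# A logical index longer than x: x is conceptually extended with NA
by_logical <- x[c(TRUE, TRUE, TRUE)]

# Asking explicitly for positions 1, 2 and 3 of a length-1 vector
by_position <- x[1:3]

print(by_logical)
print(identical(by_logical, by_position))
```

Both forms return the single value followed by two NAs, so the data.table result in the question is plain vector indexing rather than anything specific to grouping.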