Usually, when you subset on a key column to take advantage of fast binary search based subset, you'd do:
DT[J(values)] # assuming subset here is on the first key column.
# (or)
DT[.(values)] # idem
Both .
and J
here do exactly the same. When your key column is of type character
, since you also have to quote the character value, data.table
also allows for a join if possible without the J
or .
, for convenience. That is,
DT["a"] # subset on the first key column if one exists
# (or)
DT[J("a")] # idem
# (or)
DT[.("a")] # idem
This facility is just on character vectors. It's possible because you can't subset a data.table
using character vector in i
by any other way. So, it's easy to tell that you're wanting a join. But if you provide DT[2]
, 2
here being numeric
, data.table
can't really tell if you're expecting a join or a normal row-subset. That's why it's just for characters.
Now, DT[J(.)]
will be fast because, when the key is set, it's already sorted and therefore we can subset using fast binary search. However, the case DT[x < .]
uses normal vector scan approach. That is, it'll check all the values of x
for the value a
, even if the values are sorted by x
. Therefore, the second one will be slower than the first.
There are feature requests to enable binary search based subset on ranges. You've have a look here. Once it's implemented, these things will automatically get faster. We've not gotten to it just yet.
HTH.
PS: Note that you're comparing DT["2"]
- which is a character key column based binary search subset against DT[key < 2]
where key
is numeric here. They're not the same. The equivalent (as I explained above) is DT[J(2)]
.
Also note that they are not equivalent operations. DT[J(2)]
looks only for key column that matches 2
in DT, where as DT[key < 2]
looks for all values in the range [min[key], 2)
.
DT[2]
is NOT the same asDT[key==2]
! And if you really set the key, your second version should be extremely fast. - shadowDT[2]
does a binary search(fast), whileDT[key==2]
does a vector scan(slow). However I get 0.03 seconds forDT[2]
and 0.5 seconds forDT[key<2]
. Is it possible to have a fastDT[key<2]
? - misha_dodicDT[2]
gives second row not the row wherekey == 2
- CHP