4
votes

I'm using data.table for fast subsetting. However whan I subset not based on keys equal to a value but smaller than , it takes a lot of time. For example:

DT["2"]

is fast, while

DT[key<2]

is slow.

I assume the first is a binary search and the second a vector scan, but how to do the second in a fast way?

Thanks for answers.

1
DT[2] is NOT the same as DT[key==2]! And if you really set the key, your second version should be extremely fast. - shadow
I know that DT[2] does a binary search(fast), while DT[key==2] does a vector scan(slow). However I get 0.03 seconds for DT[2] and 0.5 seconds for DT[key<2]. Is it possible to have a fast DT[key<2]? - misha_dodic
You need to give more information about your data. How big is the data. Have you set key or not? Btw - DT[2] gives second row not the row where key == 2 - CHP

1 Answers

4
votes

Usually, when you subset on a key column to take advantage of fast binary search based subset, you'd do:

DT[J(values)] # assuming subset here is on the first key column.
# (or)
DT[.(values)] # idem

Both . and J here do exactly the same. When your key column is of type character, since you also have to quote the character value, data.table also allows for a join if possible without the J or ., for convenience. That is,

DT["a"]       # subset on the first key column if one exists
# (or)
DT[J("a")]    # idem
# (or)
DT[.("a")]    # idem

This facility is just on character vectors. It's possible because you can't subset a data.table using character vector in i by any other way. So, it's easy to tell that you're wanting a join. But if you provide DT[2], 2 here being numeric, data.table can't really tell if you're expecting a join or a normal row-subset. That's why it's just for characters.

Now, DT[J(.)] will be fast because, when the key is set, it's already sorted and therefore we can subset using fast binary search. However, the case DT[x < .] uses normal vector scan approach. That is, it'll check all the values of x for the value a, even if the values are sorted by x. Therefore, the second one will be slower than the first.

There are feature requests to enable binary search based subset on ranges. You've have a look here. Once it's implemented, these things will automatically get faster. We've not gotten to it just yet.

HTH.

PS: Note that you're comparing DT["2"] - which is a character key column based binary search subset against DT[key < 2] where key is numeric here. They're not the same. The equivalent (as I explained above) is DT[J(2)].

Also note that they are not equivalent operations. DT[J(2)] looks only for key column that matches 2 in DT, where as DT[key < 2] looks for all values in the range [min[key], 2).