data.table v1.9.7+ has implemented non-equi joins and added a new function inrange
which uses this new feature and can achieve what you want
## Loading data
library(data.table) #v 1.9.7+
DT <- data.table(date = c('01.02.2016 10:00:00','01.02.2016 10:01:00',
'01.02.2016 10:02:00','01.02.2016 10:03:00',
'01.02.2016 10:04:00'),
column1 = c(10, 12, 13, 11, 17))
## Converting to POSIXct class
DT[, date := as.POSIXct(date, format = "%d.%m.%Y %H:%M:%S")]
## Validating that forder/bmerge kicks in
options(datatable.verbose = TRUE)
DT[date %inrange% as.POSIXct(c("2016-02-01 10:00:30", "2016-02-01 10:02:30"))]
# forderv(query) took ... 0 secs
# Starting bmerge ...done in 0 secs <~~~~~~~~ (Thanks to @Arun for fixing the bug)
# Generating final logical vector ... done in 0 secs
# date column1
# 1: 2016-02-01 10:01:00 12
# 2: 2016-02-01 10:02:00 13
Though, you should be aware that since data.table 1.9.4 secondary keys were implemented, meaning that for some variations of vector scans, after the first run a key is being added and from now on, even operations such as ==
and %in%
are using bmerge
. This doesn't seem to work on POSIXct
class but you can observe this behavior on you numeric column column1
## Running for first time
options(datatable.verbose = TRUE)
DT[column1 == 10]
# Creating new index 'column1'
# forder took 0 sec <~~~ forder kicks in, hence first time is a bit slow
# Starting bmerge ...done in 0 secs
# date column1
# 1: 2016-02-01 10:00:00 10
## Running for second time and on
DT[column1 == 10]
# Using existing index 'column1'
# Starting bmerge ...done in 0 secs <~~ bmerge kicks in from now on
# date column1
# 1: 2016-02-01 10:00:00 10
As Mentioned by @Jan, this is planned to be implemented for non-equi joins too starting from v2.0.0
Edit (26/8/2016):
As noted by @Arun, although inrange
is using a binary join, it needs to sort the whole vector first in order to check whether each value in x is in between any of the intervals provided in lower, upper.
In your case it is a little of an overhead because you are only comparing against two values, hence the recently rewritten in C between
function will suit you better
set.seed(123)
DT <- data.table(x = sample(5e8))
system.time(res1 <- DT[x > 1e3L & x < 1e5L])
# user system elapsed
# 10.23 1.22 11.45
system.time(res2 <- DT[x %inrange% c(1e3L, 1e5L)])
# forderv(query) took ... 29.09 secs
# Starting bmerge ...done in 0 secs
# Generating final logical vector ... done in 0.43 secs
# user system elapsed
# 29.28 0.70 30.06
system.time(res3 <- DT[x %between% c(1e3L, 1e5L)])
# user system elapsed
# 2.01 2.60 0.84
So as you can see, although bmerge
is almost instant, the sorting takes a lot of time. While between
is the fastest because it doesn't need to convert x
to a logical vector twice. Heck, between
is so fast that elapsed
is somehow smaller that user
+ system
Though, if your data is already sorted, then inrange
catches up pretty well
setorder(DT, x)
system.time(res1 <- DT[x > 1e3L & x < 1e5L])
# user system elapsed
# 10.41 1.02 11.45
system.time(res2 <- DT[x %inrange% c(1e3L, 1e5L)])
# forderv(query) took ... 2.17 secs
# Starting bmerge ...done in 0 secs
# Generating final logical vector ... done in 0.44 secs
# user system elapsed
# 2.47 0.71 3.20
system.time(res3 <- DT[x %between% c(1e3L, 1e5L)])
# user system elapsed
# 2.30 2.62 0.88
"01.02.2016 10:00:30" < "21.02.2015 10:02:30"
returns, for example – talat