53
votes

note: this question and the following answers refer to data.table versions < 1.5.3; v. 1.5.3 was released in Feb 2011 to resolve this issue. see more recent treatment (03-2012): Translating SQL joins on foreign keys to R data.table syntax


I've been digging through the documentation for the data.table package (a replacement for data.frame that's much more efficient for certain operations), including Josh Reich's presentation on SQL and data.table at the NYC R Meetup (pdf), but can't figure this totally trivial operation out.

> x <- DT(a=1:3, b=2:4, key='a')
> x
     a b
[1,] 1 2
[2,] 2 3
[3,] 3 4
> y <- DT(a=1:3, c=c('a','b','c'), key='a')
> y
     a c
[1,] 1 a
[2,] 2 b
[3,] 3 c
> x[y]
     a b
[1,] 1 2
[2,] 2 3
[3,] 3 4
> merge(x,y)
  a b c
1 1 2 a
2 2 3 b
3 3 4 c

The docs say "When [the first argument] is itself a data.table, a join is invoked similar to base::merge but uses binary search on the sorted key." Clearly this is not the case. Can I get the other columns from y into the result of x[y] with data.tables? It seems like it's just taking the rows of x where the key matches the key of y, but ignoring the rest of y entirely...

4
This was resolved by v1.5.3 released to CRAN in Feb 2011. Please see it's NEWS, new ?data.table and corrected FAQ.Matt Dowle

4 Answers

29
votes

You are quoting the wrong part of documentation. If you have a look at the doc of [.data.table you will read:

When i is a data.table, x must have a key, meaning join i to x and return the rows in x that match. An equi-join is performed between each column in i to each column in x’s key in order. This is similar to base R functionality of sub- setting a matrix by a 2-column matrix, and in higher dimensions subsetting an n-dimensional array by an n-column matrix

I admit the description of the package (the part you quoted) is somewhat confusing, because it seems to say that the "["-operation can be used instead of merge. But I think what it says is: if x and y are both data.tables we use a join on an index (which is invoked like merge) instead of binary search.


One more thing:

The data.table library I installed via install.packages was missing the merge.data.table method, so using merge would call merge.data.frame. After installing the package from R-Forge R used the faster merge.data.table method.

You can check if you have the merge.data.table method by checking the output of:

methods(generic.function="merge")

EDIT [Answer no longer valid]: This answer refers to data.table version 1.3. In version 1.5.3 the behaviour of data.table changed and x[y] returns the expected results. Thank you Matthew Dowle, author of data.table, for pointing this out in the comments.

15
votes

Thanks for the answers. I missed this thread when it was originally posted. data.table has moved on since February. 1.4.1 was released to CRAN a while ago and 1.5 is out soon. For example the DT() alias has been replaced with list(); as a primitive its much faster, and data.table now inherits from data.frame so it works with packages that only accept data.frame such as ggplot and lattice, without any conversion required (faster and more convenient).

Is it possible to subscribe to the data.table tag so I get an email when someone posts a question with that tag? The datatable-help list has grown to about 30-40 messages a month, but I'm happy to answer here too if I can get some kind of notification.

Matthew

13
votes

I think using the base::merge function is not needed, as using data.table joins can be a lot faster. E.g. see the following. I make x and y data.tables with 3-3 columns:

x <- data.table( foo = 1:5, a=20:24, zoo = 5:1 )
y <- data.table( foo = 1:5, b=30:34, boo = 10:14)
setkey(x, foo)
setkey(y, foo)

And merge both with base:merge and data.table joins to see the speed of executions:

system.time(merge(x,y))
##    user  system elapsed 
##   0.027   0.000   0.023 

system.time(x[,list(y,x)])
##    user  system elapsed 
##   0.003   0.000   0.006 

The results are not identical, as the latter has one extra column:

merge(x,y)
##      foo  a zoo  b boo
## [1,]   1 20   5 30  10
## [2,]   2 21   4 31  11
## [3,]   3 22   3 32  12
## [4,]   4 23   2 33  13
## [5,]   5 24   1 34  14

x[,list(x,y)]
##      foo  a zoo foo.1  b boo
## [1,]   1 20   5     1 30  10
## [2,]   2 21   4     2 31  11
## [3,]   3 22   3     3 32  12
## [4,]   4 23   2     4 33  13
## [5,]   5 24   1     5 34  14

Which could not make a big trouble :)

3
votes

I think that f3lix is correct and that the documentation is a little misleading. The benefit is in doing a fast join to subset the data. You still ultimately need to use the merge function afterwards as in your above example.

You will see in Josh's presentation on using data.table that this is how his example runs. He first subsets one of the data.tables, then does a merge:

library(data.table)
sdt <- DT(series, key='series_id')
ddt <- DT(data, key='series_id')
u <- sdt[ grepl('^[A-Z]{2}URN', fred_id) & !grepl('DSURN', fred_id) ]
d <- ddt[ u, DT(min=min(value)), by='series_id', mult='all']
data <- merge(d,series)[,c('title','min','mean','max')]