14
votes

I am trying to do some left-join merges with data.tables. The package description quote that

In all joins the names of the columns are irrelevant; the columns of x's key are joined to in order

I understand that I can use .data.table[ and data.table:::merge.data.table

What I would like is : merge X and Y specifying the keys (like by.x and by.y in base merge, ->why taking this away ?)

Let's suppose I have

DT = data.table(x=rep(c("a","b","c"),each=3),y=c(1,3,6),v=1:9,key="x,y,v")
DT1 = data.frame(x1=c("aa","bb","cc"),y1=c(1,3,6),v1=1:3,key="x1,y1,v1")

and I would like this output:

#data.table:::merge is masking I don't know how to call the base version of merge anymore
R) {base::merge}(DT,DT1,by.x="y",by.y="y1") 
y x v x1 v1
1 1 a 1 aa  1
2 1 c 7 aa  1
3 1 b 4 aa  1
4 3 a 2 bb  2
5 3 b 5 bb  2
6 3 c 8 bb  2
7 6 b 6 cc  3
8 6 a 3 cc  3
9 6 c 9 cc  3

I am very happy to use [ or data.table:::merge but I would like an option that do not modify DT or DT1 (like changing the column names and calling merge and changing it back)

3
merge.data.table is a method for the S3 generic base function merge. To call the base merge, merge.data.frame(DT,DT1,by.x="y",by.y="y1") should work. But see my answer too.Matt Dowle
Btw, I just noticed your two attempts to post to datatable-help from Nabble. They haven't got through (see yellow band at the top in Nabble) because you need to subscribe to datatable-help first. Nabble asks you "are you a member of the list you're trying to post to" at the point of posting. It's easy and automatic to join. It's just a spam prevention measure.Matt Dowle
I know this is old, but shouldn't it be possible to get around this by renaming one of the columns in the data.table to match the other data.table? I've tried this using setnames and come up with an error, but I don't see why it shouldn't work.willwest

3 Answers

9
votes

Update: Since data.table v1.9.6 (released September 19, 2015), merge.data.table() does accept and nicely handles arguments by.x= and by.y=. Here's an updated link to the FR (now closed) referenced below.


Yes this is a feature request not yet implemented :

FR#2033 Add by.x and by.y to merge.data.table

There isn't anything preventing it. Just something that wasn't done. I very rarely need merge and was slow to realise its usefulness more generally. We've made good progress in bringing merge performance as fast as X[Y], and this feature request is at the highest priority. If you'd like it more quickly you are more than welcome to add those arguments to merge.data.table and commit the change yourself. We try to keep source code short and together in one function/file, so by looking at merge.data.table source hopefully you can follow it and see what needs to be done.

5
votes

The arguments by.x and by.y are now available in the development version of data.table. See here. Use devtools::install_github("Rdatatable/data.table", build_vignettes = FALSE) to install the development version of data.table.

4
votes

You can't because the by columns must be in the intersection of colnames(DT) and colnames(DT1)

 if (!all(by %in% intersect(colnames(x), colnames(y)))) {
       stop("Elements listed in `by` must be valid column names in x and y")
   }

Here using setnames , which which does not copy and is very fast

setnames(DT1,'y1','y')
> merge(DT,DT1)
   y x v x1 v1
1: 1 a 1 aa  1
2: 1 b 4 aa  1
3: 1 c 7 aa  1
4: 3 a 2 bb  2
5: 3 b 5 bb  2
6: 3 c 8 bb  2
7: 6 a 3 cc  3
8: 6 b 6 cc  3
9: 6 c 9 cc  3

EDIT update with data.table version data.table 1.9.4

you should set the by parameter otherwise you get an error:

Error in merge.data.table(DT, as.data.table(DT1)) : 
  Elements listed in `by` must be valid column names in x and y

You should do something like :

merge(DT,DT1,by="y")